Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:
python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr
Output directory layout:
- `combined.md` (all pages concatenated)
- `pages/page-000.md` (per-page markdown)
- `raw_response.json` (full OCR response)
- `images/` (decoded embedded images, if requested)
- `tables/` (separate tables, if requested)

- `file_id`.
- `document_url`.
- `table_format=inline` unless the user explicitly wants tables split out.
- `--include-image-base64` when the user needs figures/diagrams extracted.
- `--extract-header`/`--extract-footer` if header/footer noise hurts downstream search.
- `scripts/mistral_ocr_extract.py` to produce a deterministic on-disk artefact set.
- `document_annotation` in addition to page markdown.

Example:
```bash
python {baseDir}/scripts/mistral_ocr_extract.py \
--input invoice.pdf \
--out out/invoice \
--annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
--annotation-format json_object
```
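
Once a run like this finishes, the annotation and per-page files can be read back from the output directory. The following is a minimal sketch, not part of the bundled script. It assumes the output layout described earlier, and it assumes that `raw_response.json` stores the annotation under a top-level `document_annotation` key, either as a JSON-encoded string or as an already-decoded object. Neither detail is confirmed here, so adjust the lookup if your file differs. The helper names `load_annotation` and `list_page_files` are hypothetical.

```python
import json
from pathlib import Path


def load_annotation(out_dir):
    """Return the structured annotation from an OCR output directory, or None.

    Assumption: raw_response.json has a top-level "document_annotation" key
    holding either a JSON-encoded string or an already-decoded object.
    """
    raw_path = Path(out_dir) / "raw_response.json"
    response = json.loads(raw_path.read_text(encoding="utf-8"))
    annotation = response.get("document_annotation")
    if annotation is None:
        return None
    if isinstance(annotation, str):
        # The annotation may be stored as a JSON string; decode it.
        return json.loads(annotation)
    return annotation


def list_page_files(out_dir):
    """Return per-page Markdown files sorted by name (page-000.md, page-001.md, ...)."""
    return sorted((Path(out_dir) / "pages").glob("page-*.md"))
```

For the invoice example, `load_annotation("out/invoice")` would return a dictionary of the requested fields if the assumptions above hold, and `None` when no annotation is present.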
- `document_url`; upload instead.
- `table_format=html` for downstream parsing over brittle regex.
- `MISTRAL_API_KEY`: set it in the environment before running.
- `--pages`) or batch processing.
- `references/mistral_ocr_api.md`
- `references/output_mapping.md`
- `references/annotation_prompts.md`
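
To illustrate the preference for `table_format=html` over regex when tables feed downstream parsing, here is a minimal sketch that turns `<table>` elements into lists of rows using only the Python standard library. It assumes the tables arrive as plain `<table>`/`<tr>`/`<th>`/`<td>` markup without nesting; the exact HTML produced by the OCR service is not specified in this document. The names `TableExtractor` and `extract_tables` are hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html_text):
    """Return all tables found in html_text as nested lists of strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables
```

Feeding the contents of a page file to `extract_tables` yields one list per table, with header and data cells treated alike; distinguishing `<th>` from `<td>` is left out to keep the sketch short.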