
Mistral PDF OCR

Extracts text, tables, and images from PDFs (including scanned PDFs) using the Mistral OCR API. Use when the user asks to OCR a PDF or image, or to extract text from one.
tristanmanchester
Category: content creation · Source: clawhub · v1.0.0 · 1 version · 100000 · Key: required
★ 0 stars · 📥 825 downloads · 💾 159 installs · #latest

Overview

Mistral OCR PDF extraction

Quick start (default)

Run the bundled script to OCR a local PDF and write Markdown + JSON outputs:

python {baseDir}/scripts/mistral_ocr_extract.py --input path/to/file.pdf --out out/ocr

Output directory layout:

  • combined.md (all pages concatenated)
  • pages/page-000.md (per-page markdown)
  • raw_response.json (full OCR response)
  • images/ (decoded embedded images, if requested)
  • tables/ (separate tables, if requested)
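
The layout above can be read back into memory with a few lines of Python. This is a minimal sketch, not part of the bundled script: `load_ocr_output` is a hypothetical helper, and it assumes per-page files follow the `page-NNN.md` pattern shown above and that `images/` and `tables/` may be absent.

```python
import json
from pathlib import Path


def load_ocr_output(out_dir):
    """Load the on-disk artefacts from an OCR output directory (hypothetical helper)."""
    out = Path(out_dir)
    combined = (out / "combined.md").read_text(encoding="utf-8")

    # Sort by the numeric suffix so page-1000 would not sort before page-101.
    page_files = sorted(
        (out / "pages").glob("page-*.md"),
        key=lambda p: int(p.stem.split("-")[1]),
    )
    pages = [p.read_text(encoding="utf-8") for p in page_files]

    raw = json.loads((out / "raw_response.json").read_text(encoding="utf-8"))

    def list_optional(name):
        # images/ and tables/ are written only when requested.
        folder = out / name
        return sorted(p.name for p in folder.iterdir()) if folder.is_dir() else []

    return {
        "combined": combined,
        "pages": pages,
        "raw": raw,
        "images": list_optional("images"),
        "tables": list_optional("tables"),
    }
```

Calling `load_ocr_output("out/ocr")` after the quick-start command would return the combined Markdown, a list of per-page Markdown strings, and the parsed raw response.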

Workflow

  1. Pick input mode
    • Local PDF (most common): upload via the Files API, then OCR via file_id.
    • Public URL: OCR directly via document_url.
  2. Choose output fidelity (defaults are safe for RAG)
    • Keep table_format=inline unless the user explicitly wants tables split out.
    • Set --include-image-base64 when the user needs figures/diagrams extracted.
    • Use --extract-header/--extract-footer if header/footer noise hurts downstream search.
  3. Run OCR
    • Use scripts/mistral_ocr_extract.py to produce a deterministic on-disk artefact set.
  4. (Optional) Structured extraction from the whole document
    • If the user wants fields (invoice totals, contract parties, etc.), provide an annotation prompt.
    • The OCR API can return a document-level document_annotation in addition to page markdown.
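
The input-mode choice in step 1 can be sketched as a request-body builder. This is an illustration under stated assumptions, not the bundled script's code: `build_ocr_request` is a hypothetical helper, and the field names (`model`, `document`, `include_image_base64`, `pages`) and the `mistral-ocr-latest` model name reflect my reading of the public Mistral OCR API, so check them against references/mistral_ocr_api.md before relying on them.

```python
def build_ocr_request(source, *, is_file_id=False, include_image_base64=False,
                      pages=None, model="mistral-ocr-latest"):
    """Build a JSON body for an OCR request (hypothetical helper, assumed field names)."""
    if is_file_id:
        # Local PDFs are uploaded first; the upload returns an id used here.
        document = {"type": "file", "file_id": source}
    elif source.startswith(("http://", "https://")):
        # Only suitable for URLs that are reachable without authentication.
        document = {"type": "document_url", "document_url": source}
    else:
        raise ValueError("Local paths must be uploaded first; pass the resulting file_id.")

    payload = {
        "model": model,
        "document": document,
        "include_image_base64": include_image_base64,
    }
    if pages is not None:
        payload["pages"] = list(pages)
    return payload
```

Keeping the builder free of network calls makes the decision easy to unit-test; the actual upload and HTTP request are left to the script.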

Example:

```bash
python {baseDir}/scripts/mistral_ocr_extract.py \
  --input invoice.pdf \
  --out out/invoice \
  --annotation-prompt "Extract supplier_name, invoice_number, invoice_date (ISO-8601), currency, total_amount. Return JSON." \
  --annotation-format json_object
```
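
Once the run finishes, the annotation can be pulled out of the saved response. The sketch below assumes the response stores it under a top-level `document_annotation` key, possibly as a JSON-encoded string; `read_annotation` and `missing_fields` are hypothetical helpers, and the field names come from the prompt in the example above.

```python
import json


def read_annotation(raw_response):
    """Return the document-level annotation as a dict, or None if absent."""
    annotation = raw_response.get("document_annotation")
    if annotation is None:
        return None
    # Assumption: the API may return the annotation as a JSON string.
    if isinstance(annotation, str):
        return json.loads(annotation)
    return annotation


def missing_fields(annotation, required):
    """List the required fields that the annotation did not supply."""
    return [name for name in required if name not in (annotation or {})]


if __name__ == "__main__":
    with open("out/invoice/raw_response.json", encoding="utf-8") as fh:
        data = read_annotation(json.load(fh))
    required = ["supplier_name", "invoice_number", "invoice_date", "currency", "total_amount"]
    print(data, missing_fields(data, required))
```

Checking for missing fields before using the result guards against prompts the model only partially satisfied.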

Decision rules

  • If the PDF is local and not publicly accessible, upload it (the script does this automatically).
  • If the PDF URL is private or requires authentication, do not pass it as document_url; upload instead.
  • If table fidelity is critical, prefer table_format=html so downstream code can use an HTML parser instead of brittle regexes.
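
To illustrate the last rule, HTML tables can be parsed with the standard library rather than regexes. This is a minimal sketch that assumes tables arrive as plain `<table>/<tr>/<th>/<td>` markup without nesting; `TableParser` and `parse_tables` are hypothetical names, not part of the skill.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def parse_tables(html):
    """Return every non-nested table found in the given HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Nested tables and `rowspan`/`colspan` are deliberately ignored here; a fuller parser would be needed if the OCR output uses them.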

Common failure modes

  • Missing MISTRAL_API_KEY: set it in the environment before running.
  • URL OCR fails: the URL likely is not publicly accessible; upload the file.
  • Large files: upload supports large files, but very large PDFs may need page selection (--pages) or batch processing.
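
Two of these failure modes can be caught before any request is sent. The sketch below is illustrative only: `require_api_key` and `page_batches` are hypothetical helpers, and the zero-based page indices are an assumption (suggested by the `page-000.md` naming); the format that `--pages` actually accepts is not specified here.

```python
import os


def require_api_key(env=None):
    """Return MISTRAL_API_KEY or fail early with a clear message."""
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set; export it before running.")
    return key


def page_batches(total_pages, batch_size):
    """Split zero-based page indices into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(range(start, min(start + batch_size, total_pages)))
        for start in range(0, total_pages, batch_size)
    ]
```

Each batch could then be processed in its own run, so a very large PDF never has to go through in a single request.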

References

  • API + parameters: references/mistral_ocr_api.md
  • Output mapping rules (placeholders to extracted images/tables): references/output_mapping.md
  • Example annotation prompts for common document types: references/annotation_prompts.md

Version history

1 version in total

  • v1.0.0 (current)
    2026-03-29 10:23 · Safe · Safe

Security scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
