← 返回
未分类 Key

MinerU Doc Parser

MinerU AI document parser — intelligent document extraction powered by AI. Parse PDFs, scanned documents, images, Word files, PowerPoint slides, and web page...
MinerU AI 文档解析器 — AI驱动的智能文档提取。解析 PDF、扫描文档、图片、Word 文件、PowerPoint 幻灯片和网页。
mineru-extract mineru-extract 来源
未分类 clawhub v0.2.1 2 版本 99744.2 Key: 需要
★ 9
Stars
📥 1,770
下载
💾 4
安装
2
版本
#latest

概述

AI Document Parsing with MinerU

Intelligent document extraction powered by AI — parse any document format into clean, structured output.

Installation

npm install -g mineru-open-api

Or via Go (macOS/Linux):

go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Verify installation

mineru-open-api version

Two extraction modes

flash-extractextract
---------
Token requiredNoYes (mineru-open-api auth)
SpeedFastNormal
Table recognitionNoYes
Formula recognitionNoYes
OCRYesYes
Output formatsMarkdown onlymd, html, latex, docx, json
Batch modeNoYes
Model selectionpipelineYes (vlm, pipeline, MinerU-HTML)
File size limit10 MBMuch higher
Page limit20 pagesMuch higher
Rate limitPer-IP per-minute capBased on API plan
Best forQuick start, small/simple docsLarge docs, tables, production

flash-extract limits

LimitValue
--------------
File sizeMax 10 MB
Page countMax 20 pages
Supported typesPDF, Images (png/jpg/jpeg/jp2/webp/gif/bmp), Docx, PPTx
IP rate limitPer-minute request caps (HTTP 429 when exceeded)

When any limit is exceeded, the agent should suggest switching to extract with a token (create at https://mineru.net/apiManage/token), which has significantly higher limits.

Core workflow

  1. Start fast (no token): mineru-open-api flash-extract for quick Markdown conversion
  2. Need more? Create token at https://mineru.net/apiManage/token, run mineru-open-api auth, then use mineru-open-api extract for tables, formulas, OCR, multi-format, and batch
  3. Web pages: mineru-open-api crawl to convert web content
  4. Check results: output goes to stdout (default) or -o directory

Authentication

Only required for extract and crawl. Not needed for flash-extract.

Configure your API token (create one at https://mineru.net/apiManage/token):

mineru-open-api auth                    # Interactive token setup
export MINERU_TOKEN="your-token"  # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.

Supported input formats

Formatflash-extractextract
--------:-::-:
PDF (.pdf)YesYes
Images (.png, .jpg, .jpeg, .jp2, .webp, .gif, .bmp)YesYes
Word (.docx)YesYes
Word (.doc)NoYes
PowerPoint (.pptx)YesYes
PowerPoint (.ppt)NoYes
HTML (.html)NoYes
URLs (remote files)YesYes

The crawl command accepts any HTTP/HTTPS URL and extracts web page content.

Commands

flash-extract — Quick extraction (no token needed)

Fast, token-free document extraction. Outputs Markdown only. No table recognition. Limited to 10 MB / 20 pages per file, with IP-based rate limiting.

mineru-open-api flash-extract report.pdf                     # Markdown to stdout
mineru-open-api flash-extract report.pdf -o ./out/           # Save to file
mineru-open-api flash-extract https://example.com/doc.pdf    # URL mode
mineru-open-api flash-extract report.pdf --language en       # Specify language
mineru-open-api flash-extract report.pdf --pages 1-10        # Page range

flash-extract flags

FlagShortDefaultDescription
-----------------------------------
--output-o_(stdout)_Output path (file or directory)
--languagechDocument language
--pages_(all)_Page range, e.g. 1-10
--timeout900Timeout in seconds

extract — Precision extraction (token required)

Convert PDFs, images, and other documents to Markdown or other formats. Supports table/formula recognition, OCR, multiple output formats, and batch mode.

mineru-open-api extract report.pdf                         # Markdown to stdout
mineru-open-api extract report.pdf -f html                 # HTML to stdout
mineru-open-api extract report.pdf -o ./out/               # Save to directory
mineru-open-api extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru-open-api extract *.pdf -o ./results/                # Batch extract
mineru-open-api extract --list files.txt -o ./results/     # Batch from file list
mineru-open-api extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru-open-api extract --stdin -o ./out/    # From stdin

extract flags

FlagShortDefaultDescription
-----------------------------------
--output-o_(stdout)_Output path (file or directory)
--format-fmdOutput formats: md, json, html, latex, docx (comma-separated)
--model_(auto)_Model: vlm, pipeline, html (see below)
--ocrfalseEnable OCR for scanned documents
--formulatrueEnable/disable formula recognition
--tabletrueEnable/disable table recognition
--languagechDocument language
--pages_(all)_Page range, e.g. 1-10,15
--timeout900/1800Timeout in seconds (single/batch)
--listRead input list from file (one path per line)
--concurrency0Batch concurrency (0 = server default)

Model comparison: vlm vs pipeline

vlmpipeline
---------
Parsing accuracyHigher — better at complex layouts, mixed contentStandard
Hallucination riskMay produce hallucinated text in rare casesNo hallucination — biggest advantage
Best forAcademic papers, complex tables, intricate layoutsGeneral documents where fidelity matters most

When the user values accuracy and the document has complex formatting, suggest --model vlm. When the user prioritizes reliability and no-hallucination guarantee, suggest --model pipeline (or omit --model to use auto).

crawl — Web page extraction (token required)

Fetch web pages and convert to Markdown.

mineru-open-api crawl https://example.com/article              # Markdown to stdout
mineru-open-api crawl https://example.com/article -f html      # HTML to stdout
mineru-open-api crawl https://example.com/article -o ./out/     # Save to file
mineru-open-api crawl url1 url2 -o ./pages/                     # Batch crawl
mineru-open-api crawl --list urls.txt -o ./pages/               # Batch from file list

crawl flags

FlagShortDefaultDescription
-----------------------------------
--output-o_(stdout)_Output path
--format-fmdOutput formats: md, json, html (comma-separated)
--timeout900/1800Timeout in seconds (single/batch)
--listRead URL list from file (one per line)
--stdin-listfalseRead URL list from stdin
--concurrency0Batch concurrency

auth — Authentication management

mineru-open-api auth              # Interactive token setup
mineru-open-api auth --verify     # Verify current token is valid
mineru-open-api auth --show       # Show current token source and masked value

Supported --language values

The --language flag accepts the following values (default: ch). Used by both flash-extract and extract. Values are organized by script/language family — each value covers all languages listed in its group.

Standalone language packs

ValueIncluded languages
--------------------------
chChinese, English, Chinese Traditional
ch_serverChinese, English, Chinese Traditional, Japanese
enEnglish
japanChinese, English, Chinese Traditional, Japanese
koreanKorean, English
chinese_chtChinese, English, Chinese Traditional, Japanese
taTamil, English
teTelugu, English
kaKannada
elGreek, English
thThai, English

Language family packs

ValueScript/FamilyIncluded languages
----------------------------------------
latinLatin scriptFrench, German, Italian, Spanish, Portuguese, Dutch, Swedish, and 40+ more
arabicArabic scriptArabic, Persian, Uyghur, Urdu, Pashto, Kurdish, and more
cyrillicCyrillic scriptRussian, Ukrainian, Bulgarian, Serbian, Kazakh, and 20+ more
east_slavicEast SlavicRussian, Belarusian, Ukrainian, English
devanagariDevanagari scriptHindi, Marathi, Nepali, Sanskrit, and more

Output behavior

  • No -o flag: result goes to stdout; status/progress messages go to stderr
  • With -o flag: result saved to file/directory; progress messages on stderr
  • Batch mode (extract/crawl only): requires -o to specify output directory
  • Binary formats (docx, extract only): cannot output to stdout, must use -o
  • Markdown output includes extracted images saved alongside the .md file

General rules

When using this skill on behalf of the user:

  • Quote file paths that contain spaces or special characters with double quotes in commands. Example: mineru-open-api extract "report 01.pdf", NOT mineru-open-api extract report 01.pdf.
  • Don't run commands blindly on errors — if the user asks "提取失败了怎么办", explain the exit code and troubleshooting steps instead of re-running the command.
  • Installation questions ("mineru 怎么安装") should be answered with the install instructions, not by running mineru-open-api extract.
  • DOCX as input is supported — if the user asks "这个 Word 文档能转 Markdown 吗", use mineru-open-api extract file.docx or mineru-open-api flash-extract file.docx. Note: .doc format is only supported by extract, not flash-extract.
  • Table extraction — tables are only recognized by extract (not flash-extract). If the user mentions tables, use extract.
  • For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.

Choosing between flash-extract and extract

The agent MUST follow this decision logic:

  1. Default to flash-extract when:
    • User has NOT configured a token (no ~/.mineru/config.yaml, no MINERU_TOKEN env)
    • User wants a quick/simple extraction without mentioning tables, formulas, OCR, or specific formats
    • File is under 10 MB and under 20 pages
    • User is trying the tool for the first time
  1. Use extract when:
    • User explicitly asks for table recognition, formula recognition, or OCR
    • User requests non-Markdown output formats (html, latex, docx, json)
    • User needs batch processing (multiple files)
    • File is over 10 MB or over 20 pages (exceeds flash-extract limits)
    • User has a token configured and wants precision-quality extraction
  1. If unsure, prefer flash-extract — it's faster and requires no setup, but check file size first.
  1. When the user does NOT specify an output path (-o), the agent MUST generate a default output directory to prevent file overwrites. Use:
~/MinerU-Skill/<name>_<hash>/

Naming rules:

  • : derived from the source, then sanitized for safe directory names.
  • For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.221862509.22186)
  • For local files: filename without extension (e.g. report.pdfreport)
  • Sanitization: replace spaces and shell-unsafe characters with _. Collapse consecutive _ into one. Keep alphanumeric, -, _, ., and CJK characters.
  • : first 6 characters of the MD5 hash of the full original source path or URL (before sanitization).
echo -n "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6
  1. When the user asks to upgrade or update this skill, the agent MUST re-install the CLI binary:
npm install -g mineru-open-api@latest

flash-extract limit handling

When flash-extract fails due to file limits or rate limiting, the agent MUST provide a clear explanation and suggest extract as the upgrade path.

Post-extraction friendly hints

After flash-extract completes successfully, the agent MUST append a brief hint:

> Tip: flash-extract 为快速免登录模式(限 10MB/20页,不含表格识别)。如需解析更大文件、表格/公式识别或多格式导出,请前往 https://mineru.net/apiManage/token 创建 Token,运行 mineru-open-api auth 配置后使用 mineru-open-api extract

Keep the hint to ONE short sentence. Do NOT repeat the hint if the user has already seen it in this session.

Exit codes

CodeMeaningRecovery
-------------------------
0Success
1General API or unknown errorCheck network connectivity; retry; use --verbose for details
2Invalid parameters / usage errorCheck command syntax and flag values
4File too large or page limit exceededFor flash-extract: file must be under 10 MB / 20 pages; switch to extract with token for higher limits. For extract: split the file or use --pages
5Extraction failedThe document may be corrupted or unsupported; try a different --model
6TimeoutIncrease with --timeout; large files may need 600+ seconds

Troubleshooting

  • "no API token found" (on extract/crawl): Run mineru-open-api auth or set MINERU_TOKEN env variable. Or use flash-extract which needs no token.
  • Timeout on large files: Increase with --timeout 1600 (seconds)
  • Batch fails partially: Check stderr for per-file status; succeeded files are still saved
  • Binary format to stdout: Use -o flag; docx cannot stream to stdout
  • Private deployment: Use --base-url https://your-server.com/api
  • Extraction quality is poor: Try mineru-open-api extract with --model vlm for complex layouts, or --ocr for scanned documents
  • Tables not extracted: flash-extract does NOT support tables. Use mineru-open-api extract with a token.
  • HTTP 429 on flash-extract: IP rate limit hit. Wait a few minutes or switch to mineru-open-api extract with token.

Reporting Issues

  • Skill issues: Open an issue at https://github.com/opendatalab/MinerU-Ecosystem/tree/main/cli
  • CLI issues: Open an issue at https://github.com/MinerU-Extract/mineru-ai

版本历史

共 2 个版本

  • v0.2.1 当前
    2026-05-03 02:52 安全 安全
  • v1.0.7
    2026-03-29 22:01

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

office-efficiency

Word / DOCX

ivangdavila
创建、检查和编辑 Microsoft Word 文档及 DOCX 文件,支持样式、编号、修订记录、表格、分节符及兼容性检查等功能。
★ 453 📥 150,728
office-efficiency

Excel / XLSX

ivangdavila
创建、检查和编辑 Microsoft Excel 工作簿及 XLSX 文件,支持可靠的公式、日期、类型、格式、重算及模板保留功能。
★ 376 📥 143,364
office-efficiency

Nano Pdf

steipete
使用nano-pdf CLI通过自然语言指令编辑PDF
★ 276 📥 115,637