
MinerU precision extract: PDF, documents, images

MinerU precision extract — high-accuracy document extraction with full feature set. Convert PDFs, scanned documents, images, Word (DOC/DOCX), PowerPoint (PPT...
Source: mineru-extract
Uncategorized · clawhub · v0.2.1 · 2 versions · 100000 · Key: required
Stars: 0 · Downloads: 592 · Installs: 9 · #latest

Overview

Precision Document Extraction with mineru-open-api

Full-featured document extraction with table/formula recognition, OCR, multi-format output, batch processing, and web crawling.

Why use extract?

  • Table recognition — accurately extracts tables from PDFs and images
  • Formula recognition — preserves mathematical formulas as LaTeX
  • Multi-format output — Markdown, HTML, LaTeX, DOCX, JSON
  • Model selection — choose vlm for highest accuracy or pipeline for zero-hallucination
  • Batch processing — process hundreds of files in one command
  • Web crawling — convert web pages to structured Markdown
  • All file formats — PDF, images, DOC, DOCX, PPT, PPTX, HTML
  • Higher limits — much larger file size and page count than quick mode
  • 80+ languages — full language coverage across all script families

Installation

npm install -g mineru-open-api

Or via Go (macOS/Linux):

go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Verify installation

mineru-open-api version

Authentication

Create a token at https://mineru.net/apiManage/token, then configure:

mineru-open-api auth                         # Interactive token setup
export MINERU_TOKEN="your-token"             # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
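
The resolution order can be illustrated with a small shell helper. This is only a sketch of the documented precedence, not the CLI's own logic: the function name `token_source` is hypothetical, and it checks only whether the config file exists because the file's internal layout is not described here.

```shell
# Report which source a token would come from, following the documented
# order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
# Usage: token_source [flag_value] [config_path]
token_source() {
  flag_value="${1:-}"
  config_path="${2:-$HOME/.mineru/config.yaml}"
  if [ -n "$flag_value" ]; then
    echo "flag"      # an explicit --token value wins
  elif [ -n "${MINERU_TOKEN:-}" ]; then
    echo "env"       # otherwise the environment variable
  elif [ -f "$config_path" ]; then
    echo "config"    # otherwise the config file, if present (assumption: existence only)
  else
    echo "none"
  fi
}
```

For the source actually in effect, mineru-open-api auth --show reports it directly.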

Quick start

mineru-open-api extract report.pdf                         # Markdown to stdout
mineru-open-api extract report.pdf -o ./out/               # Save to directory
mineru-open-api extract report.pdf -f md,html,docx -o ./   # Multi-format
mineru-open-api extract report.pdf --model vlm -o ./out/   # High-accuracy mode
mineru-open-api extract *.pdf -o ./results/                # Batch extract
mineru-open-api crawl https://example.com/article          # Web page → Markdown

Supported input formats

| Format | Supported |
|--------|:-:|
| PDF (.pdf) | Yes |
| Images (.png, .jpg, .jpeg, .jp2, .webp, .gif, .bmp) | Yes |
| Word (.doc, .docx) | Yes |
| PowerPoint (.ppt, .pptx) | Yes |
| HTML (.html) | Yes |
| URLs (remote files) | Yes |

Commands

extract — Precision extraction

mineru-open-api extract <file-or-url> [...] [flags]

Examples

mineru-open-api extract report.pdf                         # Markdown to stdout
mineru-open-api extract report.pdf -f html                 # HTML to stdout
mineru-open-api extract report.pdf -o ./out/               # Save to directory
mineru-open-api extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru-open-api extract report.pdf -f latex -o ./out/      # LaTeX output
mineru-open-api extract report.pdf --model vlm -o ./out/   # High-accuracy mode
mineru-open-api extract report.pdf --ocr -o ./out/         # OCR for scanned docs
mineru-open-api extract report.pdf --language en -o ./out/ # Specify language
mineru-open-api extract report.pdf --pages "1-10" -o ./out/  # Page range
mineru-open-api extract *.pdf -o ./results/                # Batch extract
mineru-open-api extract --list files.txt -o ./results/     # Batch from file list
mineru-open-api extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru-open-api extract --stdin -o ./out/    # From stdin

extract flags

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output | -o | _(stdout)_ | Output path (file or directory) |
| --format | -f | md | Output formats: md, json, html, latex, docx (comma-separated) |
| --model |  | _(auto)_ | Model: vlm, pipeline, html (see below) |
| --ocr |  | false | Enable OCR for scanned documents |
| --formula |  | true | Enable/disable formula recognition |
| --table |  | true | Enable/disable table recognition |
| --language |  | ch | Document language |
| --pages |  | _(all)_ | Page range, e.g. 1-10,15 |
| --timeout |  | 900/1800 | Timeout in seconds (single/batch) |
| --list |  |  | Read input list from file (one path per line) |
| --concurrency |  | 0 | Batch concurrency (0 = server default) |
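
Since --list expects one path per line, a list file can be generated rather than typed by hand. The helper below is a sketch; `build_file_list` is a hypothetical name, and it assumes the inputs are PDFs under a single directory.

```shell
# Write every PDF under a directory to a list file, one path per line.
# Usage: build_file_list <source_dir> <list_file>
build_file_list() {
  src_dir="$1"
  list_file="$2"
  # sort keeps the order stable between runs
  find "$src_dir" -type f -name '*.pdf' | sort > "$list_file"
}

# Possible use with the CLI (not executed here):
#   build_file_list ./docs files.txt
#   mineru-open-api extract --list files.txt -o ./results/
```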

Model comparison: vlm vs pipeline

|  | vlm | pipeline |
|--|-----|----------|
| Parsing accuracy | Higher; better at complex layouts and mixed content | Standard |
| Hallucination risk | May produce hallucinated text in rare cases | No hallucination (biggest advantage) |
| Best for | Academic papers, complex tables, intricate layouts | General documents where fidelity matters most |

When the user values accuracy and the document has complex formatting, suggest --model vlm. When the user prioritizes reliability and a no-hallucination guarantee, suggest --model pipeline (or omit --model to use auto).

crawl — Web page extraction

Fetch web pages and convert to structured Markdown.

mineru-open-api crawl https://example.com/article              # Markdown to stdout
mineru-open-api crawl https://example.com/article -f html      # HTML to stdout
mineru-open-api crawl https://example.com/article -o ./out/    # Save to file
mineru-open-api crawl url1 url2 -o ./pages/                    # Batch crawl
mineru-open-api crawl --list urls.txt -o ./pages/              # Batch from file list

crawl flags

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --output | -o | _(stdout)_ | Output path |
| --format | -f | md | Output formats: md, json, html (comma-separated) |
| --timeout |  | 900/1800 | Timeout in seconds (single/batch) |
| --list |  |  | Read URL list from file (one per line) |
| --stdin-list |  | false | Read URL list from stdin |
| --concurrency |  | 0 | Batch concurrency |

auth — Authentication management

mineru-open-api auth              # Interactive token setup
mineru-open-api auth --verify     # Verify current token is valid
mineru-open-api auth --show       # Show current token source and masked value

Supported --language values

Values are organized by script/language family — each value covers all languages in its group.

Standalone language packs

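| Value | Included languages | Notes |
|-------|--------------------|-------|
| ch | Chinese, English, Chinese Traditional | Chinese and English (default) |
| ch_server | Chinese, English, Chinese Traditional, Japanese | Traditional Chinese, handwriting |
| en | English | English only |
| japan | Chinese, English, Chinese Traditional, Japanese | Primarily Japanese |
| korean | Korean, English | Korean |
| chinese_cht | Chinese, English, Chinese Traditional, Japanese | Primarily Traditional Chinese |
| ta | Tamil, English | Tamil |
| te | Telugu, English | Telugu |
| ka | Kannada | Kannada |
| el | Greek, English | Greek |
| th | Thai, English | Thai |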

Language family packs

| Value | Script/Family | Included languages |
|-------|---------------|--------------------|
| latin | Latin script | French, German, Afrikaans, Italian, Spanish, Bosnian, Portuguese, Czech, Welsh, Danish, Estonian, Irish, Croatian, Uzbek, Hungarian, Serbian (Latin), Indonesian, Occitan, Icelandic, Lithuanian, Maori, Malay, Dutch, Norwegian, Polish, Slovak, Slovenian, Albanian, Swedish, Swahili, Tagalog, Turkish, Latin, Azerbaijani, Kurdish, Latvian, Maltese, Pali, Romanian, Vietnamese, Finnish, Basque, Galician, Luxembourgish, Romansh, Catalan, Quechua |
| arabic | Arabic script | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, Sindhi, Balochi, English |
| cyrillic | Cyrillic script | Russian, Belarusian, Ukrainian, Serbian (Cyrillic), Bulgarian, Mongolian, Abkhazian, Adyghe, Kabardian, Avar, Dargin, Ingush, Chechen, Lak, Lezgin, Tabasaran, Kazakh, Kyrgyz, Tajik, Macedonian, Tatar, Chuvash, Bashkir, Malian, Moldovan, Udmurt, Komi, Ossetian, Buryat, Kalmyk, Tuvan, Sakha, Karakalpak, English |
| east_slavic | East Slavic | Russian, Belarusian, Ukrainian, English |
| devanagari | Devanagari script | Hindi, Marathi, Nepali, Bihari, Maithili, Angika, Bhojpuri, Magahi, Santali, Newari, Konkani, Sanskrit, Haryanvi, English |

Global flags

| Flag | Short | Description |
|------|-------|-------------|
| --token |  | API token (overrides env and config) |
| --base-url |  | API base URL (for private deployments) |
| --verbose | -v | Verbose mode, print HTTP details |

Output behavior

  • No -o flag: result goes to stdout; status/progress messages go to stderr
  • With -o flag: result saved to file/directory; progress messages on stderr
  • Batch mode (extract/crawl): requires -o to specify output directory
  • Binary formats (docx): cannot output to stdout, must use -o
  • Markdown output includes extracted images saved alongside the .md file
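
These rules, together with the one-text-format limit for stdout noted under Agent guidelines, can be checked before a command is built. The function below is a sketch under those stated rules; `can_stream_to_stdout` is a hypothetical helper, not part of the CLI, and it treats docx as the only binary format because that is the only one named here.

```shell
# Return 0 if the requested -f value can be streamed to stdout (no -o):
# exactly one format, and not the binary docx format.
# Usage: can_stream_to_stdout <formats>   e.g. "md" or "md,html"
can_stream_to_stdout() {
  formats="$1"
  case "$formats" in
    *,*)  return 1 ;;   # more than one format requested: needs -o
    docx) return 1 ;;   # binary format: needs -o
    *)    return 0 ;;   # a single text format (or the md default)
  esac
}

# Possible use (not executed here):
#   can_stream_to_stdout "md,docx" || echo "add -o <dir> to this command"
```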

Agent guidelines

When using this skill on behalf of the user:

  • Quote file paths that contain spaces or special characters with double quotes. Example: mineru-open-api extract "report 01.pdf".
  • Don't run commands blindly on errors — explain the exit code and troubleshooting steps.
  • Installation questions (e.g. "how do I install mineru?") should be answered with the install instructions above.
  • For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.
  • If the user hasn't authenticated yet, guide them to create a token at https://mineru.net/apiManage/token and run mineru-open-api auth.

Default output directory

When the user does NOT specify -o, generate a default output directory:

~/MinerU-Skill/<name>_<hash>/
  • <name>: derived from the source, then sanitized (replace spaces and shell-unsafe characters with _, collapse consecutive _).
      • For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.22186 → 2509.22186)
      • For local files: filename without extension (e.g. report.pdf → report)
  • <hash>: first 6 characters of the MD5 hash of the full original source.
echo -n "source" | md5sum | cut -c1-6   # Linux
echo -n "source" | md5 | cut -c1-6      # macOS

When the user specifies -o: use the user's path as-is.
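
The naming rule for the default directory can be sketched as one shell function. `default_out_dir` is a hypothetical name, and the set of characters treated as safe (letters, digits, dot, underscore, hyphen) is an assumption, since the rule above only says "spaces and shell-unsafe characters".

```shell
# Derive ~/MinerU-Skill/<name>_<hash>/ from a source path or URL.
# Usage: default_out_dir <source>
default_out_dir() {
  src="$1"
  case "$src" in
    http://*|https://*)
      # URL: last path segment, e.g. .../pdf/2509.22186 -> 2509.22186
      name="${src%/}"
      name="${name##*/}"
      ;;
    *)
      # Local file: filename without extension, e.g. report.pdf -> report
      name="$(basename "$src")"
      name="${name%.*}"
      ;;
  esac
  # Assumption: anything outside [A-Za-z0-9._-] becomes _, then runs of _ collapse.
  name="$(printf '%s' "$name" | sed -e 's/[^A-Za-z0-9._-]/_/g' -e 's/__*/_/g')"
  # First 6 hex characters of the MD5 of the full original source.
  if command -v md5sum >/dev/null 2>&1; then
    hash="$(printf '%s' "$src" | md5sum | cut -c1-6)"   # Linux
  else
    hash="$(printf '%s' "$src" | md5 | cut -c1-6)"      # macOS
  fi
  printf '%s/MinerU-Skill/%s_%s/\n' "$HOME" "$name" "$hash"
}
```

The sketch uses printf '%s' rather than echo -n so that no trailing newline or literal "-n" ends up in the hashed input on shells where echo does not accept -n.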

Skill upgrade = CLI upgrade

When the user asks to upgrade this skill, re-install the CLI first:

npm install -g mineru-open-api@latest

Exit codes

| Code | Meaning | Recovery |
|------|---------|----------|
| 0 | Success |  |
| 1 | General API or unknown error | Check network; retry; use --verbose |
| 2 | Invalid parameters / usage error | Check command syntax and flag values |
| 3 | Authentication error | Create or refresh token at https://mineru.net/apiManage/token, then run mineru-open-api auth |
| 4 | File too large or page limit exceeded | Split the file or use --pages |
| 5 | Extraction failed | Document may be corrupted; try a different --model |
| 6 | Timeout | Increase with --timeout; large files may need 1600+ seconds |
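
A wrapper script can turn these codes into a readable hint. The function below is a sketch that restates the table; `explain_exit` is a hypothetical helper name, and the message wording is abbreviated from the Recovery column.

```shell
# Map a mineru-open-api exit code to a short recovery hint.
# Usage: explain_exit <code>
explain_exit() {
  case "$1" in
    0) echo "Success" ;;
    1) echo "General API or unknown error: check network, retry, or rerun with --verbose" ;;
    2) echo "Invalid parameters or usage error: check command syntax and flag values" ;;
    3) echo "Authentication error: create or refresh the token, then run mineru-open-api auth" ;;
    4) echo "File too large or page limit exceeded: split the file or use --pages" ;;
    5) echo "Extraction failed: document may be corrupted; try a different --model" ;;
    6) echo "Timeout: increase --timeout" ;;
    *) echo "Unrecognized exit code: $1" ;;
  esac
}

# Possible use (not executed here):
#   mineru-open-api extract report.pdf -o ./out/
#   explain_exit "$?"
```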

Troubleshooting

  • "no API token found": Run mineru-open-api auth or set MINERU_TOKEN env variable. Create token at https://mineru.net/apiManage/token.
  • Timeout on large files: Increase with --timeout 1600
  • Batch fails partially: Check stderr for per-file status; succeeded files are still saved
  • Binary format to stdout: Use -o flag; docx cannot stream to stdout
  • Private deployment: Use --base-url https://your-server.com/api
  • Extraction quality is poor: Try --model vlm for complex layouts, or --ocr for scanned documents
  • Tables not extracted correctly: Try --model vlm for better table recognition

Reporting Issues

  • Skill issues: Open an issue at https://github.com/opendatalab/MinerU-Ecosystem/tree/main/cli
  • CLI issues: Open an issue at https://github.com/MinerU-Extract/mineru-document-extractor

Version history

2 versions in total

  • v0.2.1 (current)
    2026-05-03 06:55 · Safe
  • v1.0.2
    2026-03-31 05:39

Security checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
