
mineru document extractor

MinerU document extraction: convert PDFs, scanned documents, images, Word (DOC/DOCX), PowerPoint (PPT/PPTX), Excel (XLS/XLSX), and web pages into clean Markdown.
mineru-extract

MinerU Document Extraction with mineru-open-api

MinerU is a document extraction tool. Install the mineru-open-api CLI to convert documents to Markdown from the command line.

Installation

npm install -g mineru-open-api

Or via Go (macOS/Linux):

go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Verify: mineru-open-api version
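
A script can check that the CLI is on PATH before calling it. The following is a minimal sketch; the helper name require_cmd is illustrative and not part of MinerU.

```shell
# require_cmd NAME: succeed if NAME resolves to a command on PATH.
# Hypothetical helper, not provided by mineru-open-api.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if require_cmd mineru-open-api; then
  # Documented verification command.
  mineru-open-api version
else
  echo "mineru-open-api not found; install it with: npm install -g mineru-open-api" >&2
fi
```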

Two MinerU extraction modes

| Feature | MinerU flash-extract | MinerU extract |
|---|---|---|
| Token required | No | Yes (mineru-open-api auth) |
| Speed | Fast | Normal |
| Table recognition | Yes | Yes |
| Formula recognition | Yes | Yes |
| OCR | Yes | Yes |
| Output formats | Markdown only | md, html, latex, docx, json |
| Batch mode | No | Yes |
| Model selection | pipeline | vlm, pipeline, MinerU-HTML |
| File size limit | 10 MB | Much higher |
| Page limit | 20 pages | Much higher |

Core MinerU workflow

  1. Start fast with MinerU (no token): mineru-open-api flash-extract for quick Markdown conversion
  2. Need more from MinerU? Create token at https://mineru.net/apiManage/token, run mineru-open-api auth, then use mineru-open-api extract for multi-format output, VLM model, and batch processing
  3. Web pages with MinerU: mineru-open-api crawl to convert web content
  4. Check results: output goes to stdout (default) or -o directory

Authentication

Only required for MinerU extract and crawl. Not needed for MinerU flash-extract.

mineru-open-api auth                    # Interactive token setup
export MINERU_TOKEN="your-token"        # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
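
The resolution order can be mirrored in a script to report where a token would come from. This is a sketch only: token_source is a hypothetical helper, it only checks that the config file exists without reading it, and mineru-open-api auth --show remains the authoritative way to see the token source.

```shell
# token_source FLAG_VALUE: print which source would supply the token,
# following the documented order: --token flag > MINERU_TOKEN > config file.
# Hypothetical helper; it does not read or validate the token itself.
token_source() {
  if [ -n "${1:-}" ]; then
    echo "flag"
  elif [ -n "${MINERU_TOKEN:-}" ]; then
    echo "env"
  elif [ -f "${HOME}/.mineru/config.yaml" ]; then
    echo "config"
  else
    echo "none"
  fi
}
```

Calling token_source "" with no --token value reports env, config, or none depending on the current environment.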

Supported input formats

MinerU accepts a wide range of document formats:

| Format | MinerU flash-extract | MinerU extract |
|---|:-:|:-:|
| PDF (.pdf) | Yes | Yes |
| Images (.png, .jpg, .jpeg, .jp2, .webp, .gif, .bmp) | Yes | Yes |
| Word (.docx) | Yes | Yes |
| Word (.doc) | No | Yes |
| PowerPoint (.pptx) | Yes | Yes |
| PowerPoint (.ppt) | No | Yes |
| Excel (.xlsx) | Yes | Yes |
| Excel (.xls) | No | Yes |
| HTML (.html) | No | Yes |
| URLs (remote files) | Yes | Yes |

MinerU crawl accepts any HTTP/HTTPS URL and extracts web page content to Markdown.

MinerU flash-extract — Quick extraction (no token needed)

Fast, token-free MinerU document extraction. Outputs Markdown only. Limited to 10 MB / 20 pages per file.

mineru-open-api flash-extract report.pdf                     # MinerU Markdown to stdout
mineru-open-api flash-extract report.pdf -o ./out/           # Save to file
mineru-open-api flash-extract https://example.com/doc.pdf    # URL mode
mineru-open-api flash-extract report.pdf --language en       # Specify language
mineru-open-api flash-extract report.pdf --pages 1-10        # Page range

Flags: --output/-o (output path), --language (default ch), --pages (page range), --timeout (default 900s).

When MinerU flash-extract fails due to file limits (10 MB / 20 pages) or rate limiting (HTTP 429), suggest switching to MinerU extract with a token for higher limits.
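
The size limit can be checked locally before choosing a command. The sketch below assumes "10 MB" means 10 * 1024 * 1024 bytes, which the source does not specify. It does not check the 20-page limit or handle HTTP 429, and both function names are hypothetical.

```shell
# Assumption: "10 MB" is interpreted as 10 * 1024 * 1024 bytes.
FLASH_MAX_BYTES=$((10 * 1024 * 1024))

# fits_flash_size FILE: succeed if FILE is within the flash-extract size limit.
# The 20-page limit is not checked here.
fits_flash_size() {
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -le "$FLASH_MAX_BYTES" ]
}

# suggest_command FILE: print the command this sketch would try first.
suggest_command() {
  if fits_flash_size "$1"; then
    echo "mineru-open-api flash-extract \"$1\""
  else
    echo "mineru-open-api extract \"$1\""
  fi
}
```

suggest_command only prints a command line; running it is left to the caller, so the choice can be reviewed first.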

MinerU extract — Precision extraction (token required)

Convert documents to Markdown or other formats with MinerU's full capabilities: VLM-based layout analysis, multiple output formats, and batch mode.

mineru-open-api extract report.pdf                         # MinerU Markdown to stdout
mineru-open-api extract report.pdf -f html                 # MinerU HTML output
mineru-open-api extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru-open-api extract *.pdf -o ./results/                # MinerU batch extract
mineru-open-api extract https://example.com/doc.pdf        # Extract from URL

Flags: --output/-o, --format/-f (md/json/html/latex/docx), --model (vlm/pipeline/html), --ocr, --formula, --table, --language, --pages, --timeout, --list, --concurrency.
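
Several of these flags can be combined in one call. The sketch below only prints the command line (a dry run) and checks the format list against the values named above. The helpers valid_formats and build_extract_cmd are hypothetical, and the flag values used are the ones shown elsewhere in this document.

```shell
# valid_formats LIST: succeed if every comma-separated entry is a documented format.
valid_formats() {
  [ -n "$1" ] || return 1
  for fmt in $(echo "$1" | tr ',' ' '); do
    case "$fmt" in
      md|json|html|latex|docx) ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# build_extract_cmd FILE OUTDIR FORMATS MODEL: print an extract command line
# without running it. Hypothetical helper for illustration.
build_extract_cmd() {
  valid_formats "$3" || { echo "unsupported format in: $3" >&2; return 1; }
  printf 'mineru-open-api extract "%s" -o "%s" -f %s --model %s\n' "$1" "$2" "$3" "$4"
}

build_extract_cmd "report.pdf" "./out/" "md,docx" "vlm"
```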

MinerU model comparison: vlm vs pipeline

| Aspect | MinerU vlm | MinerU pipeline |
|---|---|---|
| Parsing accuracy | Higher (better at complex layouts) | Standard |
| Hallucination risk | May produce hallucinated text in rare cases | No hallucination |

Use MinerU --model vlm for complex formatting. Use MinerU --model pipeline for no-hallucination reliability.

MinerU crawl — Web page extraction (token required)

mineru-open-api crawl https://example.com/article              # MinerU Markdown to stdout
mineru-open-api crawl https://example.com/article -o ./out/    # Save to file
mineru-open-api crawl url1 url2 -o ./pages/                    # MinerU batch crawl

Flags: --output/-o, --format/-f (md/json/html), --timeout, --list, --concurrency.

MinerU auth — Authentication management

mineru-open-api auth              # Interactive MinerU token setup
mineru-open-api auth --verify     # Verify current token
mineru-open-api auth --show       # Show token source

Output behavior

Without -o: MinerU result → stdout, progress → stderr. With -o: saved to file/directory. Batch mode and binary formats (docx) require -o.
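
The rule that batch mode and docx output require -o can be encoded as a small check. This is a sketch under the assumption that docx is the only binary format, since it is the only one the text names; needs_output_flag is a hypothetical helper.

```shell
# needs_output_flag FORMATS INPUT_COUNT: succeed if -o must be given.
# Assumption: docx is the only binary format (the only one the text names).
needs_output_flag() {
  [ "$2" -gt 1 ] && return 0          # batch mode: more than one input
  case ",$1," in
    *,docx,*) return 0 ;;             # binary output cannot go to stdout
  esac
  return 1
}

# For a single Markdown result, stdout and stderr can be split with plain
# shell redirection, for example:
#   mineru-open-api flash-extract report.pdf > report.md 2> progress.log
```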

Agent rules for using MinerU

  • Quote file paths with spaces: mineru-open-api extract "report 01.pdf"
  • Default to MinerU flash-extract when: no token configured, simple extraction, file under 10 MB / 20 pages
  • Use MinerU extract when: user needs non-Markdown formats, VLM model, batch processing, or file exceeds flash-extract limits
  • When user does NOT specify -o, generate an output directory of the form ~/MinerU-Skill/_/, where the hash component is the first 6 chars of the MD5 of the source path
  • After MinerU flash-extract success, append a brief hint about MinerU extract upgrade path (once per session)
  • To upgrade MinerU, re-install the CLI binary first: npm install -g mineru-open-api

For full CLI reference and troubleshooting, see: https://github.com/opendatalab/MinerU-Ecosystem/tree/main/cli
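
The default output directory rule can be sketched as follows. The exact directory name pattern is not recoverable from the text, so the name-plus-hash layout used here is an assumption, as is hashing the path exactly as given with no trailing newline. Both helper names are hypothetical.

```shell
# md5_prefix STRING: print the first 6 hex characters of the MD5 of STRING.
md5_prefix() {
  if command -v md5sum >/dev/null 2>&1; then
    printf '%s' "$1" | md5sum | cut -c1-6
  else
    printf '%s' "$1" | md5 -q | cut -c1-6   # macOS fallback
  fi
}

# default_out_dir SOURCE: print an output directory under ~/MinerU-Skill/.
# Assumption: the directory is named <basename-without-extension>_<hash>.
default_out_dir() {
  name=$(basename "$1")
  name=${name%.*}
  echo "${HOME}/MinerU-Skill/${name}_$(md5_prefix "$1")"
}
```

The result can be passed to -o, for example: mineru-open-api flash-extract report.pdf -o "$(default_out_dir report.pdf)".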

Supported --language values

The --language flag accepts the following values (default: ch). Used by both MinerU flash-extract and extract.

Standalone language packs

| Value | Included languages | Notes |
|---|---|---|
| ch | Chinese, English, Chinese Traditional | Chinese and English (default) |
| ch_server | Chinese, English, Chinese Traditional, Japanese | Traditional Chinese, handwriting |
| en | English | English only |
| japan | Chinese, English, Chinese Traditional, Japanese | Primarily Japanese |
| korean | Korean, English | Korean |
| chinese_cht | Chinese, English, Chinese Traditional, Japanese | Primarily Traditional Chinese |
| ta | Tamil, English | Tamil |
| te | Telugu, English | Telugu |
| ka | Kannada | Kannada |
| el | Greek, English | Greek |
| th | Thai, English | Thai |

Language family packs

| Value | Script/Family | Included languages |
|---|---|---|
| latin | Latin script | French, German, Afrikaans, Italian, Spanish, Bosnian, Portuguese, Czech, Welsh, Danish, Estonian, Irish, Croatian, Uzbek, Hungarian, Serbian (Latin), Indonesian, Occitan, Icelandic, Lithuanian, Maori, Malay, Dutch, Norwegian, Polish, Slovak, Slovenian, Albanian, Swedish, Swahili, Tagalog, Turkish, Latin, Azerbaijani, Kurdish, Latvian, Maltese, Pali, Romanian, Vietnamese, Finnish, Basque, Galician, Luxembourgish, Romansh, Catalan, Quechua |
| arabic | Arabic script | Arabic, Persian, Uyghur, Urdu, Pashto, Kurdish, Sindhi, Balochi, English |
| cyrillic | Cyrillic script | Russian, Belarusian, Ukrainian, Serbian (Cyrillic), Bulgarian, Mongolian, Abkhazian, Adyghe, Kabardian, Avar, Dargin, Ingush, Chechen, Lak, Lezgin, Tabasaran, Kazakh, Kyrgyz, Tajik, Macedonian, Tatar, Chuvash, Bashkir, Malian, Moldovan, Udmurt, Komi, Ossetian, Buryat, Kalmyk, Tuvan, Sakha, Karakalpak, English |
| east_slavic | East Slavic | Russian, Belarusian, Ukrainian, English |
| devanagari | Devanagari script | Hindi, Marathi, Nepali, Bihari, Maithili, Angika, Bhojpuri, Magahi, Santali, Newari, Konkani, Sanskrit, Haryanvi, English |
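
A script that knows a document's language by name can map it to a --language value. The sketch below covers only a handful of entries from the tables above. The helper language_pack is hypothetical, and where a language appears in more than one pack (Russian is in both cyrillic and east_slavic) the choice made here is arbitrary.

```shell
# language_pack NAME: print the --language value for a lowercase language name.
# Covers a small sample of the documented packs; returns non-zero if unknown.
language_pack() {
  case "$1" in
    english) echo en ;;
    chinese) echo ch ;;
    japanese) echo japan ;;
    korean) echo korean ;;
    french|german|spanish|italian|portuguese) echo latin ;;
    arabic|persian|urdu) echo arabic ;;
    russian|ukrainian|belarusian) echo east_slavic ;;   # also covered by cyrillic
    bulgarian|kazakh|mongolian) echo cyrillic ;;
    hindi|marathi|nepali) echo devanagari ;;
    *) return 1 ;;
  esac
}
```

For example, mineru-open-api flash-extract report.pdf --language "$(language_pack french)" would pass latin.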

Version history

3 versions in total

  • v0.1.30 (current)
    2026-05-12 04:17, security check: safe
  • v0.1.29
    2026-04-30 07:26, security check: safe
  • v1.0.7
    2026-03-29 11:25
