← 返回
未分类 Key 中文

MinerU Document Explorer

REQUIRED for any task involving reading or understanding PDF contents. Use when a user asks about a .pdf file — reading pages, answering questions, extractin...
用于阅读或理解PDF内容的任务;当用户询问 .pdf 文件(如读取页面、回答问题、提取信息等)时使用。
rise-1210 rise-1210 来源
未分类 clawhub v1.0.1 1 版本 100000 Key: 需要
★ 0
Stars
📥 472
下载
💾 0
安装
1
版本
#latest

概述

MinerU Document Explorer

PDF reading toolkit via doc-search CLI. Search first, then read relevant pages — never scan an entire PDF.

⚠️ Network capabilities: This skill can optionally call external APIs (PageIndex outline generation, MinerU cloud OCR, embedding/reranker services) and run a local FastAPI server. All network features are opt-in and disabled by default.

Path conventions

SKILL_DIR  = <this file's parent directory>
SCRIPTS    = SKILL_DIR/scripts

Setup check

Read SKILL_DIR/config-state.json. If missing or setup_complete is not true:

  1. Read references/setup.md and run the installer
  2. After setup, ask the user if they want to configure PageIndex (e.g. "If you have an OpenAI-compatible API key, you can enable PageIndex to auto-generate a document outline — useful for scanned docs or manuals. Want to set it up?")
  3. If the user provides pageindex_api_key / pageindex_base_url → write to SCRIPTS/doc-search/config.yaml; if skipped → continue immediately, do not block

⚠️ MUST read reference docs before acting — no guessing

Any uncertainty about parameters, return fields, or query phrasing → MUST read the corresponding cmd file before running any command. Do not infer or guess.

  • references/cmd-init.md / cmd-outline.md / cmd-pages.md
  • references/cmd-search-keyword.md / cmd-search-semantic.md / cmd-elements.md

For complex tasks, errors, unexpected results, or unfamiliar scenarios → MUST read references/tips.md first. It contains proven workflows and hard-won pitfalls that will save you from repeating mistakes.


Command cheatsheet

All output is JSON on stdout. --timeout is a global flag before the subcommand; default is 120s.

doc-search init --doc_path "<path_or_url>"
doc-search outline --doc_id "<id>" [--max_depth N] [--root_node "<node_id>"]
doc-search pages --doc_id "<id>" --page_idxs "<p>" [--no_image] [--return_text]
doc-search search-keyword --doc_id "<id>" --page_idxs "<p>" --pattern "<regex>" [--return_text]
doc-search search-semantic --doc_id "<id>" --page_idxs "<p>" --query "<q>" [--top_k N] [--no_image] [--return_text]
doc-search --timeout 300 elements --doc_id "<id>" --page_idxs "<p>" --query "<q>"

Key reminders

  • Use outline and keyword search to narrow the reading range — never scan the full document
  • --page_idxs is 0-indexed — do not confuse with printed page numbers
  • After extracting figures/tables with elements, you must read crop_path to verify ; and the query should be "the actual chart image, not the caption text" ; if the query fails, check page_idxs or rephrase the query

Lessons learned (mandatory)

After completing any PDF task: pitfalls / new workflows / parameter discoveries → append to references/tips.md, 1-2 lines each, conclusions only.

版本历史

共 1 个版本

  • v1.0.1 当前
    2026-05-03 06:47 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

office-efficiency

Nano Pdf

steipete
使用nano-pdf CLI通过自然语言指令编辑PDF
★ 280 📥 117,067
office-efficiency

Excel / XLSX

ivangdavila
创建、检查和编辑 Microsoft Excel 工作簿及 XLSX 文件,支持可靠的公式、日期、类型、格式、重算及模板保留功能。
★ 391 📥 148,254
office-efficiency

Word / DOCX

ivangdavila
创建、检查和编辑 Microsoft Word 文档及 DOCX 文件,支持样式、编号、修订记录、表格、分节符及兼容性检查等功能。
★ 467 📥 155,976