
PDF Utils

PDF Utils enables OCR of image-based PDFs, extraction of arXiv IDs from text or OCR output, and scriptable PDF tasks like merging, splitting, and rendering.
Author: wangwllu
Uncategorized · clawhub · v1.0.1 · 1 version · 100000 · Key: not required

Overview

PDF Utils

Use this skill for local, scriptable PDF processing. It is a stable 1.x skill for OCR, arXiv reference mining, and repeatable PyMuPDF workflows. Prefer the built-in pdf tool for AI-style reading, summarization, question-answering, and semantic analysis of PDF content.

Choose the right tool

  • Use the built-in pdf tool for summary, Q&A, extraction by meaning, or general document understanding.
  • Use scripts/extract_refs.py when the PDF already has extractable text and you need arXiv IDs or batch downloads.
  • Use scripts/ocr_pdf.py when the PDF is scanned/image-based and text extraction is poor or empty.
  • Use scripts/pdf_ops.py for repeatable local PDF operations such as merge, split, and rendering a page to an image.
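The choice between text extraction and OCR above can be made programmatically. The following is a minimal sketch, not the skill's own logic: `needs_ocr`, `extract_page_texts`, and the 50-characters-per-page threshold are hypothetical names and values chosen for illustration. It assumes PyMuPDF is installed and treats a PDF as scanned when its text layer is nearly empty.

```python
from typing import Iterable, List


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 50) -> bool:
    """Return True when the average extractable text per page is below the threshold."""
    texts = list(page_texts)
    if not texts:
        return True  # no pages or no text layer at all
    total_chars = sum(len(t.strip()) for t in texts)
    return total_chars / len(texts) < min_chars_per_page


def extract_page_texts(pdf_path: str) -> List[str]:
    """Collect the text layer of every page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A caller might run `needs_ocr(extract_page_texts("paper.pdf"))` and pick scripts/ocr_pdf.py when it returns True, scripts/extract_refs.py otherwise. The threshold is a guess and may need tuning for documents with sparse pages.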

Core workflows

Extract arXiv IDs from a text PDF

Run:

python3 scripts/extract_refs.py paper.pdf

If needed, download the referenced papers:

python3 scripts/extract_refs.py paper.pdf --download --out ~/papers/
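For readers who want to see what arXiv ID extraction can look like, here is a rough sketch using regular expressions. It is not taken from scripts/extract_refs.py, whose actual patterns may differ; `find_arxiv_ids` and both patterns are illustrative assumptions. It covers new-style IDs (YYMM.NNNN or YYMM.NNNNN) and old-style IDs (archive/YYMMNNN), and drops any version suffix.

```python
import re
from typing import List

# New-style IDs: two-digit year, valid month, then 4 or 5 digits (e.g. 1706.03762).
NEW_STYLE = re.compile(r"(?<![\d.])(\d{2}(?:0[1-9]|1[0-2])\.\d{4,5})(?:v\d+)?(?!\d)")
# Old-style IDs: archive name, optional subject class, slash, 7 digits (e.g. hep-th/9901001).
OLD_STYLE = re.compile(r"\b([a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?\b")


def find_arxiv_ids(text: str) -> List[str]:
    """Return unique arXiv IDs in order of first appearance, without version suffixes."""
    hits = []
    for pattern in (NEW_STYLE, OLD_STYLE):
        hits.extend((m.start(), m.group(1)) for m in pattern.finditer(text))
    ordered: List[str] = []
    for _, arxiv_id in sorted(hits):
        if arxiv_id not in ordered:
            ordered.append(arxiv_id)
    return ordered
```

The month check and the digit lookarounds reduce false positives from ordinary decimals, but numbers that happen to fit the pattern can still slip through, so results are worth a manual check.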

OCR a scanned PDF

Run OCR on all pages:

python3 scripts/ocr_pdf.py paper.pdf --all

To OCR and immediately extract arXiv IDs from the OCR output:

python3 scripts/ocr_pdf.py paper.pdf --all --extract-refs
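The OCR step can be approximated in a few lines with the libraries listed under Dependencies. This is a sketch under assumptions, not the contents of scripts/ocr_pdf.py: `ocr_pdf`, `zoom_for_dpi`, and the 300 DPI default are hypothetical. It renders each page to an image with PyMuPDF and passes the image to Tesseract through pytesseract.

```python
import io
from typing import List


def zoom_for_dpi(dpi: int) -> float:
    """PDF user space is 72 points per inch, so the render zoom is dpi / 72."""
    return dpi / 72.0


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Render every page to a PNG in memory and OCR it; returns one string per page."""
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    texts: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts
```

Higher DPI usually helps recognition on small print at the cost of memory and time per page.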

Dependencies

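Install these before using OCR features. The commands assume macOS with Homebrew; on other systems, install Tesseract and its language data with your package manager. The --break-system-packages flag lets pip install into an externally managed Python; a virtual environment avoids needing it: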

brew install tesseract
brew install tesseract-lang
pip3 install pytesseract Pillow pymupdf --break-system-packages
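Before running OCR, it can help to confirm that the pieces above are actually installed. The check below is a small sketch; `missing_dependencies` is a hypothetical helper, not part of the skill. It relies on the fact that the pip packages Pillow and pymupdf are imported as `PIL` and `fitz`, and that pytesseract needs the `tesseract` binary on PATH.

```python
import importlib.util
import shutil
from typing import Dict, List, Optional

# pip package name -> importable module name
REQUIRED_MODULES = {"pytesseract": "pytesseract", "Pillow": "PIL", "pymupdf": "fitz"}


def missing_dependencies(
    modules: Optional[Dict[str, str]] = None,
    binary: Optional[str] = "tesseract",
) -> List[str]:
    """Return the names of Python packages and binaries that cannot be found."""
    modules = REQUIRED_MODULES if modules is None else modules
    missing = [pkg for pkg, mod in modules.items()
               if importlib.util.find_spec(mod) is None]
    # Pass binary=None to skip the executable check.
    if binary is not None and shutil.which(binary) is None:
        missing.append(f"{binary} (binary)")
    return missing
```

An empty list from `missing_dependencies()` suggests the OCR path should at least start; it does not verify that the needed Tesseract language data is present.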

Read more only if needed

  • Read references/usage.md for CLI examples, programmatic API notes, PDF ops usage, and known limits.
  • Read the scripts directly if you need to patch behavior or reuse helper functions.

Practical guidance

  • For very large PDFs, OCR in page ranges or batches instead of all at once.
  • For handwritten or low-resolution scans, expect OCR quality to drop.
  • If a PDF yields partial references, inspect the reference pages first instead of assuming extraction is complete.
  • For merge/split/page rendering, try scripts/pdf_ops.py before writing one-off snippets.
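The advice on batching large PDFs can be sketched as follows. This is an illustration under assumptions rather than how scripts/pdf_ops.py works: `page_batches`, `split_pdf`, the batch size of 20, and the output naming scheme are all hypothetical. It splits a PDF into fixed-size page ranges with PyMuPDF so that each part can be OCR'd separately.

```python
from typing import List, Tuple


def page_batches(total_pages: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return inclusive, 0-based (first, last) page ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [(start, min(start + batch_size, total_pages) - 1)
            for start in range(0, total_pages, batch_size)]


def split_pdf(pdf_path: str, out_prefix: str, batch_size: int = 20) -> List[str]:
    """Write one PDF per batch and return the output paths."""
    import fitz  # PyMuPDF

    outputs: List[str] = []
    with fitz.open(pdf_path) as src:
        for first, last in page_batches(src.page_count, batch_size):
            part = fitz.open()  # new empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            # File names use 1-based page numbers for readability.
            out_path = f"{out_prefix}_{first + 1:04d}-{last + 1:04d}.pdf"
            part.save(out_path)
            part.close()
            outputs.append(out_path)
    return outputs
```

For example, `page_batches(45, 20)` yields three ranges, the last one shorter than the others.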

Version history

1 version in total

  • v1.0.1 (current)
    2026-03-30 21:33 · security status: safe

Security checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
