PDF Utils

Use this skill for local, scriptable PDF processing. It is a stable 1.x skill for OCR, arXiv reference mining, and repeatable PyMuPDF workflows. Prefer the built-in pdf tool for AI-style reading, summarization, question-answering, and semantic analysis of PDF content.

Choose the right tool

Use the built-in pdf tool for summary, Q&A, extraction by meaning, or general document understanding.
Use scripts/extract_refs.py when the PDF already has extractable text and you need arXiv IDs or batch downloads.
Use scripts/ocr_pdf.py when the PDF is scanned/image-based and text extraction is poor or empty.
Use scripts/pdf_ops.py for repeatable local PDF operations such as merge, split, and rendering a page to an image.
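The choice between scripts/extract_refs.py and scripts/ocr_pdf.py depends on whether the PDF already carries usable text. A minimal sketch of that check follows. The function names and the 50-characters-per-page threshold are illustrative assumptions and are not part of the skill's scripts; only the PyMuPDF calls (fitz.open, page.get_text) are real library APIs.

```python
def needs_ocr(page_texts, min_chars_per_page=50):
    """Return True when extracted text is too sparse to be useful.

    The threshold is an assumed heuristic, not a value used by the skill.
    """
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_page_texts(pdf_path):
    """Read the embedded text of every page with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so needs_ocr works without it

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

If needs_ocr(extract_page_texts("paper.pdf")) returns True, the PDF is probably scanned and scripts/ocr_pdf.py is the better starting point.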

Core workflows

Extract arXiv IDs from a text PDF

Run:

python3 scripts/extract_refs.py paper.pdf

If needed, download the referenced papers:

python3 scripts/extract_refs.py paper.pdf --download --out ~/papers/
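To give a rough idea of what arXiv ID mining involves, here is a small sketch that pulls IDs out of plain text with a regular expression. It is not the implementation in scripts/extract_refs.py, which is not shown here. The pattern and the find_arxiv_ids name are assumptions, and the pattern only accepts IDs introduced by "arXiv:" or an arxiv.org URL.

```python
import re

# New-style IDs look like 1706.03762 (optionally with v5);
# old-style IDs look like hep-th/9711200.
ARXIV_ID = re.compile(
    r"(?:arXiv:\s*|arxiv\.org/(?:abs|pdf)/)"
    r"((?:\d{4}\.\d{4,5}|[a-z\-]+(?:\.[A-Z]{2})?/\d{7})(?:v\d+)?)",
    re.IGNORECASE,
)


def find_arxiv_ids(text):
    """Return arXiv IDs in order of first appearance, without duplicates."""
    seen = []
    for match in ARXIV_ID.finditer(text):
        arxiv_id = match.group(1)
        if arxiv_id not in seen:
            seen.append(arxiv_id)
    return seen
```

Bare IDs without a prefix are deliberately skipped to avoid matching unrelated numbers such as DOIs or page ranges, so a reference list that omits the "arXiv:" label would need a looser pattern.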

OCR a scanned PDF

Run OCR on all pages:

python3 scripts/ocr_pdf.py paper.pdf --all

To OCR and immediately extract arXiv IDs from the OCR output:

python3 scripts/ocr_pdf.py paper.pdf --all --extract-refs
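For orientation, OCR on a scanned PDF typically means rendering each page to an image and passing it to Tesseract. The sketch below shows that pattern with PyMuPDF, Pillow, and pytesseract. It is an assumption about the general approach, not the code in scripts/ocr_pdf.py; the function name, the 300 dpi default, and the "eng" language default are illustrative.

```python
import io


def ocr_pdf_pages(pdf_path, page_numbers=None, lang="eng", dpi=300):
    """OCR selected pages (0-based) and return {page_index: text}.

    With page_numbers=None every page is processed.
    """
    # Third-party imports are deferred so importing this module stays cheap.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = {}
    with fitz.open(pdf_path) as doc:
        indices = range(doc.page_count) if page_numbers is None else page_numbers
        for index in indices:
            # Render the page to a bitmap, then hand it to Tesseract.
            pixmap = doc[index].get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pixmap.tobytes("png")))
            texts[index] = pytesseract.image_to_string(image, lang=lang)
    return texts
```

Higher dpi values usually improve recognition on small print at the cost of time and memory.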

Dependencies

Install these before using OCR features:

brew install tesseract
brew install tesseract-lang
pip3 install pytesseract Pillow pymupdf --break-system-packages
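After installing, a quick check can confirm that the Tesseract binary and the three Python packages are visible. The helper below is a hypothetical convenience, not something shipped with the skill. It assumes the usual import names: pytesseract, PIL for Pillow, and fitz for pymupdf.

```python
import importlib.util
import shutil

# Maps the pip package name to the module name used at import time.
REQUIRED_MODULES = {"pytesseract": "pytesseract", "Pillow": "PIL", "pymupdf": "fitz"}


def missing_ocr_dependencies():
    """Return the names of OCR dependencies that cannot be found."""
    missing = []
    if shutil.which("tesseract") is None:
        missing.append("tesseract (binary)")
    for package, module in REQUIRED_MODULES.items():
        if importlib.util.find_spec(module) is None:
            missing.append(package)
    return missing


if __name__ == "__main__":
    print(missing_ocr_dependencies() or "All OCR dependencies found.")
```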

Practical guidance

For very large PDFs, OCR in page ranges or batches instead of all at once.
For handwritten or low-resolution scans, expect OCR quality to drop.
If a PDF yields partial references, inspect the reference pages first instead of assuming extraction is complete.
For merge, split, or page rendering, try scripts/pdf_ops.py before writing one-off snippets.
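The batching advice for very large PDFs can be sketched as a helper that splits a page count into fixed-size ranges. The page_batches name and the default batch size of 20 are assumptions. How a range is passed to scripts/ocr_pdf.py is not documented here, so the sketch only computes the ranges.

```python
def page_batches(page_count, batch_size=20):
    """Split page_count pages into inclusive, 0-based (first, last) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (start, min(start + batch_size, page_count) - 1)
        for start in range(0, page_count, batch_size)
    ]


# A 45-page scan in batches of 20 gives three ranges.
print(page_batches(45, 20))  # [(0, 19), (20, 39), (40, 44)]
```

Processing one range at a time keeps memory use bounded and makes it easy to resume after a failure.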

Version history

1 version in total

v1.0.1 (current)

2026-03-30 21:33

Security check

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
