
Extract PDF Text

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
ivangdavila
Productivity · clawhub v1.0.2 · 1 version · Key: not required

Overview

When to Use

Use this skill when an agent needs to extract text from PDFs. PyMuPDF (fitz) provides fast local extraction and works with text-based documents, forms, and complex layouts; scanned pages additionally need OCR.

Quick Reference

Topic            File
---------------  ------------------
Code examples    examples.md
OCR setup        ocr.md
Troubleshooting  troubleshooting.md

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

import fitz

with fitz.open("document.pdf") as doc:  # the document is closed automatically
    text = "".join(page.get_text() for page in doc)

3. Pick the Right Method

PDF Type    Method
----------  ------------------------------------
Text-based  page.get_text() (fast, accurate)
Scanned     OCR with pytesseract (slower)
Mixed       Check each page, use OCR when needed
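
The table above covers plain text only. For the forms and tables mentioned in the description, a minimal sketch might look like the following. It assumes a PyMuPDF version that provides page.find_tables() (1.23.0 or later); form_fields and table_rows are hypothetical helper names, not part of the library.

```python
def form_fields(page):
    """Map each form field name on the page to its current value."""
    # page.widgets() yields the page's form fields (Widget objects)
    return {w.field_name: w.field_value for w in page.widgets()}


def table_rows(page):
    """Return every detected table on the page as a list of rows."""
    # find_tables() returns a finder whose .tables list holds the detected tables;
    # extract() gives one list of cell values per row
    return [table.extract() for table in page.find_tables().tables]
```

Table detection is heuristic, so results on borderless or irregular tables should be checked by hand.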

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
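
The check above can be wired into a per-page fallback. This is a minimal sketch, assuming pytesseract and Pillow are installed together with the Tesseract binary; text_or_ocr and tesseract_ocr are hypothetical helper names, and the 50-character threshold is the same heuristic as above rather than a tuned value.

```python
def tesseract_ocr(page, dpi=300):
    """OCR one PyMuPDF page with Tesseract (assumes pytesseract and Pillow)."""
    # Imported here so the rest of the module works without the OCR stack.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # render the page to an RGB raster image
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)


def text_or_ocr(page, ocr=tesseract_ocr, min_chars=50):
    """Return (text, method): the embedded text layer if present, else OCR output."""
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"
    return ocr(page), "ocr"
```

Passing the OCR function as a parameter keeps the text path free of OCR dependencies and makes it easy to swap in a different engine.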

5. Handle Errors Gracefully

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass, then authenticate.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
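
Password-protected files can also be handled explicitly through the document's needs_pass attribute and authenticate() method. A minimal sketch, where unlock is a hypothetical helper name:

```python
def unlock(doc, password=None):
    """Return True if the document is readable, authenticating if it is encrypted."""
    if not doc.needs_pass:
        return True  # not password-protected, nothing to do
    if password is None:
        return False  # protected and no password supplied
    # authenticate() returns 0 on failure and a positive value on success
    return doc.authenticate(password) > 0
```

A caller would open the file with fitz.open(path) first and skip extraction when unlock(doc, password) returns False.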

Extraction Traps

Trap                 What Happens           Fix
-------------------  ---------------------  ---------------------------------------------------
OCR on text PDF      Slow + worse accuracy  Check get_text() first
Forget to close doc  Memory leak            Use with or doc.close()
Assume page order    Wrong reading flow     Use sort=True in get_text()
Ignore encoding      Garbled characters     get_text() returns Unicode; save files as UTF-8
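
To illustrate the reading-order trap, the sketch below orders blocks by hand, roughly what sort=True does for you. It assumes tuples in the shape returned by page.get_text("blocks"), that is (x0, y0, x1, y1, text, block_no, block_type); blocks_in_reading_order is a hypothetical helper name, and a simple top-to-bottom sort will not handle multi-column layouts.

```python
def blocks_in_reading_order(blocks):
    """Sort text blocks top-to-bottom, then left-to-right, and return their text."""
    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 = text, 1 = image
    ordered = sorted(text_blocks, key=lambda b: (round(b[1]), b[0]))  # (y0, x0)
    return [b[4].strip() for b in ordered]
```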

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

  • Gives code examples for PyMuPDF
  • Explains OCR setup when needed
  • Troubleshoots common issues

This skill NEVER:

  • Accesses files without user request
  • Sends data externally
  • Modifies original PDFs

Security & Privacy

All processing is local:

  • PyMuPDF runs entirely on your machine
  • No external API calls
  • No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
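
The span sizes in the dict output can drive simple structure detection. As a sketch, the hypothetical helper below collects spans at or above a font size; the 14-point default is an assumed threshold that would need adjusting per document.

```python
def find_headings(blocks, min_size=14.0):
    """Return the text of every span whose font size is at least min_size."""
    headings = []
    for block in blocks:
        if block.get("type") != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if text and span["size"] >= min_size:
                    headings.append(text)
    return headings
```

It would be called as find_headings(page.get_text("dict")["blocks"]).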

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)
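
The parsed JSON has the same block, line, and span nesting as the dict output. A minimal sketch of rebuilding text lines from it, with lines_from_json as a hypothetical helper name:

```python
import json


def lines_from_json(data):
    """Rebuild text lines from the string returned by page.get_text("json")."""
    parsed = json.loads(data)
    lines = []
    for block in parsed.get("blocks", []):
        for line in block.get("lines", []):  # image blocks have no "lines" key
            lines.append("".join(span["text"] for span in line["spans"]))
    return lines
```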

Full Example

import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    results = []

    with fitz.open(path) as doc:  # closed automatically, even if a page fails
        for i, page in enumerate(doc):
            text = page.get_text()
            method = "text"

            # If very little text, the page is probably scanned
            if len(text.strip()) < 50:
                # OCR would go here (see ocr.md)
                method = "needs_ocr"

            results.append({
                "page": i + 1,
                "text": text,
                "method": method
            })

    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
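
The returned dictionary can then be persisted. A minimal sketch, assuming the result shape produced by extract_pdf above; save_result and the page separator format are hypothetical choices.

```python
from pathlib import Path


def save_result(result, out_path):
    """Write extracted pages to a UTF-8 text file and return pages flagged for OCR."""
    parts = []
    for entry in result["content"]:
        parts.append(f"--- Page {entry['page']} ---\n{entry['text'].rstrip()}\n")
    # Explicit UTF-8 avoids garbled characters on platforms with another default
    Path(out_path).write_text("\n".join(parts), encoding="utf-8")
    return [e["page"] for e in result["content"] if e["method"] == "needs_ocr"]
```

The returned page numbers give a worklist for a later OCR pass.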

Feedback

  • Useful? clawhub star extract-pdf-text
  • Stay updated: clawhub sync

Version History

1 version in total

  • v1.0.2 (current)
    2026-03-29 02:26, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
