
pdf-ocr-extraction

Extract text from image-based or scanned PDFs using Tesseract OCR.
bilicen700
Uncategorized · clawhub · v1.0.3 · 1 version · 100000 · Key: not required
★ 2 stars · 📥 1,048 downloads · 💾 224 installs · #latest

Overview

PDF OCR Extractor

Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. It is free, calls no third-party APIs, and has no usage limits. The skill renders each PDF page to an image and runs optical character recognition (OCR) on it.
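Whether a PDF actually lacks a text layer can be checked before spending time on OCR. The following is a minimal sketch, not part of the skill itself: the helper names `looks_scanned` and `native_page_texts` and the 20-character threshold are illustrative assumptions, and it relies on pypdfium2's text-page API (`get_textpage`, `get_text_range`).

```python
def looks_scanned(page_texts, min_chars=20):
    """Return True if every page has fewer than min_chars of native text.

    min_chars is an arbitrary illustrative threshold, not a value from the skill.
    """
    return all(len(text.strip()) < min_chars for text in page_texts)


def native_page_texts(pdf_path):
    """Collect the native (non-OCR) text of each page using pypdfium2."""
    import pypdfium2 as pdfium  # imported lazily so looks_scanned works without it

    doc = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for page in doc:
            textpage = page.get_textpage()
            texts.append(textpage.get_text_range())
            # Close child objects before the document itself
            textpage.close()
            page.close()
        return texts
    finally:
        doc.close()
```

Calling `looks_scanned(native_page_texts(path))` returns True when no page carries a meaningful amount of embedded text, which suggests OCR is worth running.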

Dependencies

This skill requires:

  1. System Binary: tesseract (along with required language data packs like chi_sim or eng).
  2. Python Packages: pypdfium2, pytesseract, and Pillow.

Note: Do not run automated pip install commands at runtime. Rely on the user or the environment to pre-install the dependencies defined in the metadata block.
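A preflight check can report missing dependencies without installing anything, in line with the note above. This is a sketch under stated assumptions: the function names `parse_langs` and `missing_dependencies` are hypothetical, and the parsing assumes that `tesseract --list-langs` prints a header line starting with "List of" followed by one language code per line.

```python
import importlib.util
import shutil
import subprocess


def parse_langs(output):
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    lines = [line.strip() for line in output.splitlines()]
    # Assumption: the header line starts with "List of"; everything else is a code
    return {line for line in lines if line and not line.startswith("List of")}


def missing_dependencies(required_langs=("chi_sim", "eng")):
    """Return a list of problems; an empty list means the environment looks ready."""
    problems = []
    for module in ("pypdfium2", "pytesseract", "PIL"):
        if importlib.util.find_spec(module) is None:
            problems.append(f"Python module '{module}' is not installed")
    if shutil.which("tesseract") is None:
        problems.append("tesseract binary not found on PATH")
        return problems
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Read both streams in case a build prints the list to stderr
    available = parse_langs(result.stdout + "\n" + result.stderr)
    for lang in required_langs:
        if lang not in available:
            problems.append(f"Tesseract language pack '{lang}' is missing")
    return problems
```

If the returned list is non-empty, the messages can be passed to the user as-is instead of attempting an installation.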

Quick Start

Create a Python script (e.g., extract.py) in a temporary directory to handle the extraction safely:

import pypdfium2 as pdfium
import pytesseract
from PIL import Image
import sys
import os

def extract(pdf_path):
    doc = pdfium.PdfDocument(pdf_path)
    full_text = []
    try:
        for i, page in enumerate(doc):
            # Render the page at 2x scale (about 144 DPI) for better OCR accuracy
            bitmap = page.render(scale=2)
            tmp_img = f"/tmp/page_{i}.png"
            bitmap.to_pil().save(tmp_img)
            try:
                # Run OCR (assumes English and Simplified Chinese packs are installed)
                with Image.open(tmp_img) as img:
                    text = pytesseract.image_to_string(img, lang='chi_sim+eng')
                full_text.append(text)
            finally:
                # Remove the temporary file even if OCR fails
                os.remove(tmp_img)
    finally:
        # Release the PDF handle
        doc.close()
    return "\n".join(full_text)

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(extract(sys.argv[1]))
    else:
        sys.exit("Usage: python3 extract.py /path/to/document.pdf")

Then execute the script:

python3 extract.py /path/to/document.pdf

Security & Sandbox Constraints

  • Write temporary images only to /tmp/ and clean them up immediately after extraction.
  • Do not attempt to dynamically download or install language packs via shell commands; notify the user if a specific language is missing.
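One way to make the cleanup rule hard to violate is to let the standard library own the temporary directory. The sketch below is an illustration rather than the skill's own code: the context manager name `page_image_path` and the `pdf_ocr_` prefix are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def page_image_path(index, base_dir="/tmp"):
    """Yield a path for one page image inside a private directory under base_dir.

    The directory and everything in it are deleted on exit, even on error.
    """
    with tempfile.TemporaryDirectory(prefix="pdf_ocr_", dir=base_dir) as workdir:
        yield os.path.join(workdir, f"page_{index}.png")
```

Inside a `with page_image_path(i) as path:` block, the rendered image can be saved to `path` and passed to `pytesseract.image_to_string`. A private, randomly named directory also avoids collisions between concurrent runs that would otherwise share names like /tmp/page_0.png.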

Version History

1 version in total

  • v1.0.3 (current)
    2026-03-30 05:56 · Safe · Safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
