
OpenDataLoader PDF

Parse PDFs into Markdown, JSON, or HTML with OCR, table extraction, and AI-enriched descriptions for building RAG pipelines and knowledge bases.
Author: zmy1006-sudo
Category: uncategorized | Source: clawhub | Version: v1.0.0 | API key: not required
Tags: ai, document, latest, parser, pdf, rag

Overview

OpenDataLoader PDF Skill

Quick Install

# Basic (CPU, ~20 pages/sec)
pip install -U opendataloader-pdf

# Hybrid mode (AI-enhanced, for complex docs, ~2 pages/sec)
pip install -U "opendataloader-pdf[hybrid]"

# LangChain integration
pip install langchain-opendataloader-pdf

Requirements: Java 11+ and Python 3.10+
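
Before installing, the requirements listed above can be checked from Python. The following is a minimal sketch using only the standard library. The helper name `check_requirements` is hypothetical, and the check only confirms that a `java` executable is on the PATH, not that it is version 11 or newer.

```python
import shutil
import sys
from typing import Dict


def check_requirements() -> Dict[str, bool]:
    """Report whether the interpreter and a Java executable look usable.

    Hypothetical helper: it does not verify the Java version, only that
    some `java` binary can be found on the PATH.
    """
    return {
        "python_3_10_plus": sys.version_info >= (3, 10),
        "java_on_path": shutil.which("java") is not None,
    }


status = check_requirements()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If `java_on_path` is reported as missing, install a JDK or JRE and make sure its `bin` directory is on the PATH before running the converter.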


Core Usage Patterns

1. Parse PDF → Markdown (best for RAG chunking)

from opendataloader_pdf import convert

convert(
    input_path=["file1.pdf", "folder/"],
    output_dir="output/",
    format="markdown"  # clean text, LLM-ready
)
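
Markdown output is convenient for RAG chunking because headings give natural section boundaries. The sketch below splits a Markdown string into (heading, body) pairs using only the standard library. It is independent of the library: `split_markdown_by_heading` is a hypothetical helper, and the sample text stands in for a converted file, since the exact output file names are not stated here.

```python
import re
from typing import List, Tuple

# ATX-style headings: one to six '#' characters followed by the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown_text: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs.

    Text before the first heading is returned with an empty heading.
    Caveat: a '#' line inside a fenced code block is treated as a heading.
    """
    sections: List[Tuple[str, str]] = []
    current_heading = ""
    current_lines: List[str] = []

    def flush() -> None:
        body = "\n".join(current_lines).strip()
        if current_heading or body:
            sections.append((current_heading, body))

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            current_heading = match.group(2).strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return sections


# Stand-in for the text of a converted Markdown file.
sample = "# Intro\nHello.\n\n## Methods\nStep one.\nStep two.\n"
sections = split_markdown_by_heading(sample)
for heading, body in sections:
    print(repr(heading), "->", repr(body))
```

To apply this to real output, read a converted `.md` file from the output directory and pass its text to the function.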

2. Parse PDF → JSON (with bounding boxes for citations)

from opendataloader_pdf import convert

convert(
    input_path=["report.pdf"],
    output_dir="output/",
    format="json",           # structured data + coordinates
    image_output="embedded"  # "off" | "embedded" | "external"
)
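
To use the coordinates for citations, the JSON output has to be walked to collect each text element together with its page and bounding box. The sketch below does this recursively over nested dictionaries and lists. The key names `"content"`, `"page number"`, `"bounding box"`, and `"kids"` are assumptions about the schema, so they are exposed as parameters. Inspect a real output file and adjust them as needed.

```python
from typing import Any, Dict, List


def collect_citations(
    node: Any,
    bbox_key: str = "bounding box",  # assumed key name
    page_key: str = "page number",   # assumed key name
    text_key: str = "content",       # assumed key name
) -> List[Dict[str, Any]]:
    """Collect every element that carries both text and a bounding box."""
    citations: List[Dict[str, Any]] = []
    if isinstance(node, dict):
        if bbox_key in node and text_key in node:
            citations.append(
                {
                    "text": node[text_key],
                    "page": node.get(page_key),
                    "bbox": node[bbox_key],
                }
            )
        # Descend into every value so nested children are found
        # regardless of which key holds them.
        for value in node.values():
            citations.extend(collect_citations(value, bbox_key, page_key, text_key))
    elif isinstance(node, list):
        for item in node:
            citations.extend(collect_citations(item, bbox_key, page_key, text_key))
    return citations


# Hand-written sample in the assumed shape, not real converter output.
sample = {
    "kids": [
        {"type": "heading", "content": "Results", "page number": 1,
         "bounding box": [72.0, 700.0, 300.0, 720.0]},
        {"type": "paragraph", "content": "Accuracy improved.", "page number": 2,
         "bounding box": [72.0, 600.0, 500.0, 640.0], "kids": []},
    ]
}
citations = collect_citations(sample)
for c in citations:
    print(f"p.{c['page']} {c['bbox']}: {c['text']}")
```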

3. LangChain + RAG Pipeline

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = OpenDataLoaderPDFLoader(file_path="document.pdf", format="text")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# → embed → vector store → RAG
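
The final "embed → vector store → RAG" step depends on which embedding model and vector store are chosen, and the source names neither. As a dependency-free stand-in, the sketch below ranks chunks against a query using cosine similarity over term-frequency vectors. This is not an embedding model, only an illustration of the retrieve-top-k step. All names in it are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def term_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts, a crude substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Stand-in chunks; in a real pipeline these would be the page_content
# of the split documents.
chunks = [
    "Table extraction preserves rows and columns.",
    "OCR handles scanned pages.",
    "Reading order is reconstructed for multi-column layouts.",
]
top = retrieve("How are scanned pages handled?", chunks, k=1)
print(top[0][1])
```

In a real pipeline, `term_vector` and `cosine` would be replaced by an embedding model and a vector store's similarity search.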

CLI Commands

# Basic: single file or folder
opendataloader-pdf file1.pdf file2.pdf folder/

# Complex tables / nested structure (hybrid mode)
# Start the hybrid backend first:
opendataloader-pdf-hybrid --port 5002
# Then, in another terminal:
opendataloader-pdf --hybrid docling-fast file1.pdf

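# OCR for scanned PDFs (start the backend with OCR forced, then convert in another terminal)
opendataloader-pdf-hybrid --port 5002 --force-ocr
opendataloader-pdf --hybrid docling-fast file1.pdf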

# Math formula extraction (LaTeX)
opendataloader-pdf-hybrid --enrich-formula
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf

# Chart/image AI description
opendataloader-pdf-hybrid --enrich-picture-description
opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf

# Security: sanitize prompt injection
opendataloader-pdf file1.pdf --sanitize
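
When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below builds an `opendataloader-pdf` command from the flags shown above (`--hybrid`, `--hybrid-mode`, `--sanitize`) and nothing else. The helper `build_convert_command` is hypothetical, and the rule that `--hybrid-mode` requires `--hybrid` is an assumption based on how the two are always paired in the examples.

```python
import shlex
from typing import List, Optional, Sequence


def build_convert_command(
    inputs: Sequence[str],
    hybrid: Optional[str] = None,       # e.g. "docling-fast"
    hybrid_mode: Optional[str] = None,  # e.g. "full"
    sanitize: bool = False,
) -> List[str]:
    """Build an argv list for the opendataloader-pdf client CLI."""
    if not inputs:
        raise ValueError("at least one input file or folder is required")
    command = ["opendataloader-pdf"]
    if hybrid:
        command += ["--hybrid", hybrid]
    if hybrid_mode:
        # Assumption: the mode flag only makes sense with a hybrid backend.
        if not hybrid:
            raise ValueError("hybrid_mode requires hybrid to be set")
        command += ["--hybrid-mode", hybrid_mode]
    command += list(inputs)
    if sanitize:
        command.append("--sanitize")
    return command


full_cmd = build_convert_command(
    ["file1.pdf"], hybrid="docling-fast", hybrid_mode="full"
)
safe_cmd = build_convert_command(["file1.pdf"], sanitize=True)
print(shlex.join(full_cmd))
print(shlex.join(safe_cmd))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the CLI is installed. For hybrid runs, the backend must already be running in another process.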

Output Format Selection Guide

| Document Type | Recommended Format | Mode |
| --- | --- | --- |
| Standard digital PDF | markdown | Basic |
| Complex/nested tables | json | Hybrid |
| Scanned PDFs | any + --force-ocr | Hybrid |
| Math formulas | markdown + --enrich-formula | Hybrid |
| Charts needing description | markdown + --enrich-picture-description | Hybrid |
| Medical reports (cite-able) | json | Hybrid |
| RAG knowledge base | markdown | Basic or Hybrid |
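
The selection guide above can also be encoded as a lookup table so that a script picks the format and flags consistently. This is a minimal sketch: the dictionary keys (`"scanned"`, `"math_formulas"`, and so on) and the `Recommendation` type are hypothetical names, while the formats, flags, and modes are copied from the guide.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Recommendation(NamedTuple):
    output_format: Optional[str]  # None means any format is acceptable
    extra_flags: Tuple[str, ...]  # flags listed alongside the format in the guide
    mode: str


# Keys are illustrative labels for the document types in the guide.
GUIDE: Dict[str, Recommendation] = {
    "standard_digital": Recommendation("markdown", (), "basic"),
    "complex_tables": Recommendation("json", (), "hybrid"),
    "scanned": Recommendation(None, ("--force-ocr",), "hybrid"),
    "math_formulas": Recommendation("markdown", ("--enrich-formula",), "hybrid"),
    "charts": Recommendation(
        "markdown", ("--enrich-picture-description",), "hybrid"
    ),
    "medical_reports": Recommendation("json", (), "hybrid"),
    "rag_knowledge_base": Recommendation("markdown", (), "basic or hybrid"),
}


def recommend(document_type: str) -> Recommendation:
    """Look up the guide entry for a document type label."""
    try:
        return GUIDE[document_type]
    except KeyError:
        known = ", ".join(sorted(GUIDE))
        raise ValueError(
            f"unknown document type {document_type!r}; expected one of: {known}"
        ) from None


choice = recommend("scanned")
print(choice.output_format, choice.extra_flags, choice.mode)
```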

Key Reference Files


Benchmark Results (v2.0)

| Metric | Score |
| --- | --- |
| Overall Accuracy | 0.90 |
| Reading Order | 0.94 |
| Table Accuracy | 0.93 |
| Heading Accuracy | 0.83 |

License: Apache 2.0 | GitHub: opendataloader-project/opendataloader-pdf

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 16:06, security scan: safe
