Overview

ODL PDF to Markdown

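An open-source PDF parser that ranked #1 in the benchmark summarized in the Benchmark section (0.90 overall extraction accuracy on 200 real-world PDFs). It converts PDFs to Markdown, JSON, and HTML; the JSON output carries bounding-box coordinates for precise source citations in RAG pipelines.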

Why ODL PDF?

Feature	Others	ODL PDF
---------	--------	---------
Benchmark accuracy	0.75 avg	0.90
Table accuracy	0.70 avg	0.93
Bounding-box JSON	No	Yes
Hybrid AI mode	No	Yes
Built-in OCR	Extra setup	Yes
Multi-column layout	Basic	XY-Cut++
Self-hosted	Yes	Yes
API key needed	Often	No

Strong Points

#1 in benchmarks: 0.90 overall extraction accuracy and 0.93 table accuracy across 200 real-world PDFs
Bounding-box JSON: every element (heading, paragraph, table, image) carries page coordinates, font size, and page number for RAG citations
Hybrid AI mode: routes complex pages to an AI backend for charts, formulas, and borderless tables
Built-in OCR: 80+ languages; works on scanned PDFs at 300 DPI or higher
XY-Cut++ reading order: correct reading order for multi-column layouts such as academic papers
No API key: fully self-hosted and open source under Apache 2.0
Multiple formats: Markdown (clean, readable text), JSON (structured data), HTML (rich output that preserves layout)
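
To make the bounding-box point concrete, here is a minimal sketch of how a box might be mapped onto a rendered page image to highlight a citation. It assumes, and this is not confirmed anywhere above, that each box is [left, bottom, right, top] in PDF points (1/72 inch) with the origin at the bottom-left corner of the page. The function name, the page height argument, and the DPI default are illustrative and not part of the tool.

```python
from typing import Sequence, Tuple

POINTS_PER_INCH = 72.0  # default PDF user-space unit


def bbox_to_pixels(
    bbox: Sequence[float],
    page_height_pt: float,
    dpi: float = 150.0,
) -> Tuple[int, int, int, int]:
    """Map an ASSUMED [left, bottom, right, top] box in PDF points
    (bottom-left origin) to (x0, y0, x1, y1) pixels (top-left origin)."""
    left, bottom, right, top = bbox
    scale = dpi / POINTS_PER_INCH
    x0 = round(left * scale)
    x1 = round(right * scale)
    # Flip the vertical axis: image rows grow downward from the top edge.
    y0 = round((page_height_pt - top) * scale)
    y1 = round((page_height_pt - bottom) * scale)
    return x0, y0, x1, y1


# A one-inch square, one inch from the left and bottom edges of a
# US Letter page (792 pt tall), rendered at 72 DPI.
print(bbox_to_pixels([72.0, 72.0, 144.0, 144.0], page_height_pt=792.0, dpi=72.0))
```

If the boxes turn out to use a different origin or unit, only the two `y` lines and the scale factor need to change.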

Requirements

Java 11+ (symlink setup)

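OpenDataLoader requires Java. After installing Java, create a symlink in a directory on your PATH so that Python subprocesses can find it:

# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null

# Create the symlink (replace /path/to/java with the directory found above)
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java

# Confirm that java resolves; ~/.local/bin must be on your PATH
java -version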
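
It can help to confirm from Python that a suitable Java is visible before running a conversion. The sketch below uses only the standard library. The helper names are hypothetical, and the parsing assumes the usual `java -version` banner format (for example `openjdk version "11.0.20"`).

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(banner: str) -> Optional[int]:
    """Extract the major version from `java -version` output.

    Handles legacy ("1.8.0_381" -> 8) and modern ("11.0.20" -> 11) schemes.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    major = int(match.group(1))
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))  # legacy 1.x numbering
    return major


def java_is_usable(minimum: int = 11) -> bool:
    """Return True if `java` is on PATH and reports at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr or result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    print("Java 11+ found" if java_is_usable() else "Java 11+ not found on PATH")
```

`shutil.which` searches the same PATH that a Python subprocess inherits, so a negative result here points to the symlink or PATH setup rather than to the parser.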

Python 3.10+

pip install opendataloader-pdf

Or use the auto-install script (handles Java + Python automatically):

curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash

Usage

CLI

# Basic — markdown + json output
pdf2md document.pdf ./output

# HTML + JSON output
pdf2md document.pdf ./output html,json

# Markdown only
pdf2md document.pdf ./output markdown

Python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json"
)
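
For a folder of PDFs, the call above can be wrapped in a loop. This is a sketch under assumptions: it reuses only the `convert` keyword arguments shown above, and the helper name, the per-document output subfolder, and the injectable `convert` parameter (there so the loop can be exercised without Java) are illustrative choices rather than part of the library.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_folder(
    pdf_dir: str,
    output_dir: str,
    fmt: str = "markdown,json",
    convert: Optional[Callable[..., object]] = None,
) -> List[str]:
    """Convert every *.pdf directly under pdf_dir; return the paths processed."""
    if convert is None:
        # Imported lazily so the helper can be tested without the package.
        import opendataloader_pdf
        convert = opendataloader_pdf.convert

    processed = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        # One output subfolder per document keeps results from colliding.
        target = Path(output_dir) / pdf.stem
        target.mkdir(parents=True, exist_ok=True)
        convert(input_path=str(pdf), output_dir=str(target), format=fmt)
        processed.append(str(pdf))
    return processed
```

Calling `convert_folder("./pdfs", "./output")` would then write each document's results under `./output/<file stem>/`.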

Supported Input Formats

Type	Example	OCR Needed
------	---------	------------
Digital PDF	Text-based PDFs	No
Scanned PDF	Image-only scans	Yes (built-in)
Tagged PDF	Accessibility PDFs	No
Multi-column	Academic papers	No
Tables	Data reports	No

Output Formats

Markdown

Clean text with heading hierarchy, bullet lists, and paragraph structure.

JSON

{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
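
The structure above can be flattened into citation records for a RAG index. The sketch below reads a trimmed copy of the sample and uses only the keys that appear in it. That nested elements would also sit under a "kids" key is an assumption, and the function names are hypothetical.

```python
import json
from typing import Any, Dict, Iterator, List

# Trimmed copy of the sample output shown above.
SAMPLE = """
{
  "file name": "document.pdf",
  "kids": [
    {"type": "heading", "page number": 1,
     "bounding box": [100.0, 744.5, 404.0, 773.1],
     "content": "Document Title"},
    {"type": "paragraph", "page number": 1,
     "bounding box": [100.0, 676.8, 316.3, 713.0],
     "content": "Paragraph text..."}
  ]
}
"""


def iter_elements(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every element under `node`, depth-first.

    Assumes nested elements, if any, are also listed under "kids".
    """
    for child in node.get("kids", []):
        yield child
        yield from iter_elements(child)


def citation_index(doc: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Pair each piece of text with its location in the source PDF."""
    return [
        {
            "source": doc.get("file name"),
            "page": element["page number"],
            "bbox": element["bounding box"],
            "text": element["content"],
        }
        for element in iter_elements(doc)
        if "content" in element and "bounding box" in element
    ]


for record in citation_index(json.loads(SAMPLE)):
    print(record["page"], record["bbox"], record["text"])
```

Storing `source`, `page`, and `bbox` next to each chunk lets a retrieved passage be traced back to the exact region of the original page.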

Installation

OpenClaw Skill Install

clawhub install pdf-to-markdown

Manual Install

# Install dependencies
pip install opendataloader-pdf

# Make script executable
chmod +x scripts/pdf2md

Architecture

PDF Input
    │
    ▼
OpenDataLoader PDF (JVM)
    │
    ├── PDFBox    ──► Text extraction + layout analysis
    ├── veraPDF   ──► PDF validation + structure
    └── Tesseract ──► OCR (scanned PDFs)
    │
    ▼
Output: Markdown / JSON / HTML

Benchmark

OpenDataLoader ranked #1 against 9 other open-source and commercial PDF parsers on a test set of 200 real-world PDFs:

Metric	Score
--------	-------
Overall extraction accuracy	0.90
Table extraction accuracy	0.93
Processing speed (local mode)	0.05s/page

Common Use Cases

RAG pipelines — convert PDFs to chunkable markdown
Document parsing — extract text from research papers
Accessibility — convert PDFs to structured data
Data extraction — pull tables from reports
Content migration — PDF to markdown for wikis/docs
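
For the RAG use case, one simple way to chunk the Markdown output is to split it at headings. The sketch below is a generic approach and not something the tool provides. It assumes the output uses ATX-style headings (lines starting with `#`), and the function name is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# An ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading, keeping the heading as title."""
    # Text before the first heading is collected under an empty title.
    sections: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)

    chunks = []
    for title, lines in sections:
        text = "\n".join(lines).strip()
        if title or text:  # skip the empty preamble section
            chunks.append({"title": title, "text": text})
    return chunks


sample = "intro\n\n# A\nalpha\n\n## B\nbeta\n"
for chunk in chunk_by_heading(sample):
    print(repr(chunk["title"]), "->", repr(chunk["text"]))
```

This splitter does not track fenced code blocks, so a `#` comment line inside a fence would be treated as a heading; long sections may also need a further split by length.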

Version History

1 version in total

v1.1.0 (current)

2026-05-20 06:15

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
