An open-source PDF parser that ranked #1 in a benchmark against 9 other open-source and commercial parsers. Convert any PDF to Markdown, JSON, or HTML, with bounding-box coordinates for precise source citations in RAG pipelines.
| Feature | Others | ODL PDF |
|---|---|---|
| Benchmark accuracy | 0.75 avg | 0.90 |
| Table accuracy | 0.70 avg | 0.93 |
| Bounding-box JSON | No | Yes |
| Hybrid AI mode | No | Yes |
| Built-in OCR | Extra setup | Yes |
| Multi-column layout | Basic | XY-Cut++ |
| Self-hosted | Yes | Yes |
| API key needed | Often | No |
OpenDataLoader requires Java. After installing, create a symlink so Python subprocesses can find it:
# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null
# Create symlink (~/.local/bin must exist and be on your PATH)
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java
pip install opendataloader-pdf
Or use the auto-install script (handles Java + Python automatically):
curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash
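Because the converter runs on the JVM, it can be useful to confirm from Python that `java` is actually resolvable before converting anything. The following is a minimal sketch using only the standard library; the helper names `find_java` and `java_version` are hypothetical and not part of OpenDataLoader.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version(java_path):
    """Return the first line printed by `java -version`.

    Java writes its version banner to stderr, so stderr is checked first.
    """
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    lines = (result.stderr or result.stdout).strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_java()
    if path is None:
        print("java not found on PATH; check the symlink and your PATH setting")
    else:
        print(f"java found at {path}: {java_version(path)}")
```

If the check reports that `java` is missing even though the symlink exists, the likely cause is that `~/.local/bin` is not on the `PATH` seen by the Python process.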
# Basic — markdown + json output
pdf2md document.pdf ./output
# HTML + JSON output
pdf2md document.pdf ./output html,json
# Markdown only
pdf2md document.pdf ./output markdown
import opendataloader_pdf
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json"
)
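To convert a whole folder, the call above can be wrapped in a small loop. The sketch below is an illustration, not part of the package: `convert_folder` is a hypothetical helper, it assumes one output sub-directory per document is acceptable, and it only uses the `input_path`, `output_dir`, and `format` keyword arguments shown in the example above.

```python
from pathlib import Path


def convert_folder(pdf_dir, output_dir, fmt="markdown,json", convert=None):
    """Convert every *.pdf directly inside pdf_dir and return the inputs processed.

    `convert` defaults to opendataloader_pdf.convert; passing another callable
    makes the helper easy to dry-run without Java or the package installed.
    """
    if convert is None:
        # Imported lazily so the module loads even when the package is absent.
        import opendataloader_pdf

        convert = opendataloader_pdf.convert

    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    for pdf in pdfs:
        # One sub-directory per document keeps output files from colliding.
        target = Path(output_dir) / pdf.stem
        target.mkdir(parents=True, exist_ok=True)
        convert(input_path=str(pdf), output_dir=str(target), format=fmt)
    return pdfs
```

A call such as `convert_folder("./pdfs", "./output")` would then write each document's results under `./output/<document name>/`.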
| Type | Example | OCR Needed |
|---|---|---|
| Digital PDF | Text-based PDFs | No |
| Scanned PDF | Image-only scans | Yes (built-in) |
| Tagged PDF | Accessibility PDFs | No |
| Multi-column | Academic papers | No |
| Tables | Data reports | No |
Clean text with heading hierarchy, bullet lists, and paragraph structure.
{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
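As a rough illustration of how this JSON could feed source citations in a RAG pipeline, the sketch below walks the `kids` array and pairs each text element with its page and bounding box. Two assumptions are made that the sample does not confirm: that the four bounding-box numbers are ordered left, bottom, right, top, and that elements may nest further `kids`. The function names are hypothetical, and `SAMPLE` is a trimmed copy of the output shown above.

```python
import json

# Trimmed copy of the sample output above.
SAMPLE = """
{
  "file name": "document.pdf",
  "kids": [
    {"type": "heading", "page number": 1,
     "bounding box": [100.0, 744.5, 404.0, 773.1],
     "content": "Document Title"},
    {"type": "paragraph", "page number": 1,
     "bounding box": [100.0, 676.8, 316.3, 713.0],
     "content": "Paragraph text..."}
  ]
}
"""


def iter_elements(node):
    """Yield every element under `kids`, depth first (nesting is an assumption)."""
    for kid in node.get("kids", []):
        yield kid
        yield from iter_elements(kid)


def build_citations(doc):
    """Return one citation record per element that has text and a bounding box."""
    citations = []
    for element in iter_elements(doc):
        if "content" not in element or "bounding box" not in element:
            continue
        # Assumed order: left, bottom, right, top.
        left, bottom, right, top = element["bounding box"]
        citations.append({
            "source": doc.get("file name"),
            "type": element.get("type"),
            "page": element.get("page number"),
            "bbox": {"left": left, "bottom": bottom, "right": right, "top": top},
            "text": element["content"],
        })
    return citations


if __name__ == "__main__":
    for cite in build_citations(json.loads(SAMPLE)):
        print(cite["page"], cite["type"], cite["bbox"], cite["text"])
```

Storing these records alongside each retrieved chunk lets an answer point back to the exact page and region it came from.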
clawhub install pdf-to-markdown
# Install dependencies
pip install opendataloader-pdf
# Make script executable
chmod +x scripts/pdf2md
PDF Input
│
▼
OpenDataLoader PDF (JVM)
│
├── PDFBox ──► Text extraction + layout analysis
├── veraPDF ──► PDF validation + structure
└── Tesseract ──► OCR (scanned PDFs)
│
▼
Output: Markdown / JSON / HTML
OpenDataLoader ranked #1 against 9 other open-source and commercial PDF parsers on a test set of 200 real-world PDFs:
| Metric | Score |
|---|---|
| Overall extraction accuracy | 0.90 |
| Table extraction accuracy | 0.93 |
| Processing speed (local mode) | 0.05s/page |