ODL PDF to Markdown

Convert any PDF to Markdown, JSON, and HTML using OpenDataLoader — the #1 ranked open-source PDF parser. Extract text from digital PDFs, scanned PDFs with bu...
Author: adelpro
Source: clawhub, v1.1.0 (latest; 1 version). API key: not required.

Overview


The #1 ranked open-source PDF parser. Convert any PDF to Markdown, JSON, and HTML with bounding-box coordinates for precise source citations in RAG pipelines.

Why ODL PDF?

| Feature | Others | ODL PDF |
|---------|--------|---------|
| Benchmark accuracy | 0.75 avg | 0.90 |
| Table accuracy | 0.70 avg | 0.93 |
| Bounding-box JSON | No | Yes |
| Hybrid AI mode | No | Yes |
| Built-in OCR | Extra setup | Yes |
| Multi-column layout | Basic | XY-Cut++ |
| Self-hosted | Yes | Yes |
| API key needed | Often | No |

Strong Points

  • #1 in benchmarks: 0.90 overall extraction accuracy and 0.93 table accuracy across 200 real-world PDFs
  • Bounding-box JSON: every element (heading, paragraph, table, image) carries bounding-box coordinates for RAG citations
  • Hybrid AI mode: routes complex pages to an AI backend for charts, formulas, and borderless tables
  • Built-in OCR: 80+ languages; works with scanned PDFs at 300 DPI+
  • Tables: complex table extraction (0.93 accuracy)
  • XY-Cut++ reading order: correct reading order for multi-column layouts such as academic papers
  • No API key: fully self-hosted, open source under Apache 2.0
  • Multiple output formats:
      • Markdown: clean, readable text with correct reading order
      • JSON: structured data with bounding boxes, font sizes, and page numbers
      • HTML: rich output that preserves layout

Requirements

Java 11+ (symlink setup)

OpenDataLoader requires Java. After installing it, create a symlink in a directory on your PATH so that Python subprocesses can find the java binary:

# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null

# Create the symlink (replace /path/to/java with the path found above)
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java
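
To confirm that the symlink worked, it can help to check from Python itself, since that is the environment the converter's subprocess will run in. The following is a minimal sketch, not part of OpenDataLoader: the helper names `parse_java_major` and `check_java` are hypothetical, and the parsing assumes the usual `version "X.Y.Z"` banner printed by `java -version`.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: Java 8 and older report themselves as "1.8.0_xxx".
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def check_java(minimum: int = 11) -> bool:
    """Return True if a `java` binary on PATH reports at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # The version banner usually goes to stderr, so inspect both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum
```

Calling `check_java()` before a conversion gives a clear yes/no answer instead of a subprocess failure later on.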

Python 3.10+

pip install opendataloader-pdf

Or use the auto-install script (handles Java + Python automatically):

curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash

Usage

CLI

# Basic — markdown + json output
pdf2md document.pdf ./output

# HTML + JSON output
pdf2md document.pdf ./output html,json

# Markdown only
pdf2md document.pdf ./output markdown
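
The same CLI calls can be driven from a Python script. This is a rough sketch that assumes the `pdf2md` script is on your PATH and takes the positional arguments shown above (input PDF, output directory, optional comma-separated formats). The function names `build_pdf2md_command` and `run_pdf2md` are hypothetical helpers, not part of the skill.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Sequence


def build_pdf2md_command(
    pdf: str, output_dir: str, formats: Optional[Sequence[str]] = None
) -> List[str]:
    """Build the argument list for one pdf2md call."""
    command = ["pdf2md", str(pdf), str(output_dir)]
    if formats:
        # Formats are passed as a single comma-separated argument.
        command.append(",".join(formats))
    return command


def run_pdf2md(
    pdf: str, output_dir: str, formats: Optional[Sequence[str]] = None
) -> None:
    """Create the output directory, then run pdf2md and raise on failure."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pdf2md_command(pdf, output_dir, formats), check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.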

Python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json"
)
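
For a folder of PDFs, the call above can be wrapped in a loop. The sketch below assumes `convert` accepts exactly the keyword arguments used in the example above. `convert_folder` is a hypothetical helper, and the one-subfolder-per-PDF layout is an illustrative choice rather than anything the library requires.

```python
from pathlib import Path
from typing import Callable, List, Optional


def convert_folder(
    folder,
    output_dir,
    fmt: str = "markdown,json",
    convert: Optional[Callable[..., object]] = None,
) -> List[Path]:
    """Convert every PDF directly inside `folder`, one output subfolder each."""
    if convert is None:
        # Imported lazily so the helper can be exercised with a stand-in function.
        import opendataloader_pdf

        convert = opendataloader_pdf.convert

    pdfs = sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".pdf")
    for pdf in pdfs:
        target = Path(output_dir) / pdf.stem
        target.mkdir(parents=True, exist_ok=True)
        convert(input_path=str(pdf), output_dir=str(target), format=fmt)
    return pdfs
```

Writing each document into its own subfolder keeps the outputs of files with similar names from overwriting one another.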

Supported Input Formats

| Type | Example | OCR Needed |
|------|---------|------------|
| Digital PDF | Text-based PDFs | No |
| Scanned PDF | Image-only scans | Yes (built-in) |
| Tagged PDF | Accessibility PDFs | No |
| Multi-column | Academic papers | No |
| Tables | Data reports | No |

Output Formats

Markdown

Clean text with heading hierarchy, bullet lists, and paragraph structure.
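
Because the Markdown output keeps the heading hierarchy, it can be cut into chunks at headings before indexing. The following is a minimal sketch that assumes headings are emitted in ATX style (`#`, `##`, ...). `split_by_heading` is a hypothetical helper, and it does not treat fenced code blocks specially.

```python
import re
from typing import List, Tuple

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at ATX headings."""
    chunks: List[Tuple[str, str]] = []
    title = ""
    body: List[str] = []

    def flush() -> None:
        # Skip an empty leading chunk (no heading and no text yet).
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return chunks
```

Text that appears before the first heading is returned as a chunk with an empty heading, so nothing is dropped.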

JSON

{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
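
The JSON above can be turned into citation records for a RAG index. This sketch relies only on the keys visible in the sample (`file name`, `kids`, `type`, `page number`, `bounding box`, `content`). Nested `kids` inside elements and the helper names (`iter_elements`, `build_citations`, `load_document`) are assumptions made for illustration.

```python
import json
from typing import Dict, Iterator, List


def load_document(path: str) -> Dict:
    """Read one JSON output file produced by the converter."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def iter_elements(node: Dict) -> Iterator[Dict]:
    """Yield every element under `kids`, depth first."""
    for kid in node.get("kids", []):
        yield kid
        # Assumption: container elements may nest further `kids`.
        yield from iter_elements(kid)


def build_citations(document: Dict) -> List[Dict]:
    """Collect text elements together with their page and bounding box."""
    source = document.get("file name", "")
    citations = []
    for element in iter_elements(document):
        text = element.get("content")
        box = element.get("bounding box")
        if not text or not box:
            continue  # skip elements without text or coordinates
        citations.append(
            {
                "source": source,
                "page": element.get("page number"),
                "bbox": tuple(box),
                "type": element.get("type"),
                "text": text,
            }
        )
    return citations
```

Each record keeps the page number and box next to the text, so a retrieved chunk can point back to the exact region of the source PDF.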

Installation

OpenClaw Skill Install

clawhub install pdf-to-markdown

Manual Install

# Install dependencies
pip install opendataloader-pdf

# Make script executable
chmod +x scripts/pdf2md

Architecture

PDF Input
    │
    ▼
OpenDataLoader PDF (JVM)
    │
    ├── PDFBox    ──► Text extraction + layout analysis
    ├── veraPDF   ──► PDF validation + structure
    └── Tesseract ──► OCR (scanned PDFs)
    │
    ▼
Output: Markdown / JSON / HTML

Benchmark

OpenDataLoader ranked #1 against 9 other open-source and commercial PDF parsers on a test set of 200 real-world PDFs:

| Metric | Score |
|--------|-------|
| Overall extraction accuracy | 0.90 |
| Table extraction accuracy | 0.93 |
| Processing speed (local mode) | 0.05 s/page |
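
The quoted local-mode speed gives a quick way to size a batch job. This is a back-of-the-envelope sketch only: `estimate_seconds` is a hypothetical helper, and real throughput will depend on hardware and on whether pages need OCR or the hybrid AI mode.

```python
SECONDS_PER_PAGE = 0.05  # local-mode figure quoted in the benchmark table


def estimate_seconds(pages: int, seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimate total processing time for `pages` pages at a fixed rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * seconds_per_page
```

At that rate, a 200-page document works out to roughly 10 seconds.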

Common Use Cases

  • RAG pipelines — convert PDFs to chunkable markdown
  • Document parsing — extract text from research papers
  • Accessibility — convert PDFs to structured data
  • Data extraction — pull tables from reports
  • Content migration — PDF to markdown for wikis/docs

Version History

1 version in total

  • v1.1.0 (current)
    2026-05-20 06:15, security scan: safe
