PDF to Markdown

Convert any PDF to Markdown, JSON, and HTML using OpenDataLoader. Supports digital PDFs, scanned PDFs with OCR, and complex layouts with table extraction and...
adelpro
Uncategorized clawhub v1.0.0 100000 Key: not required

Overview

pdf-to-markdown

Convert any PDF to Markdown, JSON, or HTML using OpenDataLoader PDF — the #1 ranked open-source PDF parser.

Features

  • Markdown — clean readable text with correct reading order
  • JSON — structured data with bounding boxes, font sizes, page numbers
  • HTML — rich HTML output preserving layout
  • OCR — built-in OCR for scanned PDFs (80+ languages)
  • Tables — complex table extraction (0.93 accuracy)
  • Reading order — XY-Cut++ algorithm for multi-column layouts
  • No API key — fully self-hosted, open-source
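To give a feel for what the reading-order bullet refers to, here is a minimal sketch of the classic recursive XY-cut idea in Python. It is not XY-Cut++ and not OpenDataLoader's code; the function names, the `min_gap` threshold, and the (left, bottom, right, top) box layout are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumption: boxes are (left, bottom, right, top) with y growing upward.
Box = Tuple[float, float, float, float]


def widest_gap(boxes: List[Box], axis: str, min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap along `axis`, or None."""
    lo, hi = (0, 2) if axis == "x" else (1, 3)
    ordered = sorted(boxes, key=lambda b: b[lo])
    reach = ordered[0][hi]  # furthest coordinate covered so far
    best_width, best_cut = 0.0, None
    for box in ordered[1:]:
        width = box[lo] - reach
        if width >= min_gap and width > best_width:
            best_width, best_cut = width, reach + width / 2
        reach = max(reach, box[hi])
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 1.0) -> List[Box]:
    """Order boxes by recursively cutting at the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a vertical cut (columns) first, then a horizontal cut (bands).
    for axis in ("x", "y"):
        cut = widest_gap(boxes, axis, min_gap)
        if cut is None:
            continue
        lo = 0 if axis == "x" else 1
        before = [b for b in boxes if b[lo] < cut]
        after = [b for b in boxes if b[lo] >= cut]
        # Left comes before right; the upper band (larger y) before the lower.
        first, second = (before, after) if axis == "x" else (after, before)
        return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (-b[3], b[0]))
```

On a page with a full-width title above two columns, this orders the title first, then the whole left column, then the right column. The plain version can go wrong when the gap between text rows is wider than the gap under the title, which is the kind of case refined variants aim to handle.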

Requirements

Java 11+ (symlink setup)

OpenDataLoader requires Java. After installing, create a symlink so Python subprocesses can find it:

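# Find your Java install
ls ~/jdk-*/bin/java 2>/dev/null || ls /opt/jdk*/bin/java 2>/dev/null

# Create the symlink (replace /path/to/java with the directory found above).
# ~/.local/bin must exist and be on your PATH.
mkdir -p ~/.local/bin
ln -sf /path/to/java/bin/java ~/.local/bin/java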
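A quick way to confirm the symlink worked is to check, from Python, that a `java` binary is on `PATH` and reports version 11 or newer. The sketch below uses only the standard library; `parse_java_major` and `java_is_usable` are hypothetical helper names, and the version parsing assumes the usual `java -version` banner format.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_usable(minimum: int = 11) -> bool:
    """Return True if a `java` binary on PATH reports at least `minimum`."""
    java = shutil.which("java")
    if java is None:
        return False
    # `java -version` normally prints its banner to stderr, so read both streams.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr + result.stdout)
    return major is not None and major >= minimum
```

Calling `java_is_usable()` before a conversion lets a script report a missing or outdated Java up front.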

Python 3.10+

pip install opendataloader-pdf

Or use the auto-install script (handles Java + Python automatically):

curl -fsSL https://raw.githubusercontent.com/opendataloader-project/opendataloader-pdf/main/scripts/install.sh | bash

Usage

CLI

# Basic — markdown + json output
pdf2md document.pdf ./output

# HTML + JSON output
pdf2md document.pdf ./output html,json

# Markdown only
pdf2md document.pdf ./output markdown

Python

import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="./output",
    format="markdown,json"
)
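If the format list is built at runtime, a thin wrapper can validate it before calling the library. This is a sketch only: it assumes `convert` accepts exactly the keyword arguments shown above, and `build_format_arg`, `convert_pdf`, and the `converter` parameter are hypothetical names introduced here, not part of OpenDataLoader.

```python
from typing import Callable, Iterable, Optional, Union

# Format names taken from the examples in this README.
KNOWN_FORMATS = ("markdown", "json", "html")


def build_format_arg(formats: Union[str, Iterable[str]]) -> str:
    """Validate format names and join them into a comma-separated string."""
    if isinstance(formats, str):
        formats = formats.split(",")
    cleaned = [name.strip().lower() for name in formats if name.strip()]
    if not cleaned:
        raise ValueError("at least one format is required")
    unknown = [name for name in cleaned if name not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unsupported format(s): {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(cleaned))


def convert_pdf(
    input_path: str,
    output_dir: str,
    formats: Union[str, Iterable[str]] = ("markdown", "json"),
    converter: Optional[Callable[..., object]] = None,
) -> None:
    """Validate `formats`, then hand off to the converter."""
    format_arg = build_format_arg(formats)
    if converter is None:
        # Imported lazily so the helper can be tested without the package.
        import opendataloader_pdf
        converter = opendataloader_pdf.convert
    converter(input_path=input_path, output_dir=output_dir, format=format_arg)
```

For example, `convert_pdf("document.pdf", "./output", ["html", "json"])` passes `format="html,json"`, while a misspelled name such as `"markdwn"` raises `ValueError` before anything is converted.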

Supported Input Formats

| Type         | Example            | OCR Needed     |
|--------------|--------------------|----------------|
| Digital PDF  | Text-based PDFs    | No             |
| Scanned PDF  | Image-only scans   | Yes (built-in) |
| Tagged PDF   | Accessibility PDFs | No             |
| Multi-column | Academic papers    | No             |
| Tables       | Data reports       | No             |

Output Formats

Markdown

Clean text with heading hierarchy, bullet lists, and paragraph structure.

JSON

{
  "file name": "document.pdf",
  "number of pages": 5,
  "author": "Author Name",
  "kids": [
    {
      "type": "heading",
      "level": "Doctitle",
      "page number": 1,
      "bounding box": [100.0, 744.5, 404.0, 773.1],
      "font": "Helvetica-Bold",
      "font size": 24.0,
      "content": "Document Title"
    },
    {
      "type": "paragraph",
      "page number": 1,
      "bounding box": [100.0, 676.8, 316.3, 713.0],
      "font": "Helvetica",
      "font size": 14.0,
      "content": "Paragraph text..."
    }
  ]
}
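The JSON structure above can be walked with a few lines of Python. The sketch below collects headings from a document shaped like the sample; treating nested elements as also using a `kids` key is an assumption, and `iter_elements` and `headings` are hypothetical helper names.

```python
import json
from typing import Any, Dict, Iterator, List, Tuple

# A trimmed copy of the sample output shown above.
SAMPLE = """
{
  "file name": "document.pdf",
  "number of pages": 5,
  "kids": [
    {"type": "heading", "level": "Doctitle", "page number": 1,
     "content": "Document Title"},
    {"type": "paragraph", "page number": 1,
     "content": "Paragraph text..."}
  ]
}
"""


def iter_elements(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield every element under `node`, depth first, in document order."""
    for kid in node.get("kids", []):
        yield kid
        # Assumption: nested elements, if any, reuse the "kids" key.
        yield from iter_elements(kid)


def headings(doc: Dict[str, Any]) -> List[Tuple[Any, str]]:
    """Return (page number, text) for each heading element."""
    return [
        (el.get("page number"), el.get("content", ""))
        for el in iter_elements(doc)
        if el.get("type") == "heading"
    ]


doc = json.loads(SAMPLE)
outline = headings(doc)
```

The same functions work on a file loaded with `json.load`; the exact name of the JSON file written to the output directory is not stated here, so check the directory after a conversion.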

Installation

OpenClaw Skill Install

clawhub install pdf-to-markdown

Manual Install

# Install dependencies
pip install opendataloader-pdf

# Make script executable
chmod +x scripts/pdf2md

Architecture

PDF Input
    │
    ▼
OpenDataLoader PDF (JVM)
    │
    ├── PDFBox    ──► Text extraction + layout analysis
    ├── veraPDF   ──► PDF validation + structure
    └── Tesseract ──► OCR (scanned PDFs)
    │
    ▼
Output: Markdown / JSON / HTML

Benchmark

| Metric                      | Score      |
|-----------------------------|------------|
| Overall extraction accuracy | 0.90       |
| Table extraction accuracy   | 0.93       |
| Processing speed (local)    | 0.05s/page |

Benchmarks were run on 200 real-world PDFs, including multi-column layouts and scientific papers.
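As a rough guide, the quoted speed of 0.05s/page puts a 200-page document at about 10 seconds. The helper below just does that multiplication; `estimate_seconds` is a hypothetical name, and since it is not stated whether the figure includes JVM start-up or OCR, treat the result as a ballpark only.

```python
def estimate_seconds(pages: int, seconds_per_page: float = 0.05) -> float:
    """Rough conversion time at the quoted local processing speed."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * seconds_per_page
```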

Common Use Cases

  • RAG pipelines — convert PDFs to chunkable markdown
  • Document parsing — extract text from research papers
  • Accessibility — convert PDFs to structured data
  • Data extraction — pull tables from reports
  • Content migration — PDF to markdown for wikis/docs
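For the RAG use case, the Markdown output can be split into chunks at heading boundaries. The sketch below is one simple approach, assuming the output uses `#`-style headings; `chunk_by_heading` is a hypothetical helper, and it does not special-case `#` lines inside code blocks.

```python
import re
from typing import List, Tuple

# Matches "# Title" through "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at heading lines."""
    chunks: List[Tuple[str, str]] = []
    title, body = "", []

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Text before the first heading is kept as a chunk with an empty heading, so nothing from the document is dropped.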

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-20 06:17 Safe
