
OpenDataLoader PDF Parser (Cuttlefish Edition)

PDF parsing tool for AI/RAG. Convert PDF to Markdown, JSON, HTML with layout preservation, bounding boxes, and image extraction. Use when you need to extract...
Author: wtjjacobj
Source: clawhub, v1.0.0 (1 version). Key: not required.

Overview

opendataloader-pdf Skill

PDF parsing tool for AI/RAG scenarios. Converts PDF to Markdown, JSON, HTML with layout preservation.

Installation

pipx install opendataloader-pdf

Requires a Java runtime to be installed separately; the tool's JAR is bundled with the package.

Quick Usage

# PDF to Markdown (most common)
opendataloader-pdf input.pdf -o output_dir -f markdown

# PDF to JSON (with bounding boxes)
opendataloader-pdf input.pdf -o output_dir -f json

# Multiple formats at once
opendataloader-pdf input.pdf -o output_dir -f json,markdown,html

# Extract specific pages
opendataloader-pdf input.pdf -o output_dir -f markdown --pages "1,3,5-10"

# Extract images
opendataloader-pdf input.pdf -o output_dir -f markdown --image-dir images/

# Use PDF structure tree (for tagged PDFs)
opendataloader-pdf input.pdf -o output_dir -f markdown --use-struct-tree

# Output to stdout
opendataloader-pdf input.pdf -f markdown --to-stdout

Output Formats

| Format | Description |
|--------|-------------|
| json | Structured JSON with bounding boxes, fonts, reading order |
| markdown | Markdown text with images as references |
| html | HTML with styling |
| text | Plain text |
| pdf | Rebuilt PDF |
| markdown-with-html | Markdown with HTML for complex elements |
| markdown-with-images | Markdown with embedded base64 images |
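
Since `-f` takes a comma-separated list (as in `json,markdown,html`), a small pre-flight check can catch a mistyped format name before the tool is launched. The sketch below is a hypothetical helper, not part of opendataloader-pdf. Its set of names is copied from the table above and may not match every version of the CLI.

```python
# Format names copied from the table above; the real CLI may accept more or fewer.
KNOWN_FORMATS = {
    "json", "markdown", "html", "text", "pdf",
    "markdown-with-html", "markdown-with-images",
}


def parse_formats(spec: str) -> list[str]:
    """Split a comma-separated -f value and reject names not in KNOWN_FORMATS."""
    names = [part.strip() for part in spec.split(",") if part.strip()]
    if not names:
        raise ValueError("no output format given")
    unknown = [name for name in names if name not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown format(s): {', '.join(unknown)}")
    return names
```

For example, `parse_formats("json,markdown,html")` returns the three names in order, while `parse_formats("json,docx")` raises `ValueError`.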

Key Options

| Option | Description |
|--------|-------------|
| --pages | Page range, e.g., "1,3,5-10" |
| --image-dir | Directory for extracted images |
| --use-struct-tree | Use PDF structure tree for reading order |
| --table-method | Table detection: default (border-based) or cluster |
| --reading-order | Algorithm: off or xycut (default) |
| --hybrid | Hybrid AI mode: docling-fast for complex tables |
| --sanitize | Remove sensitive data (emails, phones, etc.) |
| --include-header-footer | Include page headers/footers |
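
When the CLI is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch that shells out to the `opendataloader-pdf` command using only flags listed above. The function names `build_command` and `run` are hypothetical wrappers, not part of the tool.

```python
import subprocess
from typing import Optional, Sequence


def build_command(
    inputs: Sequence[str],
    output_dir: str,
    formats: str = "markdown",
    pages: Optional[str] = None,
    image_dir: Optional[str] = None,
    use_struct_tree: bool = False,
    sanitize: bool = False,
) -> list[str]:
    """Assemble an opendataloader-pdf argument list from the options listed above."""
    if not inputs:
        raise ValueError("at least one input path is required")
    cmd = ["opendataloader-pdf", *inputs, "-o", output_dir, "-f", formats]
    if pages:
        cmd += ["--pages", pages]
    if image_dir:
        cmd += ["--image-dir", image_dir]
    if use_struct_tree:
        cmd.append("--use-struct-tree")
    if sanitize:
        cmd.append("--sanitize")
    return cmd


def run(cmd: Sequence[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)
```

A call such as `run(build_command(["input.pdf"], "output_dir", pages="1,3,5-10"))` would then be equivalent to the "Extract specific pages" command in Quick Usage, assuming the tool is installed and on the PATH.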

Examples

Basic Conversion

# Convert to markdown
opendataloader-pdf document.pdf -o ./output -f markdown

# Convert to JSON with structure
opendataloader-pdf document.pdf -o ./output -f json --use-struct-tree

Batch Processing

# Multiple files
opendataloader-pdf "file1.pdf" "file2.pdf" "folder/" -o output/

# All PDFs in directory
opendataloader-pdf ./pdfs/ -o ./output/ -f markdown

Advanced Options

# Use AI hybrid mode for complex tables
opendataloader-pdf input.pdf -o output/ -f markdown --hybrid docling-fast

# Extract only pages 1-5
opendataloader-pdf input.pdf -o output/ -f markdown --pages "1-5"

# Sanitize sensitive data
opendataloader-pdf input.pdf -o output/ -f json --sanitize
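
To make the `--pages` syntax concrete, here is a sketch of how a spec such as "1,3,5-10" could be expanded into individual page numbers. It assumes pages are 1-based and ranges are inclusive, which is an assumption about the tool's behavior rather than its documented parsing. `expand_pages` is a hypothetical helper.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a spec like "1,3,5-10" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges (not confirmed by the tool's docs).
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

Under these assumptions, `expand_pages("1,3,5-10")` yields `[1, 3, 5, 6, 7, 8, 9, 10]`, and `expand_pages("1-5")` yields `[1, 2, 3, 4, 5]`.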

Performance Notes

  • Each convert() call spawns a JVM process
  • For batch processing, pass multiple files in one call
  • About 6 seconds for a typical 300-page PDF
  • Images extracted to {output_name}_images/ directory
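
The first two notes suggest one invocation per batch rather than one per file. The sketch below contrasts the two approaches by building the argument lists only. Both function names are hypothetical, and nothing is executed here.

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def batch_command(inputs: Iterable[PathLike], output_dir: PathLike,
                  formats: str = "markdown") -> list[str]:
    """One invocation covering every input, so the JVM starts only once."""
    paths = [str(p) for p in inputs]
    if not paths:
        raise ValueError("at least one input path is required")
    return ["opendataloader-pdf", *paths, "-o", str(output_dir), "-f", formats]


def per_file_commands(inputs: Iterable[PathLike], output_dir: PathLike,
                      formats: str = "markdown") -> list[list[str]]:
    """The slower alternative: one invocation (and one JVM start) per input."""
    return [batch_command([p], output_dir, formats) for p in inputs]
```

For three PDFs, `per_file_commands` produces three commands and therefore three JVM start-ups, while `batch_command` produces a single command covering all three files.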

Troubleshooting

Java not found

Ensure a Java runtime is installed and available on your PATH. The tool bundles its own PDFBox JAR, so no separate JAR download is needed.
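
A quick way to confirm the prerequisite from a script is to check whether the executables can be found on the PATH. This sketch uses the standard library's `shutil.which`. The helper name `missing_tools` is hypothetical, and the executable names `java` and `opendataloader-pdf` are assumed to match your installation.

```python
import shutil
from typing import Sequence


def missing_tools(required: Sequence[str] = ("java", "opendataloader-pdf")) -> list[str]:
    """Return the names from `required` that cannot be found on the PATH."""
    return [name for name in required if shutil.which(name) is None]
```

An empty result from `missing_tools()` means both executables were found. Otherwise the returned names indicate what still needs to be installed or added to the PATH.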

Font warnings

Warnings about missing fonts are normal and don't affect output quality.

Slow performance

Use batch mode (multiple files in one call) instead of calling repeatedly.

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 12:17, security status: safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
