PaddleOCR Document Parsing

Use this skill to extract structured Markdown/JSON from PDFs and document images: tables with cell-level precision, formulas as LaTeX, plus figures, seals, charts, and similar elements.
bobholamovic
Data Analysis · clawhub · v3.0.0 (#latest) · 9 versions · 97252.5 · Key: required
★ 48 stars · 📥 14,190 downloads · 💾 1,731 installs

Overview

When to Use This Skill

Use this skill for:

  • Documents with tables (invoices, financial reports, spreadsheets)
  • Documents with mathematical formulas (academic papers, scientific documents)
  • Documents with charts and diagrams
  • Multi-column layouts (newspapers, magazines, brochures)
  • Complex document structures requiring layout analysis

Usage

Basic Document Parsing

From URL:

paddleocr api \
  --model_type doc_parsing \
  --file_url "https://example.com/report.pdf"

From local file:

paddleocr api \
  --model_type doc_parsing \
  --file_path "./document.pdf"
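
When the command is assembled by a script rather than typed by hand, the choice between the two source flags can be made automatically. The following is a minimal Python sketch, not part of the skill itself: the helper name `build_doc_parsing_command` is hypothetical, and treating only `http`/`https` inputs as URLs is an assumption. It uses only the flags shown above.

```python
from urllib.parse import urlparse


def build_doc_parsing_command(source: str) -> list[str]:
    """Return an argv list for `paddleocr api` in doc_parsing mode.

    Hypothetical helper. Assumption: http/https inputs are URLs,
    anything else is treated as a local file path.
    """
    scheme = urlparse(source).scheme.lower()
    source_flag = "--file_url" if scheme in ("http", "https") else "--file_path"
    return ["paddleocr", "api", "--model_type", "doc_parsing", source_flag, source]


print(build_doc_parsing_command("https://example.com/report.pdf"))
print(build_doc_parsing_command("./document.pdf"))
```

The returned list can be passed to `subprocess.run` without shell quoting concerns.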

Common Options

# With specific model
paddleocr api \
  --model_type doc_parsing \
  --model PP-StructureV3 \
  --file_path "./report.pdf"

# Disable preprocessing (faster, for flat/well-oriented images)
paddleocr api \
  --model_type doc_parsing \
  --file_path "./document.pdf" \
  --use_doc_unwarping False \
  --use_doc_orientation_classify False

# With page ranges
paddleocr api \
  --model_type doc_parsing \
  --file_path "./large.pdf" \
  --page_ranges "1-5,10,15-20"

# Save result and resources
paddleocr api \
  --model_type doc_parsing \
  --file_url "https://..." \
  --output result.json \
  --save_resources ./resources

# Prettify markdown output
paddleocr api \
  --model_type doc_parsing \
  --file_path "./document.pdf" \
  --prettify_markdown True
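
Before passing a `--page_ranges` value such as "1-5,10,15-20", it can help to check which pages the string covers. The sketch below is a standalone Python helper with a hypothetical name, `expand_page_ranges`. It assumes 1-based page numbers and inclusive ranges, which this document does not state explicitly, so the CLI's own parsing rules may differ.

```python
def expand_page_ranges(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into a sorted list of pages.

    Assumption: pages are 1-based and ranges are inclusive.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page range spec: {spec!r}")
        start_text, dash, end_text = part.partition("-")
        start = int(start_text)  # raises ValueError on non-numeric input
        end = int(end_text) if dash else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(expand_page_ranges("1-5,10,15-20"))
```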

Output Format

{
  "jobId": "job-xxx",
  "pages": [
    {
      "markdownText": "# Title\n\nContent...",
      "markdownImages": {
        "img1": "https://...",
        "img2": "https://..."
      },
      "outputImages": {
        "layout1": "https://..."
      }
    }
  ]
}
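
A result with this shape can be post-processed in a few lines. The Python sketch below joins the Markdown of all pages and collects the image references. It assumes only the field names shown above (`pages`, `markdownText`, `markdownImages`); the function names and the sample data are illustrative.

```python
def combined_markdown(result: dict) -> str:
    """Join the markdownText of every page, separated by blank lines."""
    pages = result.get("pages", [])
    return "\n\n".join(page.get("markdownText", "") for page in pages)


def image_urls(result: dict) -> dict[str, str]:
    """Merge markdownImages from all pages (later pages win on key clashes)."""
    merged: dict[str, str] = {}
    for page in result.get("pages", []):
        merged.update(page.get("markdownImages", {}))
    return merged


# Illustrative sample data, not real API output.
sample = {
    "jobId": "job-xxx",
    "pages": [
        {"markdownText": "# Title\n\nFirst page",
         "markdownImages": {"img1": "https://example.com/img1.png"}},
        {"markdownText": "Second page", "markdownImages": {}},
    ],
}

print(combined_markdown(sample))
print(image_urls(sample))
```

If the file written by `--output result.json` has the same shape (an assumption), it can be loaded with `json.load` and passed to these functions unchanged.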

Important Notes

Preprocessing options: For flat, well-oriented images (screenshots, properly scanned documents), you can disable preprocessing for faster results:

paddleocr api --model_type doc_parsing --file_path "./document.pdf" --use_doc_unwarping False --use_doc_orientation_classify False

Keep preprocessing enabled when:

  • The input is a photo of a curved or folded document
  • The document has significant perspective distortion
  • Orientation is uncertain (rotated 90/180/270 degrees)
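
The rule above can be written as a small decision helper. This is a sketch with hypothetical function names. It assumes that preprocessing is on by default, as the "disable" wording suggests, so the flags are only emitted when all three conditions are ruled out.

```python
def needs_preprocessing(curved_or_folded: bool,
                        perspective_distortion: bool,
                        orientation_uncertain: bool) -> bool:
    """True if any of the three listed conditions applies."""
    return curved_or_folded or perspective_distortion or orientation_uncertain


def preprocessing_flags(keep_enabled: bool) -> list[str]:
    """Return extra CLI arguments; empty means 'leave the defaults alone'.

    Assumption: unwarping and orientation classification are enabled
    unless explicitly set to False.
    """
    if keep_enabled:
        return []
    return ["--use_doc_unwarping", "False",
            "--use_doc_orientation_classify", "False"]


# A clean screenshot: none of the conditions apply, so preprocessing is disabled.
print(preprocessing_flags(needs_preprocessing(False, False, False)))
```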

Display complete results: Always show the full extracted content to users. Do not truncate with "..." unless the content exceeds 10,000 characters. When multiple pages are processed, you may summarize, but provide the complete results whenever the user explicitly requests them.
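
One way to apply the 10,000-character rule consistently is a small display helper. The Python sketch below is illustrative only: the function name and the wording of the truncation notice are assumptions, and only the threshold comes from the note above.

```python
MAX_DISPLAY_CHARS = 10_000  # threshold stated in the note above


def text_for_display(markdown: str, limit: int = MAX_DISPLAY_CHARS) -> str:
    """Return the text unchanged unless it exceeds the limit.

    When truncating, say so explicitly and report how much was omitted
    instead of ending with a bare "...".
    """
    if len(markdown) <= limit:
        return markdown
    omitted = len(markdown) - limit
    notice = f"\n\n[truncated: {omitted} more characters available on request]"
    return markdown[:limit] + notice


print(text_for_display("# Title\n\nShort content"))
```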

Handle errors gracefully: When the CLI returns an error, inform the user of the specific issue rather than silently failing. Common errors:

  • Authentication: PADDLEOCR_ACCESS_TOKEN invalid or missing
  • Quota: API rate limit exceeded
  • No content detected: Document may be blank or contain no extractable text
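
A wrapper script can turn raw CLI error output into one of the messages listed above. The sketch below is hedged heavily: the exact error text produced by the CLI is not documented here, so the keyword lists are hypothetical guesses and should be adjusted against real output. Unknown errors are passed through verbatim rather than hidden.

```python
# Hypothetical keyword lists; the real CLI error text may differ.
ERROR_HINTS = [
    (("token", "unauthorized", "authentication"),
     "Authentication: PADDLEOCR_ACCESS_TOKEN is invalid or missing."),
    (("quota", "rate limit"),
     "Quota: the API rate limit was exceeded."),
    (("no content",),
     "No content detected: the document may be blank or contain no extractable text."),
]


def describe_cli_error(stderr: str) -> str:
    """Map CLI error output to a user-facing message."""
    lowered = stderr.lower()
    for keywords, message in ERROR_HINTS:
        if any(keyword in lowered for keyword in keywords):
            return message
    # Never fail silently: surface the original text for unknown errors.
    return f"Unrecognized error: {stderr.strip()}"


print(describe_cli_error("Error: rate limit exceeded"))
```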

CLI Reference

Run paddleocr api --help for all options.

For full documentation, see: PaddleOCR Official Documentation

Version History

9 versions in total

  • v3.0.0 (current)
    2026-06-07 05:15
  • v2.0.16
    2026-04-30 06:05 Safe · Safe
  • v2.0.10
    2026-03-27 23:49 Safe · Safe
  • v2.0.4
    2026-03-26 21:15
  • v2.0.7
    2026-03-17 23:05
  • v2.0.5
    2026-03-14 00:37
  • v2.0.1
    2026-03-11 10:57
  • v1.0.3
    2026-03-11 09:33
  • v1.0.2
    2026-03-07 11:39

Security Scan

Tencent Cloud Security (Keen)

Queued

Tencent Cloud Security (Sanbu)

Queued
