← 返回
未分类 Key

Doc to JSON

Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe...
Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe...
kounlong
未分类 clawhub v1.0.0 1 版本 100000 Key: 需要
★ 0
Stars
📥 297
下载
💾 0
安装
1
版本
#latest

概述

Doc to JSON

Convert office documents to structured JSON using MinerU as the extraction engine.

Supported Formats

  • .doc / .docx — Word documents
  • .pdf — PDF files
  • .xlsx / .xls — Excel spreadsheets

Prerequisites

  • mineru-open-api CLI must be installed (v0.5+)
  • MINERU_TOKEN environment variable must be set
  • Check: mineru-open-api version

Quick Usage

# Full pipeline: document -> MinerU Markdown -> JSON
python3 scripts/doc_to_json.py /path/to/file.docx -o output.json

# Keep temp files for debugging
python3 scripts/doc_to_json.py /path/to/file.pdf -o out.json --keep-temp

Manual Two-Step Pipeline

If the full pipeline script fails, run steps manually:

Step 1: MinerU Extract

export MINERU_TOKEN="your_token"
mineru-open-api extract input_file.pdf -o /tmp/mineru_out/

Output: .md file in the output directory.

Step 2: Markdown -> JSON

python3 scripts/markdown_to_json.py /tmp/mineru_out/output.md -o output.json

JSON Structure

The output JSON preserves:

  • Metadata fields — course name, code, credits, hours, etc. (extracted from plain text)
  • Heading hierarchy — 一、二、三... sections become nested keys
  • Tables — stored as array of arrays (row cells), keyed as "表格"
  • Numbered lists — stored as array of strings under section title
  • Paragraph text — merged into "text" field per section

For Knowledge Base Preparation

After JSON conversion, common next steps:

  1. Chunk by section — split the JSON into per-section documents for embedding
  2. Table extraction — convert "表格" arrays to flattened rows for database import
  3. Metadata extraction — pull course code, name, etc. as document metadata
  4. Embedding — feed cleaned text chunks into vector database

See references/kb-prep.md for detailed KB preparation patterns.

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-05-08 00:42 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

developer-tools

Github

steipete
使用 `gh` CLI 与 GitHub 交互,通过 `gh issue`、`gh pr`、`gh run` 和 `gh api` 管理议题、PR、CI 运行及高级查询。
★ 666 📥 323,779
ai-intelligence

self-improving agent

pskoett
捕获经验教训、错误和纠正,以实现持续改进。使用时机:(1)命令或操作意外失败;(2)用户纠正……
★ 4,055 📥 795,764
ai-intelligence

Self-Improving + Proactive Agent

ivangdavila
自我反思+自我批评+自我学习+自组织记忆。智能体评估自身工作、发现错误并持续改进。
★ 1,349 📥 317,690