Convert office documents to structured JSON using MinerU as the extraction engine.
Supported input formats:
.doc / .docx: Word documents
.pdf: PDF files
.xlsx / .xls: Excel spreadsheets
mineru-open-api version
# Full pipeline: document -> MinerU Markdown -> JSON
python3 scripts/doc_to_json.py /path/to/file.docx -o output.json
# Keep temp files for debugging
python3 scripts/doc_to_json.py /path/to/file.pdf -o out.json --keep-temp
If the full pipeline script fails, run steps manually:
export MINERU_TOKEN="your_token"
mineru-open-api extract input_file.pdf -o /tmp/mineru_out/
Output: .md file in the output directory.
python3 scripts/markdown_to_json.py /tmp/mineru_out/output.md -o output.json
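The manual steps above could also be driven from Python. This is a minimal sketch, not part of the provided scripts: `manual_commands` and `run_manual` are hypothetical helper names, and the Markdown file name `output.md` is taken from the example command and may differ for your input.

```python
import os
import subprocess
from pathlib import Path


def manual_commands(input_file, out_dir, json_out):
    """Build the two manual-step commands as argument lists."""
    out_dir = Path(out_dir)
    # Step 1: MinerU extraction, same arguments as the shell example.
    extract = ["mineru-open-api", "extract", str(input_file), "-o", f"{out_dir}/"]
    # Step 2: Markdown -> JSON.
    # Assumption: the Markdown file is named output.md, as in the example.
    convert = [
        "python3",
        "scripts/markdown_to_json.py",
        str(out_dir / "output.md"),
        "-o",
        str(json_out),
    ]
    return [extract, convert]


def run_manual(input_file, out_dir, json_out):
    """Run both steps in order, stopping at the first failure."""
    if "MINERU_TOKEN" not in os.environ:
        raise RuntimeError("MINERU_TOKEN is not set")
    for cmd in manual_commands(input_file, out_dir, json_out):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Building the commands separately from running them makes it easy to print and inspect them before anything is executed.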
The output JSON preserves:
"表格""text" field per sectionAfter JSON conversion, common next steps:
"表格" arrays to flattened rows for database importSee references/kb-prep.md for detailed KB preparation patterns.