This tool builds a high-precision document parsing pipeline: it uses GLM-OCR to extract layout elements, GLM-4.7 to interpret table data logically, and GLM-4.6V to interpret images and charts visually.
This Skill consists of two core script stages, orchestrated through glm_ocr_pipeline.py:
- Extraction stage (scripts/glm_ocr_extract.py)
- Understanding stage (scripts/glm_understanding.py)

# Run the complete pipeline: extraction -> cropping -> understanding analysis; supports .pdf, .jpg, .png and other input formats
python scripts/glm_ocr_pipeline.py \
--file_path "/data/report_page.jpg" \
--output_dir "/data/output"
| Parameter | Type | Required | Description |
|---|---|---|---|
| file_path | string | ✅ | Absolute path to input file (supports .pdf, .png, .jpg) |
| output_dir | string | ✅ | Result output directory (used to save cropped images and JSON reports) |
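The two parameters above could be parsed and checked along the following lines. This is a minimal sketch, not the actual argument handling in glm_ocr_pipeline.py: the `parse_args` helper is hypothetical, and the only rule it enforces is the absolute-path requirement stated in the table.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two documented parameters (hypothetical helper, for illustration)."""
    parser = argparse.ArgumentParser(
        description="Illustrative argument parsing for the GLM-OCR pipeline"
    )
    parser.add_argument("--file_path", required=True,
                        help="Absolute path to the input file (.pdf, .png, .jpg)")
    parser.add_argument("--output_dir", required=True,
                        help="Directory for cropped images and JSON reports")
    args = parser.parse_args(argv)

    # The parameter table asks for an absolute path, so reject relative ones early.
    if not Path(args.file_path).is_absolute():
        parser.error("--file_path must be an absolute path")
    return args


if __name__ == "__main__":
    parsed = parse_args(["--file_path", "/data/report_page.jpg",
                         "--output_dir", "/data/output"])
    print(parsed.file_path, parsed.output_dir)
```

Failing fast on a relative path keeps the error close to the command line instead of surfacing later, when a cropped image cannot be written.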
The tool returns a list of layout elements, each with its extracted content and a deep-understanding interpretation:
[
{
"type": "table",
"bbox": [100, 200, 500, 600],
"content_info": "| Revenue | Q1 |\n|---|---|\n| 100M | ... |",
"deep_understanding": "(Generated by GLM-4.7) This table shows Q1 2024 revenue data. Combined with the 'market expansion strategy' mentioned in paragraph 3 of the body text, it can be seen that..."
},
{
"type": "image",
"bbox": [100, 700, 500, 900],
"content_info": "/data/output/images/report_page_img_2.png",
"deep_understanding": "(Generated by GLM-4.6V) This is a system architecture diagram. Visually, it shows the flow of clients connecting to servers through a Load Balancer. Combined with the title 'Fig 3' and context, this diagram is mainly used to illustrate..."
}
]
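A result list of this shape can be consumed with a few lines of Python. The sketch below groups elements by `type` and computes each bounding box's size. It assumes `bbox` is `[x1, y1, x2, y2]`, which the sample values suggest but the source does not state. The helper names and the report path passed to `load_elements` are hypothetical.

```python
import json
from collections import defaultdict


def load_elements(report_path):
    """Load the element list from a JSON report (path is supplied by the caller)."""
    with open(report_path, encoding="utf-8") as handle:
        return json.load(handle)


def group_by_type(elements):
    """Group layout elements by their "type" field, e.g. "table" or "image"."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element["type"]].append(element)
    return dict(grouped)


def bbox_size(bbox):
    """Return (width, height), assuming bbox is [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


if __name__ == "__main__":
    # Abbreviated stand-in for the output shown above.
    sample = [
        {"type": "table", "bbox": [100, 200, 500, 600],
         "content_info": "| Revenue | Q1 |", "deep_understanding": "..."},
        {"type": "image", "bbox": [100, 700, 500, 900],
         "content_info": "/data/output/images/report_page_img_2.png",
         "deep_understanding": "..."},
    ]
    for kind, items in group_by_type(sample).items():
        for item in items:
            print(kind, bbox_size(item["bbox"]))
```

Note that `content_info` holds Markdown for tables but a file path for images, so downstream code should branch on `type` before using it.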
- ZHIPU_API_KEY must be configured.
- Dependencies: zhipuai, pillow, beautifulsoup4.
- All understanding is based on the complete layout logic of the document (Markdown Context), not on isolated fragments.
- For multi-page PDFs, only the first page is processed by default. For batch processing, extend the loop logic at the script level.
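One way to batch-process without modifying the scripts is to call the pipeline once per input file. The sketch below assumes each page has already been exported as its own image or single-page file (the export step is not shown) and that the command-line interface is exactly the one documented above. `build_command` and `run_batch` are hypothetical helpers, not part of the Skill.

```python
import subprocess
import sys
from pathlib import Path

# Path taken from the usage example; adjust if the script lives elsewhere.
PIPELINE_SCRIPT = "scripts/glm_ocr_pipeline.py"


def build_command(file_path, output_dir):
    """Build the documented CLI invocation for a single input file."""
    return [sys.executable, PIPELINE_SCRIPT,
            "--file_path", str(file_path),
            "--output_dir", str(output_dir)]


def run_batch(page_files, output_root, runner=subprocess.run):
    """Run the pipeline once per file and return {file name: exit code}."""
    output_root = Path(output_root)
    exit_codes = {}
    for page_file in map(Path, page_files):
        # One sub-directory per page so crops and reports do not collide.
        page_output = output_root / page_file.stem
        page_output.mkdir(parents=True, exist_ok=True)
        completed = runner(build_command(page_file, page_output), check=False)
        exit_codes[page_file.name] = completed.returncode
    return exit_codes


if __name__ == "__main__":
    pages = sorted(Path("/data/pages").glob("*.png"))  # hypothetical input folder
    print(run_batch(pages, "/data/output"))
```

Passing `runner` as a parameter makes the loop easy to dry-run or test without calling the API. A non-zero exit code in the returned mapping marks a page that needs a retry.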