
DOCX Toolkit

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication.
zacjiang

Overview

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
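
To illustrate the pipe-delimited output, here is a minimal sketch of how paragraphs and table rows could be pulled from a .docx in document order. It is not the bundled extract_text.py: it swaps python-docx for the standard library (zipfile plus ElementTree) so it runs without extra packages, and the function names extract_lines and paragraph_text are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def paragraph_text(paragraph):
    """Concatenate the text of every <w:t> run inside one <w:p>."""
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def extract_lines(docx_path):
    """Return paragraphs and table rows in document order.

    Each table row becomes one pipe-delimited line. Hypothetical helper,
    not the toolkit's own implementation.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    lines = []
    for child in body:
        if child.tag == W + "p":
            lines.append(paragraph_text(child))
        elif child.tag == W + "tbl":
            for row in child.findall(W + "tr"):
                cells = []
                for cell in row.findall(W + "tc"):
                    # A cell may hold several paragraphs (or a nested
                    # table); flatten them into one space-joined string.
                    parts = [paragraph_text(p) for p in cell.iter(W + "p")]
                    cells.append(" ".join(parts).strip())
                lines.append(" | ".join(cells))
    return lines
```

Walking the body element's children in order keeps tables where they appear in the text, rather than listing all paragraphs first and all tables afterwards.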

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
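
As a rough sketch of the idea, the code below reads the WordDocument stream with olefile and keeps runs of readable characters after decoding the bytes as UTF-16LE. This is a heuristic under stated assumptions, not the bundled extract_doc_text.py: a faithful reader would follow the file's piece table, and the names read_word_stream and utf16_text_runs are hypothetical.

```python
import re


def read_word_stream(doc_path):
    """Return the raw bytes of the WordDocument stream of a legacy .doc."""
    import olefile  # imported lazily so the pure helper below needs no extras

    if not olefile.isOleFile(doc_path):
        raise ValueError("not an OLE2 file: %s" % doc_path)
    ole = olefile.OleFileIO(doc_path)
    try:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream in %s" % doc_path)
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()


def utf16_text_runs(data, min_chars=4):
    """Heuristically pull readable text runs out of raw stream bytes.

    Assumption: the text is stored as UTF-16LE. Binary structures in the
    stream can still decode to stray characters, so treat the result as
    approximate.
    """
    text = data.decode("utf-16-le", errors="ignore")
    # \w is Unicode-aware for str patterns, so CJK characters match too.
    pattern = re.compile(r"[\w\s.,;:!?()'\"-]{%d,}" % min_chars)
    runs = (m.group().strip() for m in pattern.finditer(text))
    return [run for run in runs if run]
```

A call such as utf16_text_runs(read_word_stream("input.doc")) would then yield candidate text fragments to write to the output file.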

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

  • Automatic deduplication (MD5 hash comparison)
  • Size filtering (skips tiny icons <5KB by default)
  • Sequential renaming (img_001.png, img_002.jpg, etc.)
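
The three behaviours listed above can be sketched in a few lines. This is an illustration only, assuming images live under word/media/ inside the .docx archive (the usual location); the function name extract_images and the min_bytes parameter are hypothetical and do not describe the bundled script's interface.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy embedded images to out_dir and return the new file names.

    Skips files smaller than min_bytes, drops exact duplicates by MD5
    digest, and renames survivors sequentially with their original
    extension.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen_digests = set()
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # likely an icon or bullet glyph
            digest = hashlib.md5(data).hexdigest()
            if digest in seen_digests:
                continue  # byte-identical to an image already written
            seen_digests.add(digest)
            ext = os.path.splitext(name)[1].lower()
            out_name = "img_%03d%s" % (len(written) + 1, ext)
            with open(os.path.join(out_dir, out_name), "wb") as fh:
                fh.write(data)
            written.append(out_name)
    return written
```

Because the digest is computed over the raw bytes, the same picture saved at a different size or quality counts as a different image and is kept.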

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

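Batch-resizes and compresses images before they are sent for API processing. The author reports savings of 50-70% on vision API costs; actual savings depend on the source images and the API's pricing.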
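
A minimal sketch of the width cap follows, assuming --max-width means "shrink anything wider than this, preserving aspect ratio, and leave smaller images alone". That interpretation, and the names scaled_size and resize_image, are assumptions rather than a description of resize_images.py.

```python
def scaled_size(width, height, max_width=1024):
    """Return (width, height) capped at max_width, keeping aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale
    new_height = max(1, round(height * max_width / width))
    return max_width, new_height


def resize_image(src_path, dst_path, max_width=1024):
    """Resize one image with Pillow and save it to dst_path."""
    from PIL import Image  # imported lazily: Pillow is an optional dependency

    with Image.open(src_path) as img:
        target = scaled_size(img.width, img.height, max_width)
        out = img if target == img.size else img.resize(target, Image.LANCZOS)
        out.save(dst_path)
```

For example, a 2048x1000 image maps to 1024x500, while an 800x600 image is left at its original size.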

Dependencies

  • Python 3.6+
  • python-docx — for .docx processing
  • olefile — for legacy .doc processing
  • Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

  • Document analysis: Extract text for AI review/summarization
  • Migration: Pull content from Word docs into other formats
  • Image audit: Extract and review all embedded images
  • Cost optimization: Compress images before sending to vision APIs
  • Batch processing: Process multiple documents in a pipeline
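
For the batch-processing case, one possible approach is to build one command per document and choose the extractor by file extension. The sketch below only assembles the argument lists; the script names come from the commands shown earlier, while build_commands and its directory layout are hypothetical.

```python
import sys
from pathlib import Path

# Extension -> extractor script, as listed under Capabilities
SCRIPTS = {".docx": "extract_text.py", ".doc": "extract_doc_text.py"}


def build_commands(base_dir, input_dir, output_dir):
    """Return one argv list per Word document found in input_dir."""
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        script = SCRIPTS.get(path.suffix.lower())
        if script is None or not path.is_file():
            continue  # not a Word document
        # Keep the original extension in the output name so that
        # report.doc and report.docx do not overwrite each other.
        out_path = Path(output_dir) / (path.name + ".txt")
        commands.append([
            sys.executable,
            str(Path(base_dir) / "scripts" / script),
            str(path),
            str(out_path),
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True), so a failure on one document stops the pipeline instead of passing silently.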

Notes

  • Large .doc files (>200MB) may require significant RAM for olefile processing
  • Image extraction preserves original format (png/jpg/gif/etc.)
  • Deduplication catches exact duplicates; near-duplicates still pass through
  • CJK (Chinese/Japanese/Korean) text is fully supported in both extractors

Version history

1 version in total

  • v1.0.0 (current)
    2026-03-29 23:57
