DOCX Toolkit

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures, and includes deduplication.

zacjiang

Category: content creation · Source: clawhub · Version: v1.0.0 (1 version) · 99768.5 · Key: not required

Stars: 0 · Downloads: 1,724 · Installs: 179 · Versions: 1 · Tag: #latest

Overview

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
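
As a rough illustration of the pipe-delimited output described above, here is a minimal sketch. It is not the skill's extract_text.py: it reads the .docx zip container directly with the Python standard library instead of python-docx, and the names `extract_text` and `_para_text` are hypothetical. It assumes a simple document and does not treat nested tables specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _para_text(paragraph):
    # Join every text run (w:t) found inside one paragraph element.
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def extract_text(docx_path):
    """Return body paragraphs and table rows as lines, in document order."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find(W + "body")
    lines = []
    for child in body:
        if child.tag == W + "p":
            lines.append(_para_text(child))
        elif child.tag == W + "tbl":
            for row in child.iter(W + "tr"):
                cells = []
                for cell in row.findall(W + "tc"):
                    # A cell may hold several paragraphs; join them with spaces.
                    cells.append(" ".join(_para_text(p) for p in cell.iter(W + "p")))
                lines.append(" | ".join(cells))
    return lines
```

Calling `"\n".join(extract_text("input.docx"))` would give text comparable in shape to the output file described above, with one line per paragraph and one pipe-delimited line per table row.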

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
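
The idea of pulling Unicode text out of the WordDocument stream can be sketched as below. This is a heuristic illustration, not the skill's extract_doc_text.py: it assumes the text is stored as UTF-16LE on an even byte offset and simply keeps runs of printable characters, whereas a complete reader would follow the file's piece table. The function names `utf16_text_runs` and `read_word_stream` are hypothetical; the olefile calls used (`OleFileIO`, `openstream`, `close`) are that library's documented API.

```python
def utf16_text_runs(data, min_len=4):
    """Heuristically pull readable text runs out of raw UTF-16LE bytes."""
    usable = len(data) - (len(data) % 2)  # drop a trailing odd byte
    decoded = data[:usable].decode("utf-16-le", errors="replace")
    runs, current = [], []
    # The trailing NUL is a sentinel that flushes the final run.
    for ch in decoded + "\x00":
        if ch.isprintable() and ch != "\ufffd":
            current.append(ch)
        else:
            # Control characters (including Word's \r paragraph marks)
            # end the current run; short runs are treated as noise.
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
    return runs


def read_word_stream(doc_path):
    """Read the raw WordDocument stream from a legacy .doc file."""
    # Imported lazily so utf16_text_runs works without olefile installed.
    import olefile

    ole = olefile.OleFileIO(doc_path)
    try:
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()
```

A typical call would be `utf16_text_runs(read_word_stream("input.doc"))`. Because binary data can decode to printable code points by chance, expect some noise in the result for real files.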

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

Automatic deduplication (MD5 hash comparison)
Size filtering (skips tiny icons <5KB by default)
Sequential renaming (img_001.png, img_002.jpg, etc.)
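
The three behaviours listed above (MD5 deduplication, size filtering, sequential renaming) can be sketched with the standard library alone. This is an illustrative approximation rather than the skill's extract_images.py; the function name `extract_images` and the `min_bytes` parameter are assumptions, and it relies on embedded media living under `word/media/` inside the .docx zip container.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique, non-tiny embedded images to out_dir with sequential names."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()      # MD5 digests of images already written
    written = []      # output file names, in order
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier image
            seen.add(digest)
            ext = os.path.splitext(name)[1].lower()  # keep original format
            out_name = "img_%03d%s" % (len(written) + 1, ext)
            with open(os.path.join(out_dir, out_name), "wb") as fh:
                fh.write(data)
            written.append(out_name)
    return written
```

Hashing the raw bytes only catches byte-identical files, so the same picture saved at a different size or quality still counts as a separate image.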

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Batch-resizes and compresses images before they are sent for API processing; the author cites savings of 50-70% on vision API costs.
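
To make the `--max-width` behaviour concrete, here is a minimal sketch of aspect-preserving downscaling. It is an assumption about how such a script could work, not the contents of resize_images.py; `scaled_size` and `resize_image` are hypothetical names. The size calculation is pure Python, and Pillow is imported only inside the function that touches image files.

```python
def scaled_size(width, height, max_width=1024):
    """Return (width, height) limited to max_width, keeping the aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale
    ratio = max_width / width
    return max_width, max(1, round(height * ratio))


def resize_image(src_path, dst_path, max_width=1024):
    """Downscale one image file with Pillow if it is wider than max_width."""
    # Imported lazily so scaled_size can be used without Pillow installed.
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, max_width)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        img.save(dst_path)
```

For example, a 2048x1536 image maps to 1024x768, which is a quarter of the original pixel count.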

Dependencies

Python 3.6+
python-docx — for .docx processing
olefile — for legacy .doc processing
Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

Document analysis: Extract text for AI review/summarization
Migration: Pull content from Word docs into other formats
Image audit: Extract and review all embedded images
Cost optimization: Compress images before sending to vision APIs
Batch processing: Process multiple documents in a pipeline

Notes

Large .doc files (>200MB) may require significant RAM for olefile processing
Image extraction preserves original format (png/jpg/gif/etc.)
Deduplication catches exact duplicates; near-duplicates still pass through
CJK (Chinese/Japanese/Korean) text is fully supported in both extractors

Version history

1 version in total

v1.0.0 (current)

2026-03-29 23:57 · Security status: safe

Security checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
