Run scripts/install_deps.py --check to verify installed packages, or scripts/install_deps.py to install all.
Core dependencies:
pip install pymupdf python-docx openpyxl python-pptx beautifulsoup4 mammoth chardet striprtf
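To see which core dependencies are importable without running scripts/install_deps.py, a minimal sketch follows. The mapping from pip package names to import names is an assumption based on each package's usual import name (for example, pymupdf is commonly imported as fitz). missing_packages is a hypothetical helper and does not describe how install_deps.py works.

```python
import importlib.util

# Assumed pip-name -> import-name mapping for the core dependencies.
CORE_DEPENDENCIES = {
    "pymupdf": "fitz",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
    "beautifulsoup4": "bs4",
    "mammoth": "mammoth",
    "chardet": "chardet",
    "striprtf": "striprtf",
}


def missing_packages(dependencies):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in dependencies.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(CORE_DEPENDENCIES)
    if missing:
        print("Missing:", " ".join(missing))
    else:
        print("All core dependencies are importable.")
```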
Use scripts/extract_text.py to read content from any supported format.
python scripts/extract_text.py <file> [--sheet NAME] [--pages RANGE] [-o output.txt]
Examples:
python scripts/extract_text.py report.pdf --pages 1-5
python scripts/extract_text.py data.xlsx --sheet "Q1 Sales"
python scripts/extract_text.py slides.pptx -o content.txt
For large files, extract to a temporary file with -o, then read it selectively to avoid context overflow.
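The large-file workflow could be sketched as below. extract_command and read_lines are hypothetical helpers, and the sketch assumes the extracted output is UTF-8 text. Only the -o and --pages flags shown in the usage line are used.

```python
import itertools
import sys


def extract_command(source, output_path, pages=None):
    """Build the extract_text.py command line using the -o and --pages flags."""
    command = [sys.executable, "scripts/extract_text.py", source, "-o", output_path]
    if pages:
        command += ["--pages", pages]
    return command


def read_lines(path, start, count, encoding="utf-8"):
    """Return `count` lines starting at zero-based line `start`.

    The file is streamed, so lines after the requested window are never read.
    """
    with open(path, encoding=encoding) as handle:
        return list(itertools.islice(handle, start, start + count))
```

The command list can be passed to subprocess.run(command, check=True); afterwards read_lines(output_path, 0, 200) returns the first 200 lines, and later windows can be read the same way.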
Use scripts/convert_format.py to convert between formats.
python scripts/convert_format.py <input> --to <format> [-o output]
Supported conversions:
| Source | Targets |
|---|---|
| PDF | txt, md |
| DOCX | txt, md, html |
| XLSX | csv, json, txt |
| PPTX | txt, md |
| HTML | txt |
| CSV | json, xlsx |
| JSON | csv |
| TXT | pdf, docx |
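
A requested conversion can be checked against the table before calling the script. The sketch below covers only the DOCX through TXT rows; SUPPORTED_CONVERSIONS and can_convert are hypothetical names, and keying the lookup on the file extension is an assumption.

```python
from pathlib import Path

# Targets copied from the DOCX through TXT rows of the conversion table.
SUPPORTED_CONVERSIONS = {
    "docx": {"txt", "md", "html"},
    "xlsx": {"csv", "json", "txt"},
    "pptx": {"txt", "md"},
    "html": {"txt"},
    "csv": {"json", "xlsx"},
    "json": {"csv"},
    "txt": {"pdf", "docx"},
}


def can_convert(source_path, target_format):
    """Return True if the table lists target_format for the file's extension."""
    extension = Path(source_path).suffix.lstrip(".").lower()
    return target_format.lower() in SUPPORTED_CONVERSIONS.get(extension, set())
```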
Use scripts/search_doc.py to find text patterns (supports regex).
python scripts/search_doc.py <file> <pattern> [-i] [-C N]
Flags: -i case-insensitive, -C N show N lines of context.
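To illustrate what the two flags mean, here is a minimal sketch of a regex search with context over lines of text. It does not reproduce search_doc.py: search_lines is a hypothetical function, and treating -C N as N lines on each side of a match is an assumption.

```python
import re


def search_lines(lines, pattern, ignore_case=False, context=0):
    """Yield (line_number, line, context_block) for each line matching the regex.

    line_number is 1-based. context_block holds up to `context` lines on each
    side of the match, including the matching line itself.
    """
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for index, line in enumerate(lines):
        if regex.search(line):
            first = max(0, index - context)
            last = min(len(lines), index + context + 1)
            yield index + 1, line, lines[first:last]
```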
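For summarization, combine extraction with Claude's analysis: extract the text with scripts/extract_text.py, then summarize the extracted content.
For document comparison, extract both documents, then compare them with a diff or Claude's analysis.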
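For the comparison case, once both documents have been extracted to text, the diff step could be sketched with Python's standard difflib module. diff_texts is a hypothetical helper; the extraction itself is assumed to have been done already.

```python
import difflib


def diff_texts(old_text, new_text, old_name="old", new_name="new"):
    """Return a unified diff of two extracted texts as a single string.

    An empty string means the two texts have identical lines.
    """
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    )
    return "\n".join(diff)
```

A short diff can be read directly; a long one can be passed to Claude together with the two extracted texts.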
User request involves a document?
├── Read/extract content? → extract_text.py
├── Convert format? → convert_format.py
├── Search for text? → search_doc.py
├── Summarize? → extract_text.py + Claude analysis
├── Compare documents? → extract both + diff/Claude analysis
└── Create new document? → convert_format.py (txt -> target format)
Use --pages to extract in chunks (e.g., 1-10, 11-20).
Use --sheet to process one sheet at a time.
Write the output to a file with -o, then read portions as needed.
Run scripts/install_deps.py to install missing dependencies.
Use --encoding gbk or --encoding gb18030 for Chinese documents.
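The chunked --pages values can be generated rather than typed by hand. This is a minimal sketch: page_chunks is a hypothetical helper, and the total page count is assumed to be known in advance.

```python
def page_chunks(total_pages, chunk_size=10):
    """Return inclusive page ranges such as "1-10" that cover 1..total_pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        ranges.append(f"{start}-{end}")
    return ranges
```

Each returned string can be passed as the --pages argument of one extract_text.py call.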