
document-skills

Comprehensive document processing skill for reading, extracting, converting, searching, and summarizing content across multiple formats: PDF, DOCX, XLSX, PPTX, HTML, Markdown, CSV, JSON, TXT, RTF, ODT. Use when: (1) user asks to read or extract text from documents, (2) user wants to convert between document formats, (3) user needs to search within documents, (4) user asks to summarize document content, (5) user mentions processing PDF/Word/Excel/PowerPoint files, (6) user needs to analyze or com
>Your document assistant, with support for processing all kinds of documents
user_5b9d6131

Overview

Document Skills

Dependencies

Run scripts/install_deps.py --check to verify installed packages, or scripts/install_deps.py to install all.

Core dependencies:

pip install pymupdf python-docx openpyxl python-pptx beautifulsoup4 mammoth chardet striprtf
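
As a rough illustration of what a dependency check could look like, the sketch below tests whether each core package can be imported. It is not the implementation of scripts/install_deps.py, which this page does not show; the `CORE_PACKAGES` mapping and the `missing_packages` helper are hypothetical names, and the import names are the ones these packages are commonly imported under.

```python
import importlib.util

# pip package name -> module name used at import time.
# Assumption: install_deps.py may check packages differently.
CORE_PACKAGES = {
    "pymupdf": "fitz",
    "python-docx": "docx",
    "openpyxl": "openpyxl",
    "python-pptx": "pptx",
    "beautifulsoup4": "bs4",
    "mammoth": "mammoth",
    "chardet": "chardet",
    "striprtf": "striprtf",
}


def missing_packages(packages=CORE_PACKAGES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-run install command for whatever is absent.
        print("pip install " + " ".join(missing))
    else:
        print("All core dependencies are installed.")
```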

Core Operations

1. Extract Text

Use scripts/extract_text.py to read content from any supported format.

python scripts/extract_text.py <file> [--sheet NAME] [--pages RANGE] [-o output.txt]

Examples:

python scripts/extract_text.py report.pdf --pages 1-5
python scripts/extract_text.py data.xlsx --sheet "Q1 Sales"
python scripts/extract_text.py slides.pptx -o content.txt

For large files, extract to a temp file then read selectively to avoid context overflow.
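
One way to read selectively, sketched under assumptions: extract with -o into a temporary file, then pull only a window of lines into memory. The helper names `extract_to_temp` and `read_window` are hypothetical; only the script path and the -o flag come from the usage line above.

```python
import subprocess
import sys
import tempfile
from itertools import islice


def extract_to_temp(source, script="scripts/extract_text.py"):
    """Run the extractor with -o pointing at a temp file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".txt", delete=False)
    tmp.close()  # the extractor writes to the path; we only need the name
    subprocess.run([sys.executable, script, source, "-o", tmp.name], check=True)
    return tmp.name


def read_window(path, start=0, count=200, encoding="utf-8"):
    """Return up to `count` lines starting at zero-based line `start`."""
    with open(path, encoding=encoding) as fh:
        # islice skips the leading lines without keeping them in memory.
        return list(islice(fh, start, start + count))
```

For example, `read_window(extract_to_temp("report.pdf"), start=0, count=100)` would load only the first 100 extracted lines.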

2. Convert Formats

Use scripts/convert_format.py to convert between formats.

python scripts/convert_format.py <input> --to <format> [-o output]

Supported conversions:

Source   Targets
------   --------------
PDF      txt, md
DOCX     txt, md, html
XLSX     csv, json, txt
PPTX     txt, md
HTML     txt
CSV      json, xlsx
JSON     csv
TXT      pdf, docx
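
The supported conversions can also be written as a lookup, which makes it cheap to reject an unsupported pair before invoking the script. This is a sketch only: `SUPPORTED` mirrors the table above, and `can_convert` is a hypothetical helper, not part of convert_format.py.

```python
from pathlib import Path

# Source format -> target formats, copied from the conversion table.
SUPPORTED = {
    "pdf": {"txt", "md"},
    "docx": {"txt", "md", "html"},
    "xlsx": {"csv", "json", "txt"},
    "pptx": {"txt", "md"},
    "html": {"txt"},
    "csv": {"json", "xlsx"},
    "json": {"csv"},
    "txt": {"pdf", "docx"},
}


def can_convert(input_path, target):
    """Return True if the file's extension can be converted to `target`."""
    source = Path(input_path).suffix.lower().lstrip(".")
    return target.lower() in SUPPORTED.get(source, set())
```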

3. Search Documents

Use scripts/search_doc.py to find text patterns (supports regex).

python scripts/search_doc.py <file> <pattern> [-i] [-C N]

Flags: -i case-insensitive, -C N show N lines of context.
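
To make the two flags concrete, here is a minimal sketch of a regex search with optional case-insensitivity and N lines of context, applied to already extracted text. It illustrates the behaviour the flags describe and is not the code of search_doc.py; `search_lines` is a hypothetical name.

```python
import re


def search_lines(text, pattern, ignore_case=False, context=0):
    """Return (line_number, surrounding_lines) for each matching line."""
    flags = re.IGNORECASE if ignore_case else 0  # analogous to -i
    regex = re.compile(pattern, flags)
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if regex.search(line):
            # Analogous to -C N: keep N lines on each side of the match.
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            hits.append((i + 1, lines[lo:hi]))
    return hits
```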

4. Summarize Documents

For summarization, combine extraction with Claude's analysis:

  1. Extract text using scripts/extract_text.py
  2. For large documents, extract section by section (e.g., page ranges for PDF)
  3. Apply Claude's summarization capabilities to the extracted content
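
For formats without page ranges, one possible way to work section by section is to split the extracted text on paragraph boundaries under a size budget. The sketch below assumes paragraphs are separated by blank lines; `chunk_paragraphs` and the `max_chars` default are illustrative choices, not part of the skill.

```python
def chunk_paragraphs(text, max_chars=8000):
    """Group paragraphs into chunks of roughly at most max_chars each.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Flush the running chunk if adding this paragraph would exceed the budget.
        if current and size + len(para) + 2 > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para) + 2  # +2 accounts for the blank-line separator
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk can then be summarized on its own, and the partial summaries combined in a final pass.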

5. Compare Documents

For document comparison:

  1. Extract text from both documents
  2. Use diff tools or Claude to identify differences
  3. For XLSX: extract both sheets and compare cell-by-cell
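
The comparison steps can be sketched with the standard library alone: `difflib` for extracted text, and a cell-by-cell walk over two sheets represented as lists of rows. Both helper names are hypothetical, and how the rows are obtained (for example from a CSV export) is left open.

```python
import difflib
from itertools import zip_longest


def text_diff(old, new, old_name="old", new_name="new"):
    """Return unified-diff lines between two extracted texts."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def cell_diff(rows_a, rows_b):
    """Return (row, col, value_a, value_b), 1-based, for each differing cell."""
    changes = []
    # zip_longest pads the shorter sheet or row so extra cells are reported too.
    for r, (row_a, row_b) in enumerate(zip_longest(rows_a, rows_b, fillvalue=()), start=1):
        for c, (a, b) in enumerate(zip_longest(row_a, row_b, fillvalue=None), start=1):
            if a != b:
                changes.append((r, c, a, b))
    return changes
```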

Workflow Decision Tree

User request involves a document?
├── Read/extract content?  → extract_text.py
├── Convert format?        → convert_format.py
├── Search for text?       → search_doc.py
├── Summarize?             → extract_text.py + Claude analysis
├── Compare documents?     → extract both + diff/Claude analysis
└── Create new document?   → convert_format.py (txt -> target format)

Handling Large Documents

  • PDF: Use --pages to extract in chunks (e.g., 1-10, 11-20)
  • XLSX: Use --sheet to process one sheet at a time
  • General: Extract to file with -o, then read portions as needed
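
The chunked --pages values can be generated rather than typed by hand. A minimal sketch, assuming the total page count is already known; `page_ranges` is a hypothetical helper.

```python
def page_ranges(total_pages, chunk_size=10):
    """Return inclusive 1-based ranges such as '1-10', '11-20' for --pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        f"{start}-{min(start + chunk_size - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

Each returned string can be passed as the RANGE argument, for example `python scripts/extract_text.py report.pdf --pages 11-20`.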

Error Handling

  • Missing library errors: Run scripts/install_deps.py to install
  • Encoding issues: Try --encoding gbk or --encoding gb18030 for Chinese documents
  • Corrupted files: Try alternative libraries (e.g., PyPDF2 vs pymupdf for PDF)
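
The encoding advice amounts to trying candidate encodings in order until one decodes cleanly. The sketch below shows that idea on raw bytes; `decode_with_fallback` and its default encoding order are assumptions for illustration, not the behaviour of the --encoding flag.

```python
def decode_with_fallback(data, encodings=("utf-8", "gbk", "gb18030")):
    """Decode bytes with the first encoding that succeeds.

    Returns (text, encoding_name). Raises ValueError if none work.
    """
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"could not decode data with any of: {', '.join(encodings)}")
```

A clean decode does not prove the guess was right, so a quick look at the result is still worthwhile for mixed or legacy files.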

Version History

1 version in total

  • v1.0.0 Initial release (current)
    2026-05-25 18:14 Safe

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
