
document-parser

Parse and extract content from .docx, .pdf, and .txt documents. Extracts plain text and tables for analysis. Use when the user uploads a document file or asks a question.
Author: mjk39966-glitch
Category: Uncategorized, Source: clawhub, Version: v1.0.0 (1 version), Key: not required
Stars: 0, Downloads: 488, Installs: 0
Tags: #document #docx #latest #parser #pdf

Overview

Document Parser

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.
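
The command above can also be driven from Python. The following is a minimal sketch, assuming the script prints its JSON to standard output (the listing does not say where the output goes). The helper names build_command and parse_document are hypothetical and not part of the skill.

```python
import json
import subprocess
import sys

SCRIPT = "scripts/parse_document.py"  # path taken from the Quick Start command


def build_command(document_path, script=SCRIPT):
    """Return the argv list equivalent to the Quick Start command."""
    return [sys.executable, script, str(document_path)]


def parse_document(document_path, script=SCRIPT):
    """Run the parser and decode its output.

    Assumption: the JSON document is printed on stdout.
    """
    completed = subprocess.run(
        build_command(document_path, script),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Using sys.executable instead of a bare "python" keeps the call inside the interpreter that has the dependencies installed.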

Installation

First use only: Install dependencies by running:

  • Linux/macOS: bash scripts/install_dependencies.sh
  • Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber

Supported Formats

Format | Text | Tables | Notes
-----------------------------
.txt | yes | - | Direct text extraction
.docx | yes | yes | Paragraphs + structured tables
.pdf | yes | yes | Page-by-page extraction
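
Before calling the parser it can help to reject unsupported files early. This is a small sketch of such a check, keyed on the three extensions listed above; detect_format is a hypothetical helper, and how the real script decides on a format is not documented here.

```python
from pathlib import Path

# Extensions and notes as listed under Supported Formats.
SUPPORTED = {
    ".txt": "Direct text extraction",
    ".docx": "Paragraphs + structured tables",
    ".pdf": "Page-by-page extraction",
}


def detect_format(path):
    """Return the lowercase extension if it is supported, else raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"Unsupported format: {suffix or '(no extension)'}")
    return suffix
```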

Workflow

  1. Parse the document using scripts/parse_document.py
  2. Analyze the output (text and tables in JSON)
  3. Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

  1. Run: python scripts/parse_document.py quarterly_report.docx
  2. Locate tables in output
  3. Find revenue column and calculate total
  4. Reply with answer
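
Steps 2 and 3 can be sketched as follows, assuming each table is a list of rows whose first row holds the column headers and whose cells are strings. The sample table is invented for illustration and does not come from a real quarterly_report.docx.

```python
def parse_number(cell):
    """Convert a cell such as '$1,200.50' to a float; return None if not numeric."""
    cleaned = cell.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def column_total(table, header):
    """Sum the numeric cells under `header`; the first row is the header row."""
    headers = [h.strip().lower() for h in table[0]]
    index = headers.index(header.lower())  # ValueError if the column is absent
    values = (parse_number(row[index]) for row in table[1:] if index < len(row))
    return sum(v for v in values if v is not None)


# Hypothetical table, shaped like one entry of the "tables" list in the output.
sample = [
    ["Quarter", "Revenue"],
    ["Q1", "1,200"],
    ["Q2", "1,350.50"],
    ["Q3", "n/a"],  # non-numeric cells are skipped rather than treated as zero
]
print(column_total(sample, "Revenue"))  # 2550.5
```

Skipping non-numeric cells silently is a design choice; for financial figures it may be safer to report them back to the user.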

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}
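
One way to work with this structure is to turn each table into a list of dictionaries keyed by header. The sketch below loads the example payload shown above and assumes the first row of every table is its header row; table_to_records is a hypothetical helper.

```python
import json

# The example payload from the Output Format section.
raw = """
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
"""


def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header."""
    header, *rows = table  # raises ValueError for an empty table
    return [dict(zip(header, row)) for row in rows]


result = json.loads(raw)
records = [table_to_records(t) for t in result["tables"]]
print(records[0])  # [{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]
```

In this example metadata["tables"] equals len(result["tables"]); whether the script guarantees that in general is not stated.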

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...

==========================================================
TABLES FOUND: 2
==========================================================

Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
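
The table layout shown above can be reproduced from the JSON output with a few lines. This is only an approximation of the format, not the script's own rendering code; format_table is a hypothetical helper, and treating None cells as empty strings is an assumption about how empty cells might appear.

```python
def format_table(table):
    """Join each row's cells with ' | ', one row per line."""
    lines = []
    for row in table:
        cells = ["" if cell is None else str(cell) for cell in row]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


table = [["Name", "Age", "City"], ["John", "30", "NYC"], ["Jane", "25", "LA"]]
print("Table 1:")
print(format_table(table))
```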

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.
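
A dependency check of this kind can be sketched with the standard library alone. The snippet below is an illustration, not the script's actual check: it maps each pip package from the Installation section to the module name it is imported under and reports the ones that cannot be found.

```python
import importlib.util

# pip package name -> importable module name
REQUIRED = {
    "python-docx": "docx",
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found in this environment."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing dependencies:", ", ".join(missing))
    print("Run scripts/install_dependencies.sh (Linux/macOS) "
          "or scripts\\install_dependencies.bat (Windows).")
```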

Notes

  • Large PDFs: Processing may take time for documents >50 pages
  • Scanned PDFs: OCR not supported; text must be selectable
  • Complex tables: PDF table extraction works best with clear borders
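
Because OCR is not supported, it is useful to detect a likely scanned PDF and tell the user instead of answering from empty content. The heuristic below is a sketch that assumes a scanned PDF produces a page count but an empty "text" field; the script's real behavior in that case is not documented here.

```python
def looks_scanned(result):
    """Heuristic: a PDF with pages but no extractable text is probably scanned."""
    meta = result.get("metadata", {})
    is_pdf = meta.get("format") == "pdf"
    has_pages = meta.get("pages", 0) > 0
    has_text = bool(result.get("text", "").strip())
    return is_pdf and has_pages and not has_text


# Hypothetical parser outputs used only to exercise the heuristic.
scanned = {"text": "  \n", "tables": [], "metadata": {"format": "pdf", "pages": 4, "tables": 0}}
normal = {"text": "Quarterly report", "tables": [], "metadata": {"format": "pdf", "pages": 4, "tables": 0}}
print(looks_scanned(scanned), looks_scanned(normal))  # True False
```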

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-03 07:22, security status: safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
