Document Parser

Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.

Quick Start

Parse a document:

python scripts/parse_document.py /path/to/document.pdf

Output is JSON with extracted text, tables, and metadata.
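
When the parser is driven from another Python program rather than from the shell, the call above could be wrapped roughly as follows. This is a minimal sketch, not the script's own API: run_parser is a hypothetical helper name, and it assumes the script prints exactly one JSON object to stdout and exits with a non-zero code on failure.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_parser(document, script=Path("scripts/parse_document.py")):
    """Run the parser script on `document` and return its JSON output as a dict.

    Assumption: the script writes a single JSON object to stdout.
    """
    result = subprocess.run(
        [sys.executable, str(script), str(document)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

A call such as run_parser("/path/to/document.pdf") would then return a dict with the "text", "tables", and "metadata" keys described under Output Format.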

Installation

Before first use, install the dependencies by running the script for your platform:

Linux/macOS: bash scripts/install_dependencies.sh
Windows: scripts\install_dependencies.bat

This installs: python-docx, PyPDF2, pdfplumber

Supported Formats

Format	Text	Tables	Notes
--------	------	--------	-------
.txt	✅	❌	Direct text extraction
.docx	✅	✅	Paragraphs + structured tables
.pdf	✅	✅	Page-by-page extraction
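
The table above can be mirrored in code to check a file before parsing it. The sketch below is illustrative only: SUPPORTED and capabilities are hypothetical names, and matching the extension case-insensitively is an assumption about how the script behaves.

```python
from pathlib import Path

# Capabilities copied from the Supported Formats table.
SUPPORTED = {
    ".txt": {"text": True, "tables": False},
    ".docx": {"text": True, "tables": True},
    ".pdf": {"text": True, "tables": True},
}


def capabilities(path):
    """Return the capability dict for `path`, or None if the format is unsupported.

    Assumption: extensions are compared case-insensitively.
    """
    return SUPPORTED.get(Path(path).suffix.lower())
```

For example, capabilities("notes.txt") reports that text is available but tables are not, so there is no point looking for a "tables" entry in that output.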

Workflow

Parse the document using scripts/parse_document.py
Analyze the output (text and tables in JSON)
Answer the user's question using extracted content

Example: Answering questions about a document

User: "What's the total revenue in quarterly_report.docx?"

Steps:

Run: python scripts/parse_document.py quarterly_report.docx
Locate the tables in the JSON output
Find the revenue column and sum its values
Reply with the total
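
The "find the column and total it" step could look like the sketch below. It assumes each table is a list of rows whose first row is the header, as in the Output Format sample; column_total is a hypothetical helper, and stripping currency symbols and thousands separators is a guess at what real cells may contain.

```python
import re


def column_total(table, column_name):
    """Sum the numeric cells of `column_name` in a table shaped like
    [[header, ...], [cell, ...], ...]. Non-numeric cells are skipped."""
    header, *rows = table
    normalized = [str(h).strip().lower() for h in header]
    idx = normalized.index(column_name.strip().lower())  # ValueError if absent
    total = 0.0
    for row in rows:
        if idx >= len(row) or row[idx] is None:
            continue  # short row or empty cell
        cleaned = re.sub(r"[^\d.\-]", "", str(row[idx]))  # drop "$", ",", spaces
        try:
            total += float(cleaned)
        except ValueError:
            continue  # purely textual cell such as "n/a"
    return total
```

Skipping unparseable cells keeps the sum robust, but it can also hide extraction problems, so it is worth reporting how many rows were actually counted when answering the user.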

Output Format

Default JSON output:

{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}
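
For downstream analysis it is often easier to work with one dict per row than with nested lists. The following is a minimal sketch under the assumption that the first row of every table is its header; table_to_records is a hypothetical helper, and the sample is the JSON shown above.

```python
import json


def table_to_records(table):
    """Turn [[header...], [row...], ...] into a list of {header: cell} dicts.

    Assumption: the first row of the table is its header row.
    """
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


sample = json.loads("""
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
""")

# One list of row dicts per extracted table.
records = [table_to_records(t) for t in sample["tables"]]
```

Note that zip stops at the shorter of the header and the row, so ragged rows lose their extra cells silently.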

Human-readable format (add --format text):

==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...

==========================================================
TABLES FOUND: 2
==========================================================

Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
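
If only the JSON output is at hand, a similar pipe-delimited view of a table can be produced from it. This sketch imitates the layout shown above and is not taken from the script itself; render_table is a hypothetical name, and rendering empty cells as empty strings is an assumption.

```python
def render_table(table, number):
    """Format one table as 'Table N:' followed by pipe-delimited rows."""
    lines = [f"Table {number}:"]
    for row in table:
        # Empty (None) cells are rendered as empty strings.
        lines.append(" | ".join("" if cell is None else str(cell) for cell in row))
    return "\n".join(lines)
```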

Advanced Usage

For detailed examples and edge cases, see references/usage_examples.md.

Error Handling

If dependencies are missing, the script returns an error with installation instructions. Run the install script for your platform (see Installation) to resolve it.
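
A caller can also check for the dependencies up front instead of waiting for the script to fail. The sketch below is one possible pre-flight check, not the script's own logic: REQUIRED and missing_dependencies are hypothetical names, and the mapping relies on python-docx being imported as docx.

```python
import importlib.util

# pip package name -> import name, for the packages listed under Installation.
REQUIRED = {"python-docx": "docx", "PyPDF2": "PyPDF2", "pdfplumber": "pdfplumber"}


def missing_dependencies(required=REQUIRED):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]
```

An empty list means all three packages are importable. Otherwise the returned names indicate that the install script should be run first.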

Notes

Large PDFs: Documents longer than 50 pages may take a while to process
Scanned PDFs: OCR is not supported, so the text must be selectable
Complex tables: PDF table extraction works best on tables with clear borders
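
Because scanned PDFs yield little or no text, the parsed output itself can hint at the problem. The heuristic below is a rough sketch: looks_scanned is a hypothetical helper, the threshold of 20 characters per page is an arbitrary assumption, and it relies on the "format" and "pages" metadata fields shown under Output Format.

```python
def looks_scanned(parsed, min_chars_per_page=20):
    """Heuristic: a PDF with pages but almost no extracted text is probably scanned.

    Assumption: `parsed` follows the default JSON output shape.
    """
    meta = parsed.get("metadata", {})
    if meta.get("format") != "pdf":
        return False
    pages = meta.get("pages") or 0
    text = (parsed.get("text") or "").strip()
    return pages > 0 and len(text) < pages * min_chars_per_page
```

When this returns True, it is safer to tell the user that the document appears to be image-only than to answer from an empty extraction.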

Version History

1 version in total

v1.0.0 (current)

2026-05-03 07:22