Extract text and tables from documents (.docx, .pdf, .txt) for analysis and question-answering.
Parse a document:
python scripts/parse_document.py /path/to/document.pdf
Output is JSON with extracted text, tables, and metadata.
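To consume the output from Python rather than the shell, the call can be wrapped in a small helper. This is a minimal sketch that assumes the script prints its JSON to standard output and exits with a non-zero code on failure; the helper name `parse_document` and its `script` parameter are illustrative and not part of the skill itself.

```python
import json
import subprocess
import sys


def parse_document(path, script="scripts/parse_document.py"):
    """Run the parser script on `path` and return its JSON output as a dict.

    Assumes the script writes JSON to stdout (not confirmed by the source).
    """
    result = subprocess.run(
        [sys.executable, script, path],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)
```

Under those assumptions, `data = parse_document("/path/to/document.pdf")` would give access to `data["text"]`, `data["tables"]`, and `data["metadata"]`.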
First use only: Install dependencies by running:
bash scripts/install_dependencies.sh
scripts\install_dependencies.bat
This installs: python-docx, PyPDF2, pdfplumber
| Format | Text | Tables | Notes |
|---|---|---|---|
| .txt | ✅ | ❌ | Direct text extraction |
| .docx | ✅ | ✅ | Paragraphs + structured tables |
| .pdf | ✅ | ✅ | Page-by-page extraction |
scripts/parse_document.py
User: "What's the total revenue in quarterly_report.docx?"
Steps:
python scripts/parse_document.py quarterly_report.docx
Default JSON output:
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {
    "format": "pdf",
    "pages": 3,
    "tables": 1
  }
}
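Each table in this structure is a list of rows, and each row is a list of cell strings. A minimal sketch for turning such a table into dictionaries, assuming the first row holds the column headers (true of the example, but not guaranteed for every document); `table_to_records` is a hypothetical helper name:

```python
import json

# Sample output copied from the default JSON structure shown above.
raw = """
{
  "text": "Full document text...",
  "tables": [
    [["Header 1", "Header 2"], ["Data 1", "Data 2"]]
  ],
  "metadata": {"format": "pdf", "pages": 3, "tables": 1}
}
"""


def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by header.

    Assumes the first row is the header row.
    """
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


data = json.loads(raw)
records = [table_to_records(t) for t in data["tables"]]
print(records)
# [[{'Header 1': 'Data 1', 'Header 2': 'Data 2'}]]
```

Keyed records make questions such as "what is the total of column X" easier to answer than positional row access, at the cost of relying on the header-row assumption.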
Human-readable format (add --format text):
==========================================================
EXTRACTED TEXT:
==========================================================
Document content here...
==========================================================
TABLES FOUND: 2
==========================================================
Table 1:
Name | Age | City
John | 30 | NYC
Jane | 25 | LA
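The text layout above can be approximated from the JSON structure. The following is a rough sketch, assuming the layout shown in the sample (a rule line, section titles, and cells joined with ` | `); the real script's formatting may differ in details such as rule width, and `render_text` is a hypothetical name:

```python
RULE = "=" * 58  # rule width is approximate, based on the sample output


def render_text(data):
    """Render parsed output (dict with 'text' and 'tables') as plain text."""
    lines = [RULE, "EXTRACTED TEXT:", RULE, data.get("text", "")]
    tables = data.get("tables", [])
    lines += [RULE, f"TABLES FOUND: {len(tables)}", RULE]
    for number, table in enumerate(tables, start=1):
        lines.append(f"Table {number}:")
        # Join the cells of each row with " | ", as in the sample.
        lines.extend(" | ".join(str(cell) for cell in row) for row in table)
    return "\n".join(lines)


sample = {
    "text": "Document content here...",
    "tables": [
        [["Name", "Age", "City"], ["John", "30", "NYC"], ["Jane", "25", "LA"]]
    ],
}
print(render_text(sample))
```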
For detailed examples and edge cases, see references/usage_examples.md.
If dependencies are missing, the script returns an error with installation instructions. Run the appropriate install script to resolve.
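One way to check for missing dependencies before running the parser is to probe for the import names. This sketch assumes the three packages named in the install step are the only requirements; note that `python-docx` is imported as `docx`. `missing_dependencies` is an illustrative helper, not part of the skill's scripts:

```python
import importlib.util

# Maps pip package name -> module name used in `import` statements.
REQUIRED = {
    "python-docx": "docx",
    "PyPDF2": "PyPDF2",
    "pdfplumber": "pdfplumber",
}


def missing_dependencies(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```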