Extract structured data from accounting documents (invoices, POs, bank statements) into Excel tracking sheets with JSON backups. Handles digital PDFs, scanned PDFs, and images via automatic OCR.
Install the system OCR dependencies before first use. See {baseDir}/references/ocr-setup.md for the full guide.
# Ubuntu / Debian
sudo apt install tesseract-ocr tesseract-ocr-vie poppler-utils
# Verify
uv run {baseDir}/scripts/ocr_utils.py check
uv run {baseDir}/scripts/classify_document.py /path/to/document.pdf
Returns JSON with type (invoice / po / statement / other), confidence, and a ready-to-run extraction command.
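A minimal sketch of consuming the classifier output from Python. The source names the `type` and `confidence` fields; the key `command` for the extraction command and the sample payload itself are assumptions made for illustration, not the script's documented schema.

```python
import json

# Hypothetical sample of the classifier's JSON output. The "command" key
# and the values shown here are assumptions for illustration only.
SAMPLE = (
    '{"type": "invoice", "confidence": 92.5, '
    '"command": "uv run scripts/extract_invoice.py /path/to/invoice.pdf"}'
)

# Document types listed in the text above.
KNOWN_TYPES = {"invoice", "po", "statement", "other"}


def parse_classification(raw: str) -> dict:
    """Parse a classification result and reject unknown document types."""
    result = json.loads(raw)
    if result.get("type") not in KNOWN_TYPES:
        raise ValueError(f"unexpected document type: {result.get('type')!r}")
    return result


result = parse_classification(SAMPLE)
print(result["type"], result["confidence"])
```

Validating the type before acting on the result keeps a malformed or unexpected classification from being routed to the wrong extractor.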
uv run {baseDir}/scripts/extract_invoice.py /path/to/invoice.pdf -o invoice_tracking.xlsx
Appends to the Excel tracking sheet. Use --dry-run to preview parsed data without writing.
uv run {baseDir}/scripts/extract_statement.py /path/to/statement.pdf
Creates statement_{bank}_{date}.xlsx with transactions. Use -o to specify output path.
uv run {baseDir}/scripts/extract_po.py /path/to/po.pdf -o po_tracking.xlsx
Tracks delivery dates and flags overdue/urgent POs.
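The actual status rules live in {baseDir}/references/po-fields.md and are not reproduced here. As a rough sketch of what flagging overdue and urgent POs could look like, the function below assumes that "overdue" means the delivery date has passed and "urgent" means delivery falls within a hypothetical 3-day window; the window size, the `on_track` label, and the function name are all illustrative assumptions.

```python
from datetime import date, timedelta

# Hypothetical window; the real threshold is defined by the skill's PO logic.
URGENT_WINDOW_DAYS = 3


def delivery_status(
    delivery_date: date,
    today: date,
    urgent_window_days: int = URGENT_WINDOW_DAYS,
) -> str:
    """Classify a PO by how its delivery date compares with today."""
    if delivery_date < today:
        return "overdue"
    if delivery_date - today <= timedelta(days=urgent_window_days):
        return "urgent"
    return "on_track"


today = date(2024, 5, 10)
print(delivery_status(date(2024, 5, 8), today))   # delivery date already passed
print(delivery_status(date(2024, 5, 12), today))  # inside the 3-day window
print(delivery_status(date(2024, 6, 1), today))   # comfortably in the future
```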
uv run {baseDir}/scripts/generate_templates.py all -o ~/accounting/
Creates blank tracking sheets: invoice_tracking.xlsx, po_tracking.xlsx, statement_template.xlsx.
| Flag | Description |
|---|---|
| `--format excel\|json\|both` | Output format (default: both) |
| `--dry-run` | Parse and validate only, print JSON to stdout |
| `--json-dir DIR` | Directory for JSON backup files |
| `-o FILE` | Output Excel file path |
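For readers who want to see how these options fit together, here is a sketch of an `argparse` parser that accepts the same flags. It is an illustration of the interface described in the table, not the scripts' actual source; the positional `input` argument and the `output` destination name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the shared flags listed in the table."""
    parser = argparse.ArgumentParser(description="Illustrative extraction CLI")
    parser.add_argument("input", help="Path to the document to extract")
    parser.add_argument(
        "--format",
        choices=["excel", "json", "both"],
        default="both",
        help="Output format",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Parse and validate only, print JSON to stdout",
    )
    parser.add_argument("--json-dir", metavar="DIR", help="Directory for JSON backups")
    parser.add_argument("-o", dest="output", metavar="FILE", help="Output Excel file")
    return parser


args = build_parser().parse_args(["invoice.pdf", "--dry-run", "-o", "out.xlsx"])
print(args.format, args.dry_run, args.output)  # prints: both True out.xlsx
```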
File → classify_document.py → route → extract_*.py → Excel + JSON
For a folder of mixed documents, classify first, then route:
for f in /path/to/docs/*; do
  uv run {baseDir}/scripts/classify_document.py "$f" --output-dir ~/accounting/
done
Then run the suggested extraction commands from each classification result.
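As an alternative to copying each suggested command by hand, the routing step can be sketched in Python. The script names are taken from the commands in this document; the `type` key, the `route` helper, and the choice to skip `other` documents are assumptions for illustration.

```python
import json
from typing import List, Optional

# Extraction scripts named in this document, keyed by classified type.
EXTRACTORS = {
    "invoice": "extract_invoice.py",
    "po": "extract_po.py",
    "statement": "extract_statement.py",
}


def route(
    classification_json: str, path: str, base_dir: str = "{baseDir}"
) -> Optional[List[str]]:
    """Return the extraction command for a classified file, or None to skip it."""
    doc_type = json.loads(classification_json).get("type")
    script = EXTRACTORS.get(doc_type)
    if script is None:
        return None  # "other" or an unrecognised type: leave for manual handling
    return ["uv", "run", f"{base_dir}/scripts/{script}", path]


print(route('{"type": "po"}', "/path/to/po.pdf"))
```

Returning the command as an argument list rather than a single string means it can be passed to `subprocess.run` without shell quoting problems for paths that contain spaces.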
All scripts share {baseDir}/scripts/ocr_utils.py, which auto-selects the best extraction method for the input (digital PDF, scanned PDF, or image).
Each result includes ocr_confidence and extraction_confidence percentages. Documents below 85% are automatically flagged needs_review.
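The review rule can be sketched as a small predicate. The text does not say whether the 85% threshold applies to one score or both, so this sketch assumes a document is flagged when either score falls below the threshold; the function name and that rule are assumptions.

```python
REVIEW_THRESHOLD = 85.0  # percent, as stated in the text above


def needs_review(
    ocr_confidence: float,
    extraction_confidence: float,
    threshold: float = REVIEW_THRESHOLD,
) -> bool:
    """Flag a result when either confidence score is below the threshold.

    Assumption: the lower of the two scores decides; the real scripts may
    combine the scores differently.
    """
    return min(ocr_confidence, extraction_confidence) < threshold


print(needs_review(96.0, 81.5))  # extraction score is below 85
print(needs_review(91.0, 88.0))  # both scores are at or above 85
```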
Read these for field schemas, Vietnamese format details, and validation logic:
- {baseDir}/references/invoice-fields.md: Vietnamese VAT invoice fields, tax rates, patterns
- {baseDir}/references/bank-formats.md: Vietnamese bank names, transaction formats, amount patterns
- {baseDir}/references/po-fields.md: PO fields, delivery status logic, payment terms
- {baseDir}/references/ocr-setup.md: OCR installation, troubleshooting, confidence scoring