Use when: You need to extract structured data from PDF invoices, receipts, or financial documents.
# Install dependencies
pip install --break-system-packages PyPDF2 pymupdf pillow pytesseract
# Parse a single invoice
python3 scripts/parse-invoice.py invoice.pdf --output invoice_data.csv
# Parse multiple invoices
python3 scripts/parse-invoices.py ./invoices/ --output consolidated.csv
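# Parse a single invoice at an explicit path
python3 scripts/parse-invoice.py path/to/invoice.pdf --output output.csv
# Parse a directory of invoices into one Excel workbook
python3 scripts/parse-invoices.py ./invoice_directory/ --output consolidated.xlsx
# Parse a scanned invoice using OCR
python3 scripts/parse-invoice.py scanned_invoice.pdf --ocr --output output.csv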
| Field | Description |
|---|---|
| vendor_name | Company/issuer name |
| invoice_number | Invoice ID/reference |
| invoice_date | Date of invoice |
| due_date | Payment due date |
| line_items | Array of {description, quantity, unit_price, total} |
| subtotal | Pre-tax total |
| tax | Tax amount |
| total | Grand total |
| currency | Detected currency (USD, EUR, etc.) |
CSV columns:
vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency
Each line item becomes a row, with invoice-level fields repeated.
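The row layout described above can be sketched in Python. This is a minimal illustration, not the code in `parse-invoice.py`: the function names `flatten_invoice` and `to_csv` and the sample invoice are hypothetical, and the sketch assumes the extracted invoice is available as a dict keyed by the field names in the table.

```python
import csv
import io

# Column order taken from the "CSV columns" line above.
CSV_COLUMNS = [
    "vendor_name", "invoice_number", "invoice_date", "due_date",
    "description", "quantity", "unit_price", "line_total",
    "subtotal", "tax", "total", "currency",
]

# Invoice-level fields that are repeated on every row.
INVOICE_FIELDS = (
    "vendor_name", "invoice_number", "invoice_date", "due_date",
    "subtotal", "tax", "total", "currency",
)


def flatten_invoice(invoice: dict) -> list[dict]:
    """Return one row per line item, repeating the invoice-level fields."""
    shared = {key: invoice[key] for key in INVOICE_FIELDS}
    rows = []
    for item in invoice["line_items"]:
        rows.append({
            **shared,
            "description": item["description"],
            "quantity": item["quantity"],
            "unit_price": item["unit_price"],
            # The line item's "total" is written to the "line_total" column.
            "line_total": item["total"],
        })
    return rows


def to_csv(invoices: list[dict]) -> str:
    """Serialize a list of invoices to CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for invoice in invoices:
        writer.writerows(flatten_invoice(invoice))
    return buffer.getvalue()


# Hypothetical sample data, for illustration only.
sample_invoice = {
    "vendor_name": "Example Vendor",
    "invoice_number": "INV-0001",
    "invoice_date": "2026-01-10",
    "due_date": "2026-02-09",
    "line_items": [
        {"description": "Item X", "quantity": 3, "unit_price": "10.00", "total": "30.00"},
        {"description": "Item Y", "quantity": 2, "unit_price": "10.00", "total": "20.00"},
    ],
    "subtotal": "50.00",
    "tax": "5.00",
    "total": "55.00",
    "currency": "USD",
}

if __name__ == "__main__":
    print(to_csv([sample_invoice]))
```

An invoice with two line items yields two rows that differ only in the description, quantity, unit_price, and line_total columns.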
Install system dependencies (pytesseract needs the Tesseract binary for OCR):
# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr
# macOS
brew install tesseract
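Before running with `--ocr`, it can help to confirm that the Tesseract binary is actually on `PATH`. The check below is a small sketch using only the standard library; the helper name `tesseract_available` is hypothetical, and whether the scripts perform a similar check themselves is not stated here.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with the given name is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found on PATH")
    else:
        print("tesseract not found; install it before using --ocr")
```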
Input: invoice_1234.pdf
Output (output.csv):
vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget A,10,25.00,250.00,450.00,45.00,495.00,USD
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget B,5,40.00,200.00,450.00,45.00,495.00,USD
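Because invoice-level fields repeat on every row, summing the `total` column directly would count each invoice once per line item. A consumer of this CSV may want to regroup rows by `invoice_number` first. The following is a rough sketch of that, with a simple check that line totals add up to the subtotal; the function names and the sample rows are hypothetical, and `Decimal` is used to avoid floating-point rounding on currency values.

```python
import csv
import io
from decimal import Decimal


def group_by_invoice(csv_text: str) -> dict[str, dict]:
    """Rebuild one record per invoice_number from the flattened CSV rows."""
    invoices: dict[str, dict] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Invoice-level fields are taken from the first row seen for each invoice.
        invoice = invoices.setdefault(row["invoice_number"], {
            "vendor_name": row["vendor_name"],
            "subtotal": Decimal(row["subtotal"]),
            "tax": Decimal(row["tax"]),
            "total": Decimal(row["total"]),
            "currency": row["currency"],
            "line_items": [],
        })
        invoice["line_items"].append({
            "description": row["description"],
            "quantity": Decimal(row["quantity"]),
            "unit_price": Decimal(row["unit_price"]),
            "line_total": Decimal(row["line_total"]),
        })
    return invoices


def line_totals_match_subtotal(invoice: dict) -> bool:
    """Return True if the line totals sum exactly to the invoice subtotal."""
    return sum(item["line_total"] for item in invoice["line_items"]) == invoice["subtotal"]


# Hypothetical rows in the documented column order, for illustration only.
SAMPLE_CSV = (
    "vendor_name,invoice_number,invoice_date,due_date,description,quantity,"
    "unit_price,line_total,subtotal,tax,total,currency\n"
    "Example Vendor,INV-0001,2026-01-10,2026-02-09,Item X,3,10.00,30.00,50.00,5.00,55.00,USD\n"
    "Example Vendor,INV-0001,2026-01-10,2026-02-09,Item Y,2,10.00,20.00,50.00,5.00,55.00,USD\n"
)

if __name__ == "__main__":
    for number, invoice in group_by_invoice(SAMPLE_CSV).items():
        status = "ok" if line_totals_match_subtotal(invoice) else "mismatch"
        print(number, invoice["total"], status)
```

A mismatch does not necessarily mean extraction failed (discounts or shipping lines may sit outside the line items), but it is a cheap signal for which invoices deserve a manual look.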
For MoltyWork projects requiring PDF data extraction:
Run parse-invoices.py on the directory:
python3 scripts/parse-invoices.py ./project_pdfs/ --output deliverable.xlsx