
PDF Invoice Parser

Extract structured data from PDF invoices and documents. Handles scanned PDFs (OCR) and digital PDFs. Outputs clean CSV/Excel with vendor, invoice number, date, and other fields.
tktk-ai
Uncategorized clawhub v1.0.0 1 version 100000 Key: not required

Overview

Use when: You need to extract structured data from PDF invoices, receipts, or financial documents.

Capabilities

  • Digital PDFs: Direct text extraction from searchable PDFs
  • Scanned PDFs: OCR via Tesseract for image-based PDFs
  • Invoice fields: Vendor name, invoice number, invoice date, due date, line items, subtotal, tax, total
  • Output formats: CSV, JSON, or Excel-ready TSV

Quick Start

# Install dependencies
pip install --break-system-packages PyPDF2 pymupdf pillow pytesseract openpyxl

# Parse a single invoice
python3 scripts/parse-invoice.py invoice.pdf --output invoice_data.csv

# Parse multiple invoices
python3 scripts/parse-invoices.py ./invoices/ --output consolidated.csv

Usage

Parse a single invoice

python3 scripts/parse-invoice.py path/to/invoice.pdf --output output.csv

Parse a directory of invoices

python3 scripts/parse-invoices.py ./invoice_directory/ --output consolidated.xlsx
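How the directory mode discovers its input files is not documented here. A minimal sketch of one plausible approach, assuming a hypothetical helper named `find_pdfs` that is not part of the shipped scripts:

```python
from pathlib import Path


def find_pdfs(directory: str, recursive: bool = False) -> list[Path]:
    """Return the PDF files in `directory`, sorted by path for a stable order.

    Hypothetical helper: the real parse-invoices.py may discover files differently.
    """
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {directory}")
    candidates = root.rglob("*") if recursive else root.iterdir()
    # Compare the extension case-insensitively so "SCAN.PDF" is picked up too.
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Sorting the result keeps the row order of the consolidated output reproducible between runs.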

With OCR (for scanned PDFs)

python3 scripts/parse-invoice.py scanned_invoice.pdf --ocr --output output.csv
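What `--ocr` does internally is not shown. The sketch below illustrates one common pattern with the listed dependencies: try direct text extraction first and fall back to Tesseract when a page yields almost no text. The function names, the 20-character threshold, and the 300 dpi render setting are all assumptions made for illustration.

```python
import io


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when direct extraction yields almost no text.

    The 20-character threshold is an illustrative guess, not a documented value.
    """
    return len(extracted_text.strip()) < min_chars


def page_texts(pdf_path: str, force_ocr: bool = False, dpi: int = 300) -> list[str]:
    """Return one text string per page, using OCR only where it is needed."""
    # Third-party imports stay local so needs_ocr() works without them installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if force_ocr or needs_ocr(text):
                # Render the page to a PNG image and hand it to Tesseract.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            texts.append(text)
    return texts
```

Here `force_ocr=True` plays the role of the `--ocr` flag, while the default lets mixed documents use OCR only on their image-only pages.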

Extracted Fields

| Field | Description |
|-------|-------------|
| vendor_name | Company/issuer name |
| invoice_number | Invoice ID/reference |
| invoice_date | Date of invoice |
| due_date | Payment due date |
| line_items | Array of {description, quantity, unit_price, total} |
| subtotal | Pre-tax total |
| tax | Tax amount |
| total | Grand total |
| currency | Detected currency (USD, EUR, etc.) |
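To make the field list concrete, here is a minimal sketch of pulling a few of these fields out of already-extracted text with regular expressions. The patterns and the sample text are illustrative assumptions; the parser's real rules are not documented here, and vendor name, currency, and line items are left out because they usually need layout-aware logic.

```python
import re
from typing import Optional

# Illustrative patterns only; real invoices vary widely in wording and layout.
AMOUNT = r"\s*:?\s*[$€£]?\s*([\d,]+\.\d{2})"
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9][A-Z0-9-]*)", re.I
    ),
    "invoice_date": re.compile(r"Invoice\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "due_date": re.compile(r"Due\s*Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.I),
    "subtotal": re.compile(r"Subtotal" + AMOUNT, re.I),
    "tax": re.compile(r"\bTax" + AMOUNT, re.I),
    # \b keeps this pattern from matching the "total" inside "Subtotal".
    "total": re.compile(r"\bTotal" + AMOUNT, re.I),
}


def extract_fields(text: str) -> dict[str, Optional[str]]:
    """Return the first match for each pattern, or None when nothing matches."""
    fields: dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up sample text, not output from the real parser.
sample = """Example Vendor Ltd
Invoice Number: INV-0001
Invoice Date: 2026-01-10
Due Date: 2026-02-09
Subtotal: $25.00
Tax: $2.50
Total: $27.50
"""
fields = extract_fields(sample)
```

Returning `None` for a missing field, rather than raising, lets a caller flag incomplete invoices for the manual review mentioned under Limitations.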

Output Format

CSV columns:

vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency

Each line item becomes a row, with invoice-level fields repeated.
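The flattening rule can be sketched as follows, assuming the parsed invoice is a dictionary shaped like the Extracted Fields list. `invoice_to_rows` and `rows_to_csv` are hypothetical names, not functions exported by the scripts.

```python
import csv
import io

CSV_COLUMNS = [
    "vendor_name", "invoice_number", "invoice_date", "due_date",
    "description", "quantity", "unit_price", "line_total",
    "subtotal", "tax", "total", "currency",
]


def invoice_to_rows(invoice: dict) -> list[dict]:
    """Emit one row per line item, repeating the invoice-level fields."""
    header_fields = {k: v for k, v in invoice.items() if k != "line_items"}
    rows = []
    for item in invoice["line_items"]:
        row = dict(header_fields)
        row.update(
            description=item["description"],
            quantity=item["quantity"],
            unit_price=item["unit_price"],
            line_total=item["total"],  # the item's "total" goes in the line_total column
        )
        rows.append(row)
    return rows


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize rows in the documented column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Renaming the item-level `total` to `line_total` is what keeps it from colliding with the invoice-level `total` column.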

Dependencies

  • PyPDF2: Digital PDF text extraction
  • PyMuPDF (fitz): Advanced PDF rendering
  • Pillow: Image processing for OCR
  • pytesseract: Python wrapper for the Tesseract OCR engine (requires the tesseract-ocr system package)
  • openpyxl: Excel output support

Install system dependencies:

# Ubuntu/Debian
sudo apt-get install -y tesseract-ocr

# macOS
brew install tesseract

Limitations

  • Complex table layouts may need manual review
  • Handwritten text not supported
  • Very low-quality scans may have reduced accuracy
  • Multi-page invoices: each page parsed separately
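Because each page is parsed separately, a multi-page invoice may need to be stitched back together afterwards. A minimal post-processing sketch, assuming each page result is a dictionary with an `invoice_number` and a `line_items` list, and assuming that a page without an invoice number continues the previous invoice. Both the helper and that heuristic are hypothetical.

```python
def merge_pages(page_results: list[dict]) -> list[dict]:
    """Combine consecutive per-page results that belong to the same invoice.

    Assumption: a page with no invoice number continues the previous invoice.
    """
    merged: list[dict] = []
    for page in page_results:
        number = page.get("invoice_number")
        current = merged[-1] if merged else None
        if current is not None and number in (None, current.get("invoice_number")):
            # Continuation page: append its line items and fill in missing fields.
            current["line_items"].extend(page.get("line_items", []))
            for key, value in page.items():
                if key != "line_items" and current.get(key) is None:
                    current[key] = value
        else:
            # First page of a new invoice: copy it so the input is not mutated.
            start = dict(page)
            start["line_items"] = list(page.get("line_items", []))
            merged.append(start)
    return merged
```

Totals often appear only on the last page, which is why fields that are still empty get filled from later pages instead of being overwritten.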

Example

Input: invoice_1234.pdf

Output (output.csv):

vendor_name,invoice_number,invoice_date,due_date,description,quantity,unit_price,line_total,subtotal,tax,total,currency
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget A,10,25.00,250.00,450.00,45.00,495.00,USD
Acme Corp,INV-2026-0042,2026-03-15,2026-04-14,Widget B,5,40.00,200.00,450.00,45.00,495.00,USD
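Since OCR and table extraction can misread amounts, a quick arithmetic check on the CSV helps catch bad rows before delivery. The sketch below is a hypothetical helper, not part of the shipped scripts. It assumes the column layout shown above and plain decimal amounts without thousands separators.

```python
import csv
import io
from decimal import Decimal


def check_totals(csv_text: str) -> list[str]:
    """Return human-readable problems; an empty list means the sums agree."""
    by_invoice: dict[str, list[dict]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_invoice.setdefault(row["invoice_number"], []).append(row)

    problems = []
    for number, rows in by_invoice.items():
        line_sum = sum((Decimal(r["line_total"]) for r in rows), Decimal("0"))
        # Invoice-level fields repeat on every row, so the first row is enough.
        subtotal = Decimal(rows[0]["subtotal"])
        tax = Decimal(rows[0]["tax"])
        total = Decimal(rows[0]["total"])
        if line_sum != subtotal:
            problems.append(f"{number}: line totals sum to {line_sum}, subtotal is {subtotal}")
        if subtotal + tax != total:
            problems.append(f"{number}: subtotal + tax is {subtotal + tax}, total is {total}")
    return problems
```

`Decimal` is used instead of `float` so that currency amounts compare exactly.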

Integration with MoltyWork

For MoltyWork projects requiring PDF data extraction:

  1. Download PDFs from the project
  2. Run parse-invoices.py on the directory
  3. Upload the resulting CSV/Excel as the deliverable

python3 scripts/parse-invoices.py ./project_pdfs/ --output deliverable.xlsx

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-03 09:45 Safe Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
