
Invoice Extractor

Extract structured data from invoices and receipts (PDFs and images). Output JSON or CSV, or build a running expense ledger. Use when someone shares an invoice.
99rebels
Uncategorized · clawhub v1.2.0 · 1 version · Key: not required

Overview

Invoice Extractor 📄

Turn invoices and receipts into structured expense data. Extract from PDFs and images, auto-categorize spending, and maintain a running CSV ledger.

Hybrid approach: A Python script handles PDF text extraction and ledger management, while you (the agent) parse the invoice content — LLMs understand varied formats far better than regex.


When to Use

  • "Extract data from this invoice"
  • "Track my expenses" / "Add to my expense ledger"
  • "Categorize this receipt"
  • "Process these invoices" / "Batch process receipts"
  • "Show me my spending summary"
  • "Prepare tax documents" / "Get my expenses for April"

Setup

pip install pdfplumber
# Fallback: PyPDF2 (auto-used if pdfplumber unavailable)

Script: scripts/extract.py (relative to this skill directory)

Config: expense-config.json (same directory)


⚡ Single Invoice Workflow

PDF Invoices

python3 scripts/extract.py pdf <file-path>

Read the output text, parse it into structured JSON (see schema below), then confirm with the user before adding to ledger.

Image Invoices (jpg, png, webp, gif)

Use the image tool with a prompt like:

"Extract all invoice/receipt data from this image. Return vendor, invoice number, date, line items, subtotal, tax, total, and currency."

Parse the result into structured JSON, then confirm with the user before adding to ledger.

🔒 Confirm Then Add

Always present extracted data for user review before writing to the ledger:

📋 Invoice Extracted
Vendor: Amazon
Date: 2026-04-01
Invoice #: INV-2026-001
Description: Office supplies — keyboard and monitor
Total: €539.96 (incl. €100.97 tax)
Category: office (auto)

Add to ledger? (yes/edit/skip)

Format output for the current channel — adapt formatting to match what the platform supports. See references/formatting.md for platform-specific examples.

On confirmation, write the JSON to a temp file and run:

python3 scripts/extract.py ledger add /tmp/invoice-entry.json

Or pipe via stdin:

echo '<json>' | python3 scripts/extract.py ledger add -

If the user says "edit", modify the requested fields and re-confirm. If "skip", discard.
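
The hand-off after a "yes" can be sketched as follows. This is a minimal illustration, assuming the agent holds the confirmed entry as a Python dict; the helper name `stage_entry` is hypothetical, and only the `ledger add` subcommand comes from the text above.

```python
import json
import tempfile
from pathlib import Path


def stage_entry(entry: dict, script: str = "scripts/extract.py"):
    """Write a confirmed entry to a temp JSON file and build the add command.

    Hypothetical helper: it only prepares the call. The caller decides
    whether to run it (for example with subprocess.run) after the user says yes.
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", prefix="invoice-entry-", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(entry, handle, ensure_ascii=False)
        path = Path(handle.name)
    command = ["python3", script, "ledger", "add", str(path)]
    return path, command


entry = {"vendor": "Amazon", "date": "2026-04-01", "total": 539.96, "currency": "EUR"}
path, command = stage_entry(entry)
print(" ".join(command))
```

Building the command as a list rather than a shell string avoids quoting problems when a vendor name or description contains quotes.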


📦 Batch Processing

python3 scripts/extract.py batch <folder-path>
  1. Run the batch command to get a JSON list of all PDFs and images
  2. Process each file one at a time (PDFs via pdf command, images via image tool)
  3. Collect all results — do NOT confirm each one individually
  4. Present a summary of ALL extracted data at the end
  5. Ask the user to confirm once: add all, edit specific entries, or skip

Show this summary after processing all files:

📦 Batch Results — 8 files processed

1. Amazon EU S.a.r.l.  —  €191.84  —  office
2. Tesco              —  €25.26   —  food
3. DigitalOcean LLC    —  €35.81   —  software
4. Insomnia Coffee     —  €9.84    —  food
5. ACME Solutions Ltd  —  €3,867.11 —  uncategorized ⚠️
... (errors shown separately)

Total: €4,129.86 across 5 entries (1 error)

Add all to ledger? (yes/edit/skip)

On confirmation, add all entries at once. If the user wants to edit, modify specific entries and re-confirm.


📊 Viewing Expenses & Summaries

View entries with optional filters:

python3 scripts/extract.py ledger view [filters]
--from DATE       Entries from this date (YYYY-MM-DD)
--to DATE         Entries up to this date
--category CAT    Filter by category name
--vendor VENDOR   Filter by vendor (partial match)
--format json|csv Output format (default: json)

Edit an entry:

python3 scripts/extract.py ledger edit --id N --vendor "New Name"
python3 scripts/extract.py ledger edit --id N --total 250.00 --category software
python3 scripts/extract.py ledger edit --id N --date 2026-04-02

Editable fields: --vendor, --total, --date, --description, --category, --currency, --subtotal, --tax. Several fields can be changed in one command, and the dedup hash is recalculated automatically.

Delete an entry:

python3 scripts/extract.py ledger delete --id N

Removes the entry, renumbers the remaining IDs, and creates a backup.

Undo last add:

python3 scripts/extract.py ledger undo

Removes the most recently added entry (highest ID). One-level undo only.

Category summaries:

python3 scripts/extract.py ledger summary [--period week|month|year]
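
To illustrate what a category summary computes, here is a rough sketch. The script's own implementation is not shown in this document, so the function name `summarize` and the use of an ISO date prefix (such as "2026-04") to select a period are assumptions.

```python
from collections import defaultdict


def summarize(entries, period_prefix=""):
    """Sum totals per category for entries whose date starts with period_prefix.

    Assumption: dates are YYYY-MM-DD strings, so "2026-04" selects April 2026
    and "2026" selects the whole year. An empty prefix selects everything.
    """
    totals = defaultdict(float)
    for entry in entries:
        if entry["date"].startswith(period_prefix):
            category = entry.get("category") or "uncategorized"
            totals[category] += float(entry["total"])
    # Round at the end so small float errors do not leak into the report.
    return {name: round(amount, 2) for name, amount in sorted(totals.items())}


ledger = [
    {"vendor": "Amazon EU S.a.r.l.", "date": "2026-04-01", "total": 191.84, "category": "office"},
    {"vendor": "Tesco", "date": "2026-04-02", "total": 25.26, "category": "food"},
    {"vendor": "Insomnia Coffee", "date": "2026-04-03", "total": 9.84, "category": "food"},
    {"vendor": "DigitalOcean LLC", "date": "2026-03-28", "total": 35.81, "category": "software"},
]
print(summarize(ledger, "2026-04"))
```

Because totals are simply added, a credit note with a negative total reduces its category's sum.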

JSON Schema

Structure all extracted invoice data as:

{
  "vendor": "Amazon",
  "invoiceNumber": "INV-2026-001",
  "date": "2026-04-01",
  "dueDate": "2026-04-30",
  "description": "Office supplies — keyboard and monitor",
  "lineItems": [
    {"description": "Mechanical Keyboard", "quantity": 1, "unitPrice": 89.99},
    {"description": "USB-C Monitor", "quantity": 1, "unitPrice": 349.00}
  ],
  "subtotal": 438.99,
  "tax": 100.97,
  "total": 539.96,
  "currency": "EUR",
  "category": "office"
}

Required for ledger: vendor, total, date

Optional: everything else — the script handles missing fields gracefully
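
Before showing an entry for confirmation, it can help to check the required fields and the arithmetic. The following is a minimal sketch; `validate_entry` is a hypothetical helper, not part of scripts/extract.py.

```python
from datetime import datetime

REQUIRED_FIELDS = ("vendor", "total", "date")


def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        if entry.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    if entry.get("date"):
        try:
            datetime.strptime(str(entry["date"]), "%Y-%m-%d")
        except ValueError:
            problems.append("date is not YYYY-MM-DD")
    # Cross-check the arithmetic only when all three amounts are numbers.
    amounts = [entry.get(key) for key in ("subtotal", "tax", "total")]
    if all(isinstance(value, (int, float)) for value in amounts):
        subtotal, tax, total = amounts
        if abs(subtotal + tax - total) > 0.005:
            problems.append("subtotal + tax does not equal total")
    return problems


sample = {
    "vendor": "Amazon",
    "date": "2026-04-01",
    "subtotal": 438.99,
    "tax": 100.97,
    "total": 539.96,
}
print(validate_entry(sample))
```

Any problems returned here are better surfaced in the confirmation prompt than silently fixed.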


🏷️ Auto-Categorization

The script auto-categorizes entries by matching keywords defined in expense-config.json. It checks the vendor name and description against each category's keywords (case-insensitive).

python3 scripts/extract.py categories

Users can customize by editing the config. Suggest adding new keywords when a vendor doesn't match.
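
The keyword matching described above might look roughly like this. The shape of the category table (a mapping from category name to a keyword list) is an assumption, since the real layout is documented in references/configuration.md, and `categorize` is a hypothetical name.

```python
def categorize(vendor: str, description: str, categories: dict) -> str:
    """Return the first category whose keyword appears in vendor or description.

    Matching is case-insensitive substring matching. Assumed config shape:
    {"office": ["amazon", "keyboard"], "food": ["tesco", "coffee"]}.
    """
    haystack = f"{vendor} {description}".lower()
    for name, keywords in categories.items():
        if any(keyword.lower() in haystack for keyword in keywords):
            return name
    return "uncategorized"


example_categories = {
    "office": ["amazon", "keyboard"],
    "food": ["tesco", "coffee"],
    "software": ["digitalocean"],
}
print(categorize("ACME Solutions Ltd", "Consulting", example_categories))
```

When the result is "uncategorized", that is a good moment to suggest a new keyword for the config.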


📤 Exporting the Ledger

Export ledger entries in platform-specific CSV formats for direct import into accounting software.

python3 scripts/extract.py ledger export --platform <name> [filters] [--output FILE]

Filters: --from DATE, --to DATE, --category CAT, --vendor VENDOR

Built-in Platforms

Platform    Use Case                Notes
----------  ----------------------  ------------------------------------------------
xero        Bills/Expenses import   DD/MM/YYYY dates, includes AccountCode & TaxRate
freeagent   Out-of-pocket expenses  No header row, needs claimantName in config
wave        Bank transactions       Negative amounts for expenses
generic     Excel/Google Sheets     Full detail, clean format

Examples

# Export all entries for Xero
python3 scripts/extract.py ledger export --platform xero

# Export April expenses to a file
python3 scripts/extract.py ledger export --platform xero --from 2026-04-01 --to 2026-04-30 --output /tmp/xero-export.csv

# Filter by category for FreeAgent
python3 scripts/extract.py ledger export --platform freeagent --category travel --output /tmp/freeagent-travel.csv

Custom Presets

Define custom export formats in expense-config.json under exportPresets:

{
  "exportPresets": {
    "my-accounting": {
      "columns": ["date", "vendor", "amount", "category", "notes"],
      "headerRow": true,
      "dateFormat": "%m/%d/%Y",
      "amountHandling": "positive",
      "fieldMapping": {
        "date": "date",
        "vendor": "vendor",
        "amount": "total",
        "category": "category",
        "notes": "description"
      }
    }
  }
}

The fieldMapping maps CSV column names → ledger field names. Use: --platform my-accounting
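
To make the mapping direction concrete, here is a sketch of how a preset like the one above could be applied to ledger entries. It is an illustration only: `export_rows` is a hypothetical function, and it ignores amountHandling because the accepted values are not listed in this document.

```python
import csv
import io
from datetime import datetime


def export_rows(entries: list, preset: dict) -> str:
    """Render ledger entries as CSV text using a preset's columns and fieldMapping."""
    columns = preset["columns"]
    mapping = preset["fieldMapping"]  # CSV column name -> ledger field name
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if preset.get("headerRow", True):
        writer.writerow(columns)
    for entry in entries:
        row = []
        for column in columns:
            field = mapping.get(column, column)
            value = entry.get(field, "")
            if field == "date" and value:
                # Ledger dates are YYYY-MM-DD; reformat for the target platform.
                value = datetime.strptime(value, "%Y-%m-%d").strftime(preset["dateFormat"])
            row.append(value)
        writer.writerow(row)
    return buffer.getvalue()


my_accounting = {
    "columns": ["date", "vendor", "amount", "category", "notes"],
    "headerRow": True,
    "dateFormat": "%m/%d/%Y",
    "fieldMapping": {
        "date": "date",
        "vendor": "vendor",
        "amount": "total",
        "category": "category",
        "notes": "description",
    },
}
rows = [{"date": "2026-04-01", "vendor": "Amazon", "total": 539.96,
         "category": "office", "description": "Office supplies"}]
print(export_rows(rows, my_accounting))
```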

Sending the File

If no --output is specified, CSV goes to stdout. For file attachments:

  1. Use --output /tmp/invoice-export-<platform>-<date>.csv
  2. Send via MEDIA:

Here's your Xero import file (12 entries, April 2026).
MEDIA:/tmp/invoice-export-xero-20260406.csv

🔍 Unknown Platform? (LLM Discovery Flow)

If the user names a platform that isn't built-in and isn't in their custom presets:

  1. Use web_search to find "[platform name] CSV import format expenses"
  2. Identify the required columns and their format
  3. Create a fieldMapping from their CSV column names to our ledger fields
  4. Add the preset to the user's expense-config.json under exportPresets
  5. Tell the user the preset was created and saved
  6. Proceed with the export using the new preset

⚙️ Config

The config file (expense-config.json) lives in the skill root directory. See references/configuration.md for the full config reference.

# Use a custom config
python3 scripts/extract.py --config /path/to/config.json <command>

⚠️ Important Notes

  • Always confirm before adding to ledger — never auto-add extracted data
  • Duplicate detection — entries are auto-checked against existing ledger (vendor + date + total hash). Duplicates are skipped with a warning. Use --force to override
  • Dates must be YYYY-MM-DD — convert if the invoice uses a different format
  • Currency symbols — normalize to ISO codes (€ → EUR, £ → GBP, $ → USD)
  • Backups — the script automatically backs up the ledger before each write (keeps last 5)
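
Duplicate detection on vendor + date + total can be pictured with the sketch below. The normalization steps and the choice of SHA-256 are assumptions; the script's actual hash may differ, and `dedup_key` and `is_duplicate` are hypothetical names.

```python
import hashlib


def dedup_key(entry: dict) -> str:
    """Build a stable key from vendor, date, and total.

    Assumed normalization: vendor is trimmed and lowercased, total is
    formatted to two decimals, so cosmetic differences do not defeat the check.
    """
    vendor = str(entry["vendor"]).strip().lower()
    total = f"{float(entry['total']):.2f}"
    raw = f"{vendor}|{entry['date']}|{total}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def is_duplicate(entry: dict, ledger: list) -> bool:
    """Return True when an entry with the same key is already in the ledger."""
    key = dedup_key(entry)
    return any(dedup_key(existing) == key for existing in ledger)


existing = [{"vendor": "Amazon", "date": "2026-04-01", "total": 539.96}]
candidate = {"vendor": " amazon ", "date": "2026-04-01", "total": "539.960"}
print(is_duplicate(candidate, existing))
```

Since the key depends on all three fields, editing any of them changes it, which is why an edit has to recalculate the hash.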

For edge cases (encrypted PDFs, scanned/image-only PDFs, dependency errors), see references/notes.md.


Edge Cases

Ambiguous Dates

  • "03/04/2026" is ambiguous (March 4 US, April 3 EU)
  • If the invoice doesn't specify a format, check the config defaults.dateFormat
  • If still unclear, ask the user: "Is this March 4th or April 3rd?"
  • Common formats: DD/MM/YYYY (Ireland, UK, EU), MM/DD/YYYY (US), YYYY-MM-DD (ISO — always prefer this)
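
The decision order above (use the date if it is unambiguous, fall back to the configured default, otherwise ask) can be sketched as follows. It assumes defaults.dateFormat holds a strptime-style pattern such as "%d/%m/%Y", which this document does not confirm; `to_iso` is a hypothetical helper.

```python
from datetime import datetime
from typing import Optional

CANDIDATES = ("%d/%m/%Y", "%m/%d/%Y")


def to_iso(raw: str, default_format: Optional[str] = None) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None when the user should be asked."""
    raw = raw.strip()
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date().isoformat()
    except ValueError:
        pass
    parsed = set()
    for pattern in CANDIDATES:
        try:
            parsed.add(datetime.strptime(raw, pattern).date())
        except ValueError:
            continue
    if len(parsed) == 1:
        # Only one reading is valid (or both readings agree).
        return parsed.pop().isoformat()
    if len(parsed) > 1 and default_format in CANDIDATES:
        return datetime.strptime(raw, default_format).date().isoformat()
    return None


print(to_iso("25/04/2026"))
print(to_iso("03/04/2026"))
print(to_iso("03/04/2026", "%d/%m/%Y"))
```

A `None` result is the cue to ask the user which reading is correct rather than guess.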

Missing Fields

  • If no invoice number: leave blank in JSON, the script handles it
  • If no line items: just use the description field
  • If no tax breakdown: set tax to 0 and note "tax not specified"
  • If no currency: use the config default (EUR)
  • If no vendor name but there's a company logo in the image: make a best-effort guess from context
  • Always show the user what was extracted — even incomplete data — and let them confirm or edit
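
As a sketch of these defaults, the helper below fills gaps and collects notes to show in the confirmation prompt. `fill_defaults` is a hypothetical name, and returning the notes separately (rather than storing them in the entry) is an assumption made for illustration.

```python
CURRENCY_SYMBOLS = {"€": "EUR", "£": "GBP", "$": "USD"}


def fill_defaults(entry: dict, default_currency: str = "EUR"):
    """Return a copy of the entry with gaps filled, plus notes for the user."""
    filled = dict(entry)
    notes = []
    # A missing invoice number is left blank rather than invented.
    filled.setdefault("invoiceNumber", "")
    if filled.get("tax") is None:
        filled["tax"] = 0
        notes.append("tax not specified")
    # Fall back to the configured currency, then normalize symbols to ISO codes.
    currency = filled.get("currency") or default_currency
    filled["currency"] = CURRENCY_SYMBOLS.get(currency, currency)
    return filled, notes


filled, notes = fill_defaults(
    {"vendor": "Tesco", "date": "2026-04-02", "total": 25.26, "currency": "€"}
)
print(filled, notes)
```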

Credit Notes and Refunds

  • Negative totals indicate a credit/refund
  • Still add to ledger — negative entries are valid expenses (they reduce totals)
  • Categorize as normal based on vendor
  • In the confirmation prompt, note it's a credit: "⚠️ Credit note detected (negative total)"

Multi-page PDFs

  • pdfplumber extracts text from all pages into one output
  • The LLM sees all text and can find totals on any page
  • No special handling needed — it just works

Non-invoice PDFs

  • If the extracted text doesn't look like an invoice (no vendor, no amounts, no date), tell the user: "This doesn't appear to be an invoice or receipt. Want to skip it?"
  • Don't force extraction on something that clearly isn't an invoice

Very Small Receipts

  • Coffee receipts, parking tickets — often low-quality images or tiny text
  • The LLM should still attempt extraction but flag low confidence: "⚠️ Low confidence — please verify the amounts"

Version History

1 version in total

  • v1.2.0 (current)
    2026-05-07 06:32, security scan: safe
