Complete pipeline for converting bank statement PDFs into structured Excel files with per-bank sheets.
Install dependencies in a managed Python venv:
pip install PyMuPDF openpyxl Pillow requests
Set environment variables for OCR API:
OCR_API_URL or OPENAI_BASE_URL: Vision API endpoint
OCR_API_KEY or OPENAI_API_KEY: API key
OCR_MODEL: Model name (default: gpt-4o)
Before starting any work, check if the user has configured the OCR API credentials.
Check environment variables:
OCR_API_KEY or OPENAI_API_KEY
OCR_API_URL or OPENAI_BASE_URL
OCR_MODEL (optional, defaults to gpt-4o)
If any required variable is missing (especially API_KEY), ask the user to provide the following information using ask_followup_question:
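Please provide the API configuration needed for OCR recognition:
1. API Key (required): the key used to call the vision model
2. API URL (optional): the API endpoint address; defaults to https://api.openai.com/v1/chat/completions
3. Model (optional): the model name; defaults to gpt-4o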
After receiving the user's input, set the environment variables for the current session (do NOT write them to any file or persist them):
# Set for current session only
$env:OCR_API_KEY = "user_provided_key"
$env:OCR_API_URL = "user_provided_url" # if provided
$env:OCR_MODEL = "user_provided_model" # if provided
Do NOT proceed to Step 1 until API_KEY is confirmed.
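The credential check above can be sketched in Python as follows. This is a minimal illustration, not the skill's own code: the function name resolve_ocr_config is hypothetical, and preferring the OCR_* variables over the OPENAI_* ones is an assumption. The default URL and model are the ones quoted in the prompt text above.

```python
import os

# Defaults quoted in the prompt text above.
DEFAULT_API_URL = "https://api.openai.com/v1/chat/completions"
DEFAULT_MODEL = "gpt-4o"


def resolve_ocr_config(env=None):
    """Return (config, missing) for the OCR API settings.

    Assumption: OCR_* variables take precedence over OPENAI_* ones.
    """
    env = os.environ if env is None else env
    api_key = env.get("OCR_API_KEY") or env.get("OPENAI_API_KEY")
    api_url = env.get("OCR_API_URL") or env.get("OPENAI_BASE_URL") or DEFAULT_API_URL
    model = env.get("OCR_MODEL") or DEFAULT_MODEL
    # Only the API key has no default, so it is the only hard requirement here.
    missing = [] if api_key else ["OCR_API_KEY or OPENAI_API_KEY"]
    return {"api_key": api_key, "api_url": api_url, "model": model}, missing
```

If the returned missing list is non-empty, that is the point at which the user would be asked for the key before continuing.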
Create directory structure in the workspace:
batch_v2/ # Will hold PNG images per PDF
batch_v2_results/ # Will hold OCR JSON results
Use PyMuPDF (fitz) to convert each PDF to PNG pages at 200 DPI.
For each PDF file, create a subdirectory under batch_v2/ (e.g., pdf1/, pdf2/) and save pages as page_001.png, page_002.png, etc.
import os
import fitz  # PyMuPDF
doc = fitz.open("input.pdf")
out_dir = "batch_v2/pdf1"
os.makedirs(out_dir, exist_ok=True)
for i in range(len(doc)):
    # Render each page at 200 DPI and save it as page_001.png, page_002.png, ...
    pix = doc[i].get_pixmap(dpi=200)
    pix.save(os.path.join(out_dir, f"page_{i+1:03d}.png"))
doc.close()
Important: If replacing an existing PDF's images, clear old PNGs first. Always verify actual page count matches PDF.
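The two precautions above (clearing stale PNGs and checking the page count) might look like the sketch below. The helper names clear_old_pngs and verify_page_count are hypothetical, and the sketch assumes the page_NNN.png naming used in this step.

```python
import glob
import os


def clear_old_pngs(out_dir):
    """Delete page_*.png files left over from a previous conversion."""
    removed = 0
    for path in glob.glob(os.path.join(out_dir, "page_*.png")):
        os.remove(path)
        removed += 1
    return removed


def verify_page_count(out_dir, expected_pages):
    """Return True if out_dir holds exactly expected_pages page_*.png files."""
    found = len(glob.glob(os.path.join(out_dir, "page_*.png")))
    return found == expected_pages
```

One way to use these is to call clear_old_pngs(out_dir) before rendering and verify_page_count(out_dir, page_count) afterwards, where page_count is read from len(doc) before the document is closed.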
Run scripts/batch_worker.py for each PDF label. The script supports auto-parallel mode — pages are distributed across multiple workers automatically.
Auto-parallel mode (recommended):
# Process all pages of a PDF (auto 3 workers)
python scripts/batch_worker.py <label> <start_page> <end_page>
# Example: process pages 1-59 with 3 workers (default)
python scripts/batch_worker.py pdf1 1 59
# Custom worker count
python scripts/batch_worker.py pdf1 1 59 --workers 5
Arguments: label start_page end_page [--workers N]
--workers N: number of parallel workers (default: 3, or the OCR_WORKERS env var)
Legacy single-worker mode (still supported):
python scripts/batch_worker.py W1 pdf1 1 2 3 4
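How batch_worker.py actually distributes pages across workers is not shown here. As an illustration only, one plausible scheme is to split the inclusive page range into contiguous chunks of near-equal size; split_pages is a hypothetical helper, not part of the script.

```python
def split_pages(start_page, end_page, workers=3):
    """Split an inclusive page range into up to `workers` contiguous chunks."""
    pages = list(range(start_page, end_page + 1))
    # Never use more workers than there are pages, and always at least one.
    workers = max(1, min(workers, len(pages)))
    base, extra = divmod(len(pages), workers)
    chunks, pos = [], 0
    for w in range(workers):
        # The first `extra` workers take one additional page each.
        size = base + (1 if w < extra else 0)
        chunks.append(pages[pos:pos + size])
        pos += size
    return chunks
```

Under this scheme, pages 1-59 with 3 workers would be split into chunks of 20, 20, and 19 pages.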
Key behaviors:
Results are saved to batch_v2_results/{label}_ocr.json immediately after processing
Check for type=error pages. Re-process any failed pages after completion
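Saving results immediately after processing could be done along the lines of the sketch below. It assumes the results file is a JSON list of per-page objects keyed by pdf_page (the shape used in the verification step); save_page_result is a hypothetical helper, and the script's real persistence logic may differ.

```python
import json
import os


def save_page_result(results_path, page_result):
    """Insert or replace one page's result and rewrite the JSON file."""
    results = []
    if os.path.exists(results_path):
        with open(results_path, "r", encoding="utf-8") as f:
            results = json.load(f)
    # Drop any earlier entry for the same page (e.g. a previous type=error attempt).
    results = [r for r in results if r.get("pdf_page") != page_result["pdf_page"]]
    results.append(page_result)
    results.sort(key=lambda r: r.get("pdf_page", 0))
    # Write to a temp file first so an interrupted run cannot leave a half-written file.
    tmp_path = results_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    os.replace(tmp_path, results_path)
    return results
```

This sketch is not safe for several workers writing the same file at once; a parallel run would need a lock or a single writer.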
After all pages are processed, verify by checking each page's bank name, record count, and error status:
import json
with open("batch_v2_results/pdf1_ocr.json", "r", encoding="utf-8") as f:
    data = json.load(f)
for p in data:
    print(f"page {p['pdf_page']}: bank={p.get('bank','')} type={p.get('type','')} records={len(p.get('records',[]))}")
Critical verification steps:
Check for type=error pages → re-process
Common OCR misidentification: CMB (招商银行) pages often get misidentified as CCB (建设银行) or ICBC (工商银行). See references/troubleshooting.md for details.
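To decide which pages to re-process, the failed pages can be collected from the loaded OCR results and collapsed into consecutive ranges that fit the start_page/end_page arguments of batch_worker.py. This is a sketch under the assumption that each page object carries pdf_page and type fields; both helper names are hypothetical.

```python
def find_failed_pages(pages):
    """Return the sorted pdf_page numbers whose OCR result has type == 'error'."""
    return sorted(p["pdf_page"] for p in pages if p.get("type") == "error")


def to_ranges(page_numbers):
    """Collapse page numbers into inclusive (start, end) runs of consecutive pages."""
    ranges = []
    for n in sorted(set(page_numbers)):
        if ranges and n == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges
```

Each resulting (start, end) pair corresponds to one re-run of the form python scripts/batch_worker.py <label> <start_page> <end_page>.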
Create overrides.json in workspace root to handle per-PDF settings, per-page bank corrections, and per-PDF maximum page numbers:
{
  "pdf_config": [
    {"key": "pdf1", "name": "文件名", "owner": "默认户名"}
  ],
  "page_bank_override": {
    "pdf3_10": "招商银行"
  },
  "pdf_max_page": {
    "pdf2": 12
  }
}
owner field explanation: The owner in pdf_config is the default account holder name for the PDF. During Excel generation, the actual page-level account holder (name field from OCR) is used with forward-inheritance:
If a page has a name value, it becomes the current page's owner
If a page has no name (empty), it inherits the owner from the previous page
If no page so far has a name, the owner from pdf_config is used as fallback
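The forward-inheritance rule can be sketched as a single pass over the pages in order. The function name resolve_owners is hypothetical; the sketch assumes each page object has a pdf_page number and an optional name field, and that inheritance is applied within one PDF at a time.

```python
def resolve_owners(pages, default_owner):
    """Return one owner per page, in page order, using forward-inheritance."""
    owners = []
    current = default_owner  # fallback from pdf_config until a page supplies a name
    for page in sorted(pages, key=lambda p: p["pdf_page"]):
        name = (page.get("name") or "").strip()
        if name:
            current = name  # a non-empty name becomes the owner from this page on
        owners.append(current)
    return owners
```

For example, pages with names "", "A", (missing), "B" and a default owner "D" would resolve to D, A, A, B.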
Key format for page_bank_override: "{pdf_label}_{page_number}": "correct_bank_name".
Important: Always ask the user to confirm the correct bank-per-page mapping and identify the account holder(s) for each PDF before generating Excel.
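Applying page_bank_override with the "{pdf_label}_{page_number}" key format could look like the following. This is an illustrative sketch: apply_bank_overrides is a hypothetical helper, and it assumes each page object has pdf_page and bank fields as in the verification output.

```python
def apply_bank_overrides(label, pages, overrides):
    """Return copies of pages with the bank replaced where an override exists."""
    bank_overrides = overrides.get("page_bank_override", {})
    fixed = []
    for page in pages:
        key = f"{label}_{page['pdf_page']}"  # e.g. "pdf3_10"
        page = dict(page)  # shallow copy so the OCR results stay untouched
        if key in bank_overrides:
            page["bank"] = bank_overrides[key]
        fixed.append(page)
    return fixed
```

Working on copies keeps the raw OCR JSON available for comparison when the user reviews the bank-per-page mapping.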
Run scripts/generate_excel.py:
python scripts/generate_excel.py --workspace /path/to/workspace
Output: One Excel file per PDF (named {原PDF名}_AI整理.xlsx, where {原PDF名} is the original PDF name). Each Excel file contains sheets named by bank.
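The internals of generate_excel.py are not shown here. As a rough sketch of the per-bank sheet layout, the records can be grouped by bank and written with openpyxl. Everything below is an assumption for illustration: the helper names, the "Unknown" fallback sheet name, the skipping of type=error pages, and the idea that each record is a flat dict whose keys become column headers.

```python
from collections import defaultdict


def group_records_by_bank(pages):
    """Map bank name -> list of records, skipping type=error pages."""
    by_bank = defaultdict(list)
    for page in sorted(pages, key=lambda p: p["pdf_page"]):
        if page.get("type") == "error":
            continue
        by_bank[page.get("bank") or "Unknown"].extend(page.get("records", []))
    return dict(by_bank)


def write_workbook(by_bank, out_path):
    """Write one sheet per bank; columns are the union of record keys."""
    from openpyxl import Workbook  # imported lazily; installed in the setup step

    wb = Workbook()
    if by_bank:
        wb.remove(wb.active)  # drop the default empty sheet
    for bank, records in by_bank.items():
        ws = wb.create_sheet(title=bank[:31])  # Excel limits sheet titles to 31 chars
        columns = []
        for rec in records:
            for key in rec:
                if key not in columns:
                    columns.append(key)
        ws.append(columns)
        for rec in records:
            # Assumes flat records: cell values must be scalars, not lists or dicts.
            ws.append([rec.get(col, "") for col in columns])
    wb.save(out_path)
```

A typical call would group the pages of one PDF, after any bank overrides, and pass the result to write_workbook with the output file name.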
scripts/batch_worker.py: OCR worker for processing PNG pages via vision API
scripts/generate_excel.py: Excel generator from OCR JSON results
references/troubleshooting.md: OCR misidentification patterns and solutions