Azure Document OCR

Extract text and structured data from documents using Azure Document Intelligence (formerly Form Recognizer). Supports OCR for PDFs, images, scanned document...
li-hongmin
Content creation · clawhub · v1.0.0 · 1 version · 99907.7 · Key: required

Overview

Azure Document Intelligence OCR

Extract text and structured data from documents using Azure Document Intelligence REST API.

Quick Start

1. Environment Setup

Set your Azure Document Intelligence credentials:

export AZURE_DOC_INTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com"
export AZURE_DOC_INTEL_KEY="your-api-key"
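
A script that depends on these two variables would typically fail fast when either is missing. The following is a minimal sketch of such a check, not the actual code in scripts/ocr_extract.py; the function name `load_config` and the returned dictionary shape are assumptions made for illustration.

```python
import os


def load_config(env=None):
    """Read the two variables named in this README; raise if either is unset.

    `load_config` is a hypothetical helper, not part of the skill's scripts.
    """
    env = os.environ if env is None else env
    required = ("AZURE_DOC_INTEL_ENDPOINT", "AZURE_DOC_INTEL_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        # Strip a trailing slash so later URL joins do not produce "//".
        "endpoint": env["AZURE_DOC_INTEL_ENDPOINT"].rstrip("/"),
        "key": env["AZURE_DOC_INTEL_KEY"],
    }
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise without touching the real environment.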

2. Single File OCR

# Basic text extraction from PDF
python scripts/ocr_extract.py document.pdf

# Extract with layout (tables, structure)
python scripts/ocr_extract.py document.pdf --model prebuilt-layout --format markdown

# Process invoice
python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json

# OCR from URL
python scripts/ocr_extract.py --url "https://example.com/document.pdf"

# Save output to file
python scripts/ocr_extract.py document.pdf --output result.txt

# Extract specific pages
python scripts/ocr_extract.py document.pdf --pages 1-3,5
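
The `--pages` value combines single pages and inclusive ranges separated by commas. How the script interprets it internally is not documented here; the sketch below shows one plausible way to expand such a specification into page numbers. The name `parse_pages` is hypothetical.

```python
def parse_pages(spec):
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers.

    Hypothetical helper for illustration; ranges are treated as inclusive.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1-3,5"))  # [1, 2, 3, 5]
```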

3. Batch Processing

# Process all documents in a folder
python scripts/batch_ocr.py ./documents/

# Custom output directory and format
python scripts/batch_ocr.py ./documents/ --output-dir ./extracted/ --format markdown

# Use layout model with 8 workers
python scripts/batch_ocr.py ./documents/ --model prebuilt-layout --workers 8

# Filter specific extensions
python scripts/batch_ocr.py ./documents/ --ext .pdf,.png
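
The batch options above (extension filter, worker count) map naturally onto a thread pool. The following is a sketch under assumptions, not the contents of scripts/batch_ocr.py: `find_documents` and `process_all` are hypothetical names, and `process_one` stands in for whatever function performs OCR on a single file.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_documents(folder, extensions=(".pdf", ".png")):
    """Return files in `folder` whose extension is in `extensions` (case-insensitive)."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )


def process_all(paths, process_one, workers=4):
    """Run `process_one` on every path in parallel and collect per-file outcomes."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future, path in futures.items():
            try:
                results[path] = ("ok", future.result())
            except Exception as exc:  # one failed file should not stop the batch
                results[path] = ("error", str(exc))
    return results
```

Recording failures per file rather than raising lets a long batch finish and report which documents need a retry.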

Model Selection Guide

| Document Type | Recommended Model | Use Case |
|---|---|---|
| General text | prebuilt-read | Pure text extraction, any document |
| Structured docs | prebuilt-layout | Tables, forms, paragraphs, figures |
| Invoices | prebuilt-invoice | Vendor info, line items, totals |
| Receipts | prebuilt-receipt | Merchant, items, totals, dates |
| IDs/Passports | prebuilt-idDocument | Identity documents |
| Business cards | prebuilt-businessCard | Contact information |
| W-2 forms | prebuilt-tax.us.w2 | US tax documents |
| Insurance cards | prebuilt-healthInsuranceCard.us | Health insurance info |

See references/models.md for detailed model documentation.
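
When the document type is known ahead of time, the guide above can be encoded as a lookup. This is only an illustrative sketch: the dictionary keys and the `choose_model` helper are hypothetical, and falling back to prebuilt-read for unknown types is an assumption based on its "any document" use case.

```python
# Model IDs are taken from the Model Selection Guide; the keys are made up
# for this example and are not options the scripts are known to accept.
MODEL_BY_DOCUMENT_TYPE = {
    "text": "prebuilt-read",
    "structured": "prebuilt-layout",
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id": "prebuilt-idDocument",
    "business_card": "prebuilt-businessCard",
    "w2": "prebuilt-tax.us.w2",
    "insurance_card": "prebuilt-healthInsuranceCard.us",
}


def choose_model(document_type):
    """Return a model ID for `document_type`, defaulting to prebuilt-read."""
    key = document_type.strip().lower()
    return MODEL_BY_DOCUMENT_TYPE.get(key, "prebuilt-read")


print(choose_model("Invoice"))  # prebuilt-invoice
```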

Supported Input Formats

  • PDF: .pdf (including scanned PDFs)
  • Images: .png, .jpg, .jpeg, .tiff, .bmp
  • URLs: Direct links to documents

Output Formats

  • text: Plain text concatenation of all extracted content
  • markdown: Structured output with headers and tables (best with layout model)
  • json: Raw API response with full extraction details
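
To illustrate how the json and markdown formats relate, the sketch below renders one table from a layout result as a Markdown table. It assumes each table object carries `rowCount`, `columnCount`, and a `cells` list with `rowIndex`, `columnIndex`, and `content`, which is the shape the layout model's JSON is expected to have; the skill's own markdown renderer may work differently, and `table_to_markdown` is a hypothetical name.

```python
def table_to_markdown(table):
    """Render one table dict (assumed layout-result shape) as Markdown."""
    rows = [[""] * table["columnCount"] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        # Flatten newlines and escape pipes so cell text cannot break the table.
        text = cell.get("content", "").replace("\n", " ").replace("|", "\\|")
        rows[cell["rowIndex"]][cell["columnIndex"]] = text
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The first row is treated as the header here for simplicity; a fuller version could consult per-cell metadata if the response provides it.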

Features

  • Handwriting Recognition: Extracts handwritten text alongside printed text
  • CJK Support: Full support for Chinese, Japanese, Korean characters
  • Table Extraction: Preserves table structure (use layout model)
  • Multi-page Processing: Handles documents with multiple pages
  • Concurrent Processing: Batch script supports parallel processing
  • URL Input: Process documents directly from URLs

Environment Variables

| Variable | Required | Description |
|---|---|---|
| AZURE_DOC_INTEL_ENDPOINT | Yes | Azure Document Intelligence endpoint URL |
| AZURE_DOC_INTEL_KEY | Yes | API subscription key |

Error Handling

  • Invalid credentials: Check endpoint URL and API key
  • Unsupported format: Ensure file extension matches supported types
  • Timeout: Large documents may need longer processing (max 300s)
  • Rate limiting: Reduce concurrent workers for batch processing
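
Document analysis is asynchronous, so a client usually polls for the result until it finishes or a deadline passes. The sketch below shows such a loop with the 300-second limit mentioned above as the default. It assumes the status values "succeeded" and "failed" reported by the service; `poll_until_done` and its `get_status` callback are hypothetical and do not describe how the skill's scripts are written.

```python
import time


def poll_until_done(get_status, timeout=300.0, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call `get_status()` until it reports success, failure, or the deadline passes.

    `get_status` is any callable returning a dict with a "status" key, for
    example a function that fetches the operation URL and parses the JSON body.
    """
    deadline = clock() + timeout
    while True:
        result = get_status()
        status = result.get("status")
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {result.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        sleep(interval)  # wait before polling again to avoid hammering the service
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; a longer `interval` also reduces the chance of hitting rate limits during batch runs.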

Examples

Extract text from scanned PDF

python scripts/ocr_extract.py scanned_contract.pdf --model prebuilt-read

Process invoices with structured output

python scripts/ocr_extract.py invoice.pdf --model prebuilt-invoice --format json --output invoice_data.json

Batch process with layout analysis

python scripts/batch_ocr.py ./reports/ --model prebuilt-layout --format markdown --workers 4

Extract specific pages from large document

python scripts/ocr_extract.py large_doc.pdf --pages 1,3-5,10 --format text

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 12:54 Safe Safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
