LiteParse Document Parser

Use when parsing PDFs, DOCX, PPTX, XLSX, or images locally. Supports text extraction, JSON output with bounding boxes, batch processing, and page screenshots.
Author: ricanwarfare
Category: Uncategorized · Source: clawhub · Version: v1.0.0 (latest, 1 version) · API key: not required
Stars: 0 · Downloads: 412 · Installs: 0

Overview

LiteParse

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Installation

Install via Homebrew (skip this step if it is already installed):

brew install llamaindex-liteparse

Verify:

lit --version

Supported Formats

| Category | Formats |
| --- | --- |
| PDF | .pdf |
| Word | .doc, .docx, .docm, .odt, .rtf |
| PowerPoint | .ppt, .pptx, .pptm, .odp |
| Spreadsheets | .xls, .xlsx, .xlsm, .ods, .csv, .tsv |
| Images | .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg |

Dependencies:

  • Office documents → LibreOffice (brew install --cask libreoffice)
  • Images → ImageMagick (brew install imagemagick)
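
The format table and dependency list above can be combined into a quick pre-flight check. The following is a minimal Python sketch, not part of LiteParse: `required_tool` and `is_available` are hypothetical helper names, it assumes every Word, PowerPoint, and spreadsheet extension counts as an "Office document", and the executable names `soffice` and `magick` are assumptions about a typical Homebrew install.

```python
import shutil
from pathlib import Path

# Extension groups copied from the Supported Formats table.
# Assumption: all Word, PowerPoint, and spreadsheet extensions need LibreOffice.
OFFICE_EXTS = {
    ".doc", ".docx", ".docm", ".odt", ".rtf",
    ".ppt", ".pptx", ".pptm", ".odp",
    ".xls", ".xlsx", ".xlsm", ".ods", ".csv", ".tsv",
}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".svg"}

# Assumed executable names; adjust them to match your installation.
TOOL_BINARIES = {"LibreOffice": "soffice", "ImageMagick": "magick"}


def required_tool(path):
    """Return the external tool a file needs, or None for PDFs and unlisted types."""
    ext = Path(path).suffix.lower()
    if ext in OFFICE_EXTS:
        return "LibreOffice"
    if ext in IMAGE_EXTS:
        return "ImageMagick"
    return None


def is_available(tool):
    """Return True if the tool's assumed executable is found on PATH."""
    return shutil.which(TOOL_BINARIES[tool]) is not None
```

For example, `required_tool("slides.pptx")` returns `"LibreOffice"`, and `is_available("LibreOffice")` reports whether that dependency can be found before a parse is attempted.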

Usage

Parse a Single File

# Basic text extraction
lit parse document.pdf

# JSON output with bounding boxes
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster, text-only PDFs)
lit parse document.pdf --no-ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
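
When these commands are driven from a script, it can help to assemble the argument list in one place. The sketch below is a hypothetical Python wrapper that uses only the flags shown above. It assumes the flags can be combined freely, that `lit` prints the parsed text to standard output when `-o` is omitted, and that it exits with a non-zero status on failure; none of this is confirmed by the text above.

```python
import subprocess


def build_parse_command(path, fmt=None, output=None, pages=None, ocr=True, dpi=None):
    """Assemble a `lit parse` argument list from the flags shown above."""
    cmd = ["lit", "parse", str(path)]
    if fmt is not None:
        cmd += ["--format", fmt]
    if output is not None:
        cmd += ["-o", str(output)]
    if pages is not None:
        cmd += ["--target-pages", pages]
    if not ocr:
        cmd.append("--no-ocr")
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]
    return cmd


def run_parse(path, **options):
    """Run the command and return its stdout (assumed to hold the parsed text).

    Raises subprocess.CalledProcessError if `lit` exits with a non-zero status.
    """
    result = subprocess.run(
        build_parse_command(path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.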

Batch Parse a Directory

lit batch-parse ./input-directory ./output-directory

# Only PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
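
Before launching a large batch, it can be useful to preview which files a given filter would select. The Python sketch below is a hypothetical helper, not LiteParse code. It assumes `--extension` matches on the file suffix (case-insensitively) and that `--recursive` descends into every subdirectory; the tool's actual matching rules may differ.

```python
from pathlib import Path


def preview_batch_inputs(input_dir, extension=None, recursive=False):
    """List the files a batch run would likely pick up, sorted by path.

    extension: suffix such as ".pdf", or None to keep every file.
    recursive: include files in subdirectories when True.
    """
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = [p for p in candidates if p.is_file()]
    if extension is not None:
        wanted = extension.lower()
        files = [p for p in files if p.suffix.lower() == wanted]
    return sorted(files)
```

Calling `preview_batch_inputs("./input", extension=".pdf", recursive=True)` gives a rough idea of the workload for the second command above.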

Generate Page Screenshots

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

Key Options

| Option | Description |
| --- | --- |
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `--target-pages "1-5,10"` | Parse specific pages |
| `--dpi 300` | Higher rendering quality |
| `--no-ocr` | Disable OCR (faster for text PDFs) |
| `--ocr-language fra` | Set OCR language |
| `-o output.json` | Save to file |
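
A `--target-pages` value such as "1-5,10" can be checked before it is passed to the CLI. The following Python sketch expands a spec into page numbers. It assumes pages are 1-based and that ranges are inclusive, which matches the examples in this document but is not stated explicitly; `expand_target_pages` is a hypothetical helper name.

```python
def expand_target_pages(spec):
    """Expand a spec like "1-5,10,15-20" into a sorted list of page numbers.

    Assumes 1-based pages and inclusive ranges.
    Raises ValueError for empty entries, non-numeric text, or reversed ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```

Catching a malformed spec in the calling script gives a clearer error than a failed parse run.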

Config File

For repeated use, create liteparse.config.json:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true
}

Use with:

lit parse document.pdf --config liteparse.config.json
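
If several projects need slightly different settings, the config file can be generated rather than edited by hand. This Python sketch writes the same keys shown in the example above and rejects any key not listed there. `write_config` is a hypothetical helper, and the key set is taken from this example only; LiteParse may accept other keys.

```python
import json
from pathlib import Path

# Keys and values copied from the example liteparse.config.json above.
DEFAULT_CONFIG = {
    "ocrLanguage": "en",
    "ocrEnabled": True,
    "maxPages": 1000,
    "dpi": 150,
    "outputFormat": "json",
    "preciseBoundingBox": True,
}


def write_config(path, **overrides):
    """Write a config file, replacing selected defaults with overrides.

    Raises KeyError for keys that do not appear in the example config,
    which guards against typos such as "ocrlanguage".
    """
    unknown = set(overrides) - set(DEFAULT_CONFIG)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    config = {**DEFAULT_CONFIG, **overrides}
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `write_config("liteparse.config.json", dpi=300, ocrEnabled=False)` produces a file that can be passed to `--config` as shown above.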

When to Use

  • PDF text extraction — fast local parsing
  • Document conversion — Office docs to text/JSON
  • Screenshot generation — for LLM visual analysis
  • Batch processing — multiple files at once
  • Offline/air-gapped — no cloud required

Version History

1 version in total

  • v1.0.0 (current)
    2026-05-07 07:25, security status: safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
