
LiteParse

Parse unstructured documents (PDF, DOCX, PPTX, XLSX, images, and more) locally with LiteParse: fast, lightweight, no cloud dependencies or LLM required.

Installation

Install via Homebrew (skip this step if LiteParse is already installed):

brew install llamaindex-liteparse

Verify:

lit --version

Supported Formats

| Category | Formats |
|----------|---------|
| PDF | `.pdf` |
| Word | `.doc`, `.docx`, `.docm`, `.odt`, `.rtf` |
| PowerPoint | `.ppt`, `.pptx`, `.pptm`, `.odp` |
| Spreadsheets | `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv` |
| Images | `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg` |

Dependencies:

Office documents → LibreOffice (brew install --cask libreoffice)
Images → ImageMagick (brew install imagemagick)
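
Before parsing Office files or images, it can help to confirm that the converters are actually on the PATH. The sketch below is one way to do that. The binary names `soffice` (LibreOffice) and `magick` (ImageMagick 7) are assumptions about a typical install, and `missing_tools` is a hypothetical helper, not part of LiteParse.

```shell
# Print each command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed binary names: lit (LiteParse), soffice (LibreOffice), magick (ImageMagick 7).
missing_tools lit soffice magick
```

No output means every listed tool was found; any name printed still needs to be installed or added to the PATH.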

Usage

Parse a Single File

# Basic text extraction
lit parse document.pdf

# JSON output with bounding boxes
lit parse document.pdf --format json -o output.json

# Specific page range
lit parse document.pdf --target-pages "1-5,10,15-20"

# Disable OCR (faster; for PDFs that already contain a text layer)
lit parse document.pdf --no-ocr

# Higher DPI for better quality
lit parse document.pdf --dpi 300
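
The `--target-pages` value mixes single pages and ranges. As an illustration only, the hypothetical helper `expand_pages` below lists the pages such a spec would select, assuming ranges are inclusive and ascending. LiteParse parses this string itself, so the helper is just a way to sanity-check a spec before a long run.

```shell
# Expand a spec like "1-3,7" into one page number per line.
# Assumes inclusive, ascending ranges; this is not LiteParse's own parser.
expand_pages() {
  spec=$1
  old_ifs=$IFS
  IFS=','
  set -- $spec        # split the spec on commas into positional parameters
  IFS=$old_ifs
  for part in "$@"; do
    case $part in
      *-*) seq "${part%-*}" "${part#*-}" ;;   # range: start-end
      *)   printf '%s\n' "$part" ;;           # single page
    esac
  done
}

# Usage: expand_pages "1-5,10,15-20" | wc -l
```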

Batch Parse a Directory

lit batch-parse ./input-directory ./output-directory

# Only PDFs, recursively
lit batch-parse ./input ./output --extension .pdf --recursive
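
When each file needs its own output name or error handling, a plain loop around `lit parse` is an alternative to `batch-parse`. This sketch uses only the `--format json` and `-o` options shown in the single-file examples. The function name `parse_each` and the one-JSON-per-PDF layout are illustrative choices, not LiteParse behavior.

```shell
# Parse every PDF directly inside $1 into $2/<name>.json, continuing past failures.
parse_each() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for pdf in "$in_dir"/*.pdf; do
    [ -e "$pdf" ] || continue              # no matches: the glob stays literal
    name=$(basename "$pdf" .pdf)
    lit parse "$pdf" --format json -o "$out_dir/$name.json" ||
      printf 'failed: %s\n' "$pdf" >&2     # report the failure and keep going
  done
}

# Usage: parse_each ./input ./output
```

Unlike the recursive `batch-parse` example, this loop only looks at the top level of the input directory.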

Generate Page Screenshots

# All pages
lit screenshot document.pdf -o ./screenshots

# Specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# High-DPI PNG
lit screenshot document.pdf --dpi 300 --format png -o ./screenshots

Key Options

| Option | Description |
|--------|-------------|
| `--format json` | Structured JSON with bounding boxes |
| `--format text` | Plain text (default) |
| `--target-pages "1-5,10"` | Parse specific pages |
| `--dpi 300` | Higher rendering quality |
| `--no-ocr` | Disable OCR (faster for text PDFs) |
| `--ocr-language fra` | Set OCR language |
| `-o output.json` | Save to file |

Config File

For repeated use, create liteparse.config.json:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true
}

Use with:

lit parse document.pdf --config liteparse.config.json
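
A small wrapper can pass the config file only when it is present, so the same command works in directories with and without one. `lit_parse` is a hypothetical helper name; it relies only on the `--config` flag shown above.

```shell
# Run `lit parse` with liteparse.config.json when it exists in the current
# directory, and without --config otherwise.
lit_parse() {
  config=liteparse.config.json
  if [ -f "$config" ]; then
    lit parse "$1" --config "$config"
  else
    lit parse "$1"
  fi
}

# Usage: lit_parse document.pdf
```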

When to Use

PDF text extraction — fast local parsing
Document conversion — Office docs to text/JSON
Screenshot generation — for LLM visual analysis
Batch processing — multiple files at once
Offline/air-gapped — no cloud required

Version History

1 version in total

v1.0.0 (current)

2026-05-07 07:25
