Local RAG

Semantic search over local files using all-MiniLM-L6-v2 embeddings and ms-marco-MiniLM-L-6-v2 cross-encoder reranking, with ChromaDB and parent-child chunking.
Author: lookupmark | Source: clawhub | Version: v1.9.1 (latest) | API key: not required

Semantic search over indexed local files with parent-child chunking for precise retrieval with full context.

Architecture

| Component  | Model                                            | Size   |
|------------|--------------------------------------------------|--------|
| Embeddings | sentence-transformers/all-MiniLM-L6-v2           | ~80MB  |
| Reranker   | cross-encoder/ms-marco-MiniLM-L-6-v2             | ~80MB  |
| Vector DB  | ChromaDB (persistent, cosine similarity, HNSW)   | varies |
| Chunking   | Parent-child                                     |        |

Memory strategy: the embedding model is loaded first and freed with gc.collect(); the reranker is then loaded and freed again after scoring. Because the two models are never resident at the same time, peak RAM stays at about 400MB on ARM.
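The load, use, free pattern described above can be sketched as follows. This is a minimal illustration, not the skill's actual code: `run_stage` and its two callable parameters are hypothetical names, and the model loaders are injected so the pattern itself stays independent of any library.

```python
import gc
from typing import Any, Callable


def run_stage(load_model: Callable[[], Any], use: Callable[[Any], Any]) -> Any:
    """Load a model, run one stage with it, then release it before returning.

    Hypothetical helper: only one model is referenced at a time, so the
    next stage can load its own model without both being resident.
    """
    model = load_model()
    try:
        return use(model)
    finally:
        # Drop the only reference, then ask the collector to reclaim memory.
        del model
        gc.collect()
```

With sentence-transformers, the first stage could plausibly pass `lambda: SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")` and call `model.encode(texts)`, and the second could pass `lambda: CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` and call `model.predict(pairs)`. Whether the scripts are structured this way is an assumption.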

Chunking Strategy

  • Child chunks: 128 words, 24 overlap → embedded for semantic search
  • Parent chunks: 768 words → stored as full context, returned to user
  • When a child matches → its parent is returned, giving surrounding context
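A rough sketch of the chunking scheme above, using the stated sizes (768-word parents, 128-word children with a 24-word overlap). The function names are hypothetical, and cutting children inside each parent, rather than across parent boundaries, is an assumption the source does not confirm.

```python
def split_words(words, size, overlap=0):
    """Split a word list into windows of `size` words that overlap by `overlap`."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the list
    return chunks


def parent_child_chunks(text, parent_size=768, child_size=128, child_overlap=24):
    """Return one record per child chunk, each carrying its parent's full text."""
    records = []
    for parent_id, parent in enumerate(split_words(text.split(), parent_size)):
        parent_text = " ".join(parent)
        for child in split_words(parent, child_size, child_overlap):
            records.append({
                "parent_id": parent_id,
                "child": " ".join(child),   # this text would be embedded
                "parent": parent_text,      # this text would be returned
            })
    return records
```

Only the `child` text needs to go through the embedding model; the `parent` text can be stored alongside it (for example as metadata) and returned when the child matches.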

Running

All scripts must use the venv Python:

VENV=~/.local/share/local-rag/venv/bin/python

Indexing

# Incremental index (default — skips unchanged files via SHA-256 hash)
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/index.py

# Re-index from scratch
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/index.py --reindex

# Custom paths
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/index.py --paths ~/Documenti ~/Progetti

# Batch indexing (per-subfolder with git checkpoints, for low-RAM systems)
bash ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/index-batch.sh
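
The incremental mode skips files whose SHA-256 hash has not changed. One way such a check could work is sketched below; the JSON manifest and the names `sha256_of` and `files_to_index` are illustrative assumptions, not the layout index.py actually uses.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files are never fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_index(paths, manifest_path):
    """Return the paths that are new or changed since the last recorded run."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    changed = []
    for path in paths:
        digest = sha256_of(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    # Simplification: hashes are recorded before indexing is known to succeed.
    manifest_path.write_text(json.dumps(manifest))
    return changed
```

A more careful version would update a file's recorded hash only after that file has been indexed successfully.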

Querying

# Basic query
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/query.py "what are the termination clauses?"

# More results
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/query.py "Falcon LLM" --top-k 30 --top-n 5

# JSON output for programmatic use
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/query.py "transformer architecture" --json

# With timeout
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/query.py "deep learning" --timeout 60
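
For programmatic use, the same command can be driven from Python. The sketch below uses only the paths and flags shown in this document; the helper names are hypothetical, and it assumes that with --json the script prints nothing but a JSON document to stdout. The shape of that JSON is not documented here, so the result is returned unparsed beyond `json.loads`.

```python
import json
import os
import subprocess

VENV_PYTHON = os.path.expanduser("~/.local/share/local-rag/venv/bin/python")
QUERY_SCRIPT = os.path.expanduser(
    "~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/query.py"
)


def build_query_command(question, top_k=20, top_n=3, timeout=120):
    """Build the argument list for query.py using the documented flags."""
    return [
        VENV_PYTHON, QUERY_SCRIPT, question,
        "--top-k", str(top_k),
        "--top-n", str(top_n),
        "--timeout", str(timeout),
        "--json",
    ]


def run_query(question, **options):
    """Run query.py and parse its stdout as JSON (assumed to be pure JSON)."""
    done = subprocess.run(
        build_query_command(question, **options),
        capture_output=True, text=True, check=True,
    )
    return json.loads(done.stdout)
```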

Options:

  • --top-k N — Child candidates from vector search (default: 20)
  • --top-n N — Final parent results after reranking (default: 3)
  • --json — JSON output
  • --timeout N — Max seconds per query (default: 120)
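
The relationship between --top-k and --top-n can be illustrated with a small sketch of the second stage: child candidates from the vector search are rescored, then collapsed to their parents. The record layout and the `score_pair` callable are assumptions made for illustration; in the real pipeline the scores would come from the cross-encoder.

```python
def rerank_to_parents(query, candidates, score_pair, top_n=3):
    """Rescore child candidates and return the best `top_n` distinct parents.

    `candidates` is a list of dicts with "parent_id", "child", and "parent"
    keys (a hypothetical layout). `score_pair(query, text)` returns a
    relevance score where higher means more relevant.
    """
    scored = sorted(
        ((score_pair(query, c["child"]), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    results, seen = [], set()
    for score, cand in scored:
        if cand["parent_id"] in seen:
            continue  # keep only the best-scoring child for each parent
        seen.add(cand["parent_id"])
        results.append({
            "parent_id": cand["parent_id"],
            "score": score,
            "snippet": cand["child"],
            "context": cand["parent"],
        })
        if len(results) == top_n:
            break
    return results
```

Raising --top-k widens the pool the reranker sees, at the cost of more cross-encoder calls; --top-n only controls how many parents survive at the end.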

Monitoring

$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/monitor.py              # Status
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/monitor.py --watch      # Auto-refresh
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/monitor.py --log        # Logs
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/monitor.py --errors     # Errors only
$VENV ~/.openclaw/workspace/skills/lookupmark-local-rag/scripts/monitor.py --git        # Git checkpoints

Supported Formats

Documents only (no code files):

  • Text: .txt, .md, .csv, .json, .yaml, .yml, .toml, .tex, .bib
  • Documents: .pdf (pdfminer.six), .docx (python-docx), .pptx

Excluded: .py, .js, .sh, .ipynb, .html, .css and all code files.

Limits (for 4GB ARM)

  • PDF max size: 5MB (larger PDFs cause OOM with pdfminer)
  • Max file size: 30MB
  • Embedding batch size: 1 (conservative)
  • Excluded dirs: .git, .venv, node_modules, __pycache__, labs, exercises, src, scripts, ablation, test*, fixtures
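
Taken together, the format list and the limits above amount to a per-file filter. The sketch below is one possible reading of them: `should_index` is a hypothetical name, 1MB is taken as 1024 × 1024 bytes, and matching excluded directory names with shell-style patterns (for `test*`) is an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import Path

ALLOWED_EXTS = {
    ".txt", ".md", ".csv", ".json", ".yaml", ".yml", ".toml", ".tex", ".bib",
    ".pdf", ".docx", ".pptx",
}
EXCLUDED_DIRS = [
    ".git", ".venv", "node_modules", "__pycache__", "labs", "exercises",
    "src", "scripts", "ablation", "test*", "fixtures",
]
MAX_FILE_BYTES = 30 * 1024 * 1024  # general cap
MAX_PDF_BYTES = 5 * 1024 * 1024    # stricter cap for PDFs


def should_index(path: Path, size_bytes: int) -> bool:
    """Apply the extension, directory, and size rules to a single file."""
    ext = path.suffix.lower()
    if ext not in ALLOWED_EXTS:
        return False
    for part in path.parent.parts:
        if any(fnmatchcase(part, pattern) for pattern in EXCLUDED_DIRS):
            return False
    limit = MAX_PDF_BYTES if ext == ".pdf" else MAX_FILE_BYTES
    return size_bytes <= limit
```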

Storage

| Path                                      | Purpose                                |
|-------------------------------------------|----------------------------------------|
| ~/.local/share/local-rag/chromadb/        | ChromaDB data (git repo for rollback)  |
| ~/.local/share/local-rag/venv/            | Python venv with dependencies          |
| ~/.local/share/local-rag/index.lock       | Prevents concurrent indexing           |
| ~/.local/share/local-rag/index-batch.log  | Batch indexing log                     |
| ~/.local/share/local-rag/queries.log      | Query history log                      |
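
A lock file such as index.lock typically guards against two indexers running at once. How the skill implements it is not stated; the sketch below shows one common approach, exclusive file creation, with `index_lock` as a hypothetical name.

```python
import os
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path):
    """Hold an exclusive lock file for the duration of the `with` block.

    os.O_EXCL makes creation fail with FileExistsError if the file already
    exists, which is how a second indexer would notice the first one.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A lock of this kind can go stale if the process is killed before cleanup, so a real implementation would likely also check whether the recorded PID is still alive.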

Security

  • ALLOWED_ROOTS: only ~/Documenti/github/thesis, ~/Documenti/github/polito, ~/Documenti, and ~/Scaricati are indexed
  • BLOCKED_PATTERNS: paths matching .ssh, .gnupg, .env, credentials, tokens, or .config/openclaw are skipped
  • The credentials directory is blocklisted and never indexed
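
A path check combining the two lists above might look like the sketch below. The lists are copied from this section, but `is_path_allowed` is a hypothetical name, and treating BLOCKED_PATTERNS as plain substrings of the resolved path is an assumption. It needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path

ALLOWED_ROOTS = [
    Path(p).expanduser()
    for p in ("~/Documenti/github/thesis", "~/Documenti/github/polito",
              "~/Documenti", "~/Scaricati")
]
BLOCKED_PATTERNS = (".ssh", ".gnupg", ".env", "credentials", "tokens",
                    ".config/openclaw")


def is_path_allowed(path, allowed_roots=ALLOWED_ROOTS, blocked=BLOCKED_PATTERNS):
    """Return True only for paths under an allowed root with no blocked pattern."""
    # Resolve first so that ".." segments and symlinks cannot escape a root.
    resolved = Path(path).expanduser().resolve()
    if any(pattern in resolved.as_posix() for pattern in blocked):
        return False
    return any(resolved.is_relative_to(root.resolve()) for root in allowed_roots)
```

Checking the blocked patterns before the allowed roots means a blocked path is rejected even when it sits inside an allowed directory.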

Workflow

  1. Run index.py to build or update the index (incremental via SHA-256 hash check)
  2. Re-run index.py periodically to pick up new or changed files (a daily cron job is recommended)
  3. Use query.py to search with natural language
  4. Results include the file path, relevance score, matched snippet, and full parent context
  5. Check monitor.py for stats and queries.log for query history

Version History

1 version in total

  • v1.9.1 (current), 2026-05-03 05:42
