
Peer Review

Multi-model peer review layer that uses local LLMs via Ollama to catch errors in cloud model output. It fans critiques out to 2-3 local models, aggregates their flags, and synthesizes a consensus.

Use when: validating trade analyses, reviewing agent output quality, testing local model accuracy, or checking any high-stakes Claude output before publishing or acting on it.

Don't use when: simple fact-checking (just search the web), tasks that don't benefit from multi-model consensus, time-critical decisions where 60s of latency is unacceptable, or reviewing trivial or low-stakes content.

Negative examples:

  • "Check whether this date is correct" → not applicable; a web search is enough.
  • "Review my shopping list" → not applicable; not worth the multi-model inference cost.
  • "I need an answer within 5 seconds" → not applicable; peer review adds 30-60 seconds of latency.

Edge cases:

  • Short text (<50 words) → models may find nothing substantive; consider skipping the review.
  • Highly specialized domains → local models may lack domain knowledge; down-weight their flags.
  • Creative writing → a poor fit for factual review; use it only to check logical consistency.
staybased
Data Analysis · clawhub · v1.0.0 · 1 version · Key: not required

Overview

Peer Review — Local LLM Critique Layer

> Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with <50% false positive rate.


Architecture

Cloud Model (Claude) produces analysis
        │
        ▼
┌────────────────────────┐
│   Peer Review Fan-Out  │
├────────────────────────┤
│  Drift (Mistral 7B)   │──► Critique A
│  Pip (TinyLlama 1.1B) │──► Critique B
│  Lume (Llama 3.1 8B)  │──► Critique C
└────────────────────────┘
        │
        ▼
  Aggregator (consensus logic)
        │
        ▼
  Final: original + flagged issues
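
The fan-out stage in the diagram could be sketched in Python as follows. This is a minimal sketch, not the contents of scripts/peer-review.sh (which is not reproduced here). It assumes Ollama is listening on its default local address and uses the /api/generate endpoint with streaming disabled; the names REVIEWERS, query_ollama, and fan_out are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Bot names come from the diagram, model tags from the Dependencies section.
REVIEWERS = {
    "Drift": "mistral:7b",
    "Pip": "tinyllama:1.1b",
    "Lume": "llama3.1:8b",
}


def query_ollama(model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send one non-streaming generate request and return the model's text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def fan_out(prompt: str, query: Callable[[str, str], str] = query_ollama) -> Dict[str, str]:
    """Ask every reviewer in parallel; a failed reviewer yields an empty critique."""

    def ask(model: str) -> str:
        try:
            return query(model, prompt)
        except Exception:  # one unreachable model should not sink the whole review
            return ""

    with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
        futures = {bot: pool.submit(ask, model) for bot, model in REVIEWERS.items()}
        return {bot: future.result() for bot, future in futures.items()}
```

Passing the query function as a parameter keeps the fan-out testable without a running Ollama instance; treating a failed model as an empty critique is a design assumption, not documented behavior.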

Swarm Bot Roles

| Bot | Model | Role | Strengths |
| --- | --- | --- | --- |
| Drift 🌊 | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip 🐣 | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume 💡 | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |

Scripts

| Script | Purpose |
| --- | --- |
| scripts/peer-review.sh | Send single input to all models, collect critiques |
| scripts/peer-review-batch.sh | Run peer review across a corpus of samples |
| scripts/seed-test-corpus.sh | Generate seeded error corpus for testing |

Usage

# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]

Scripts live at workspace/scripts/. They are not bundled with the skill, to avoid duplication.


Critique Prompt Template

You are a skeptical reviewer. Analyze the following text for errors.

Output a single JSON object with an "issues" array. Each entry has the form:
{"category": "factual|logical|missing|overconfidence|hallucinated_source",
 "quote": "...", "issue": "...", "confidence": 0-100}

If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
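
Filling the template and reading a model's reply might look like the sketch below. It rests on two assumptions: the template is held in a Python string, and small local models sometimes wrap their JSON in prose, so the parser scans for JSON objects instead of parsing the whole reply. It accepts both bare issue objects and an {"issues": [...]} wrapper. The names build_prompt and parse_critique are hypothetical.

```python
import json
from typing import Dict, List


def build_prompt(template: str, cloud_output: str) -> str:
    """Substitute the text under review into the template.

    str.replace is used instead of str.format because the template
    contains literal JSON braces that format() would try to interpret.
    """
    return template.replace("{cloud_output}", cloud_output)


def parse_critique(raw: str) -> List[Dict]:
    """Collect issue objects from a reply that may mix prose and JSON."""
    decoder = json.JSONDecoder()
    issues: List[Dict] = []
    pos = 0
    while True:
        start = raw.find("{", pos)
        if start == -1:
            break
        try:
            obj, end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; keep scanning
            continue
        pos = end
        if isinstance(obj, dict):
            if isinstance(obj.get("issues"), list):
                issues.extend(i for i in obj["issues"] if isinstance(i, dict))
            elif "issue" in obj:
                issues.append(obj)
    return issues
```

A reply with no parseable JSON yields an empty list, which downstream code can treat the same as "no issues found".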

Error Categories

| Category | Description | Example |
| --- | --- | --- |
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |
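
Because local models do not always respect the schema, each parsed issue could be checked against the five categories before aggregation. The sketch below is one possible validator; dropping unknown categories and clamping confidence to 0-100 are assumptions rather than documented behavior, and normalize_issue is a hypothetical name.

```python
from typing import Dict, Optional

CATEGORIES = {"factual", "logical", "missing", "overconfidence", "hallucinated_source"}


def normalize_issue(issue: Dict) -> Optional[Dict]:
    """Return a cleaned issue, or None if it does not fit the schema."""
    category = str(issue.get("category", "")).strip().lower()
    if category not in CATEGORIES:
        return None  # assumption: unknown categories are discarded
    try:
        confidence = float(issue.get("confidence", 0))
    except (TypeError, ValueError):
        return None  # e.g. "high" instead of a number
    return {
        "category": category,
        "quote": str(issue.get("quote", "")).strip(),
        "issue": str(issue.get("issue", "")).strip(),
        "confidence": max(0.0, min(100.0, confidence)),  # clamp to 0-100
    }
```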

Discord Workflow

  1. Post analysis to #the-deep (or #swarm-lab)
  2. Drift, Pip, and Lume respond with independent critiques
  3. Celeste synthesizes: deduplicates flags, weights by model confidence
  4. If consensus (≥2 models agree) → flag is high-confidence
  5. Final output posted with recommendation: publish | revise | flag_for_human
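
Steps 3 to 5 could be approximated by an aggregator like the one below. It is a sketch under stated assumptions: two flags "agree" when their category and whitespace-normalized, lowercased quote match, and the mapping from flags to publish, revise, or flag_for_human is invented for illustration, since the source does not define it. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

CONSENSUS_THRESHOLD = 2  # from the workflow: at least two models must agree


def _key(issue: Dict) -> Tuple[str, str]:
    # Assumption: matching category plus normalized quote counts as agreement.
    return (issue["category"], " ".join(issue["quote"].lower().split()))


def aggregate(critiques: Dict[str, List[Dict]]) -> Dict:
    """Deduplicate flags across bots and derive a recommendation."""
    groups: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)
    for bot, issues in critiques.items():
        for issue in issues:
            key = _key(issue)
            # Keep each bot's highest confidence for a given flag.
            groups[key][bot] = max(groups[key].get(bot, 0), issue.get("confidence", 0))

    flags = []
    for (category, quote), votes in groups.items():
        flags.append({
            "category": category,
            "quote": quote,
            "models": sorted(votes),
            "mean_confidence": sum(votes.values()) / len(votes),
            "high_confidence": len(votes) >= CONSENSUS_THRESHOLD,
        })
    flags.sort(key=lambda f: (-len(f["models"]), -f["mean_confidence"]))

    # Assumed policy: no flags -> publish; any consensus flag -> revise;
    # only single-model flags -> hand off to a human.
    if not flags:
        recommendation = "publish"
    elif any(f["high_confidence"] for f in flags):
        recommendation = "revise"
    else:
        recommendation = "flag_for_human"
    return {"flags": flags, "recommendation": recommendation}
```

Exact quote matching is deliberately strict; models that paraphrase the same passage would not be counted as agreeing, so a fuzzier match may be needed in practice.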

Success Criteria

| Outcome | TPR | FPR | Decision |
| --- | --- | --- | --- |
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |
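
The table can be read as a decision function. The sketch below makes assumptions the table leaves open: rates are fractions in [0, 1], rows are checked top to bottom, "Fail" triggers when either threshold is crossed, and any combination the table does not cover falls through to "marginal". The name classify is hypothetical.

```python
DECISIONS = {
    "strong_pass": "Ship as default layer",
    "pass": "Ship as opt-in layer",
    "marginal": "Iterate on prompts, retest",
    "fail": "Abandon approach",
}


def classify(tpr: float, fpr: float) -> str:
    """Map a measured TPR/FPR pair to an outcome from the table."""
    if tpr >= 0.50 and fpr < 0.30:
        return "strong_pass"
    if tpr >= 0.30 and fpr < 0.50:
        return "pass"
    if tpr < 0.20 or fpr > 0.70:  # assumption: either condition is enough to fail
        return "fail"
    return "marginal"  # assumption: uncovered combinations are treated as marginal


# Example: the stated hypothesis (TPR >= 30%, FPR < 50%) corresponds to "pass".
print(classify(0.30, 0.49), "->", DECISIONS[classify(0.30, 0.49)])
```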

Scoring Rules

  • Flag = true positive if it identifies a real error (even if explanation is imperfect)
  • Flag = false positive if flagged content is actually correct
  • Duplicate flags across models count once for TPR but inform consensus metrics
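
These rules could be applied to a labeled run roughly as follows. The sketch assumes each flag has already been judged and carries a "matches" field naming the seeded error it identifies (or None for a false positive). TPR counts each seeded error once, as the rules state. The FPR denominator is not defined in the source; here it is assumed to be the total number of flags raised. The name score and the record layout are hypothetical.

```python
from typing import Dict, List, Set


def score(flags: List[Dict], seeded_errors: Set[str]) -> Dict[str, float]:
    """Compute TPR and FPR for one corpus run.

    flags: [{"model": "Drift", "matches": "e1"}, {"model": "Pip", "matches": None}, ...]
    seeded_errors: IDs of the errors planted in the corpus.
    """
    # Duplicate hits on the same seeded error collapse into one via the set.
    caught = {f["matches"] for f in flags if f["matches"] in seeded_errors}
    false_flags = [f for f in flags if f["matches"] not in seeded_errors]

    tpr = len(caught) / len(seeded_errors) if seeded_errors else 0.0
    # Assumption: FPR = share of raised flags that point at correct content.
    fpr = len(false_flags) / len(flags) if flags else 0.0
    return {"tpr": tpr, "fpr": fpr}
```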

Dependencies

  • Ollama running locally with models pulled: mistral:7b, tinyllama:1.1b, llama3.1:8b
  • jq and curl installed
  • Results stored in experiments/peer-review-results/
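
A preflight check for these dependencies might look like the sketch below. It uses Ollama's /api/tags endpoint, which lists locally pulled models, and shutil.which for the command-line tools. Matching on the exact tag string is an assumption (a model pulled under a different tag, such as mistral:latest, would be reported as missing), and the function names are hypothetical.

```python
import json
import shutil
import urllib.request
from typing import Dict, List

REQUIRED_MODELS = ["mistral:7b", "tinyllama:1.1b", "llama3.1:8b"]
REQUIRED_TOOLS = ["jq", "curl"]


def fetch_tags(base_url: str = "http://localhost:11434") -> Dict:
    """Ask the local Ollama server which models are pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.loads(response.read())


def missing_models(tags_payload: Dict, required: List[str] = REQUIRED_MODELS) -> List[str]:
    """Return required model tags absent from an /api/tags payload."""
    installed = {m.get("name") for m in tags_payload.get("models", [])}
    return [m for m in required if m not in installed]


def missing_tools(required: List[str] = REQUIRED_TOOLS) -> List[str]:
    """Return required executables that are not on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]
```

A caller could run missing_models(fetch_tags()) and missing_tools() before a batch run and abort early if either list is non-empty.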

Integration

When peer review passes validation:

  • Package as Reef API endpoint: POST /review
  • Agents call before publishing any analysis
  • Configurable: model selection, consensus threshold, categories
  • Log all reviews to #reef-logs with TPR tracking
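
The configurable parts of the planned endpoint could be modeled as below. Nothing about the Reef API's request format is given in the source, so every field name here ("text", "models", "consensus_threshold", "categories") is a hypothetical placeholder, as are ReviewConfig and parse_review_request. The sketch only shows how the three configurable options might be validated before a review runs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ALL_CATEGORIES = ["factual", "logical", "missing", "overconfidence", "hallucinated_source"]


@dataclass
class ReviewConfig:
    models: List[str] = field(
        default_factory=lambda: ["mistral:7b", "tinyllama:1.1b", "llama3.1:8b"]
    )
    consensus_threshold: int = 2
    categories: List[str] = field(default_factory=lambda: list(ALL_CATEGORIES))


def parse_review_request(body: Dict) -> Tuple[str, ReviewConfig]:
    """Validate a hypothetical POST /review body and return (text, config)."""
    text = body.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")

    config = ReviewConfig()
    if "models" in body:
        config.models = list(body["models"])
    if "consensus_threshold" in body:
        config.consensus_threshold = int(body["consensus_threshold"])
    if "categories" in body:
        config.categories = list(body["categories"])

    if not config.models:
        raise ValueError("at least one model is required")
    if not 1 <= config.consensus_threshold <= len(config.models):
        raise ValueError("consensus_threshold must be between 1 and the number of models")
    unknown = set(config.categories) - set(ALL_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return text, config
```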

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-29 04:22 · security scans: safe

