LLM as Judge

Cross-model verification for complex tasks. Spawn a judge subagent with a different model to review plans, code, architecture, or decisions before execution....
Author: ngmeyer
Source: clawhub, v1.2.0 (1 version). API key: not required.

Overview

LLM-as-Judge

Core principle: Same model = same blind spots. Different model = fresh perspective. Cross-model review catches ~85% of issues vs ~60% for self-reflection.

Activation Criteria

Use this pattern for:

  • Architecture or system design decisions
  • Multi-file changes affecting >5 files or >500 LOC
  • Security-critical code (auth, payments, crypto/DeFi)
  • Financial/trading systems (market making, quant strategies)
  • Planning documents that will drive weeks of work
  • Problems where you are stuck after 3+ failed attempts

Skip it for:

  • Simple edits, config tweaks, bug fixes with obvious cause
  • Documentation updates
  • Single-file changes under 100 LOC
  • Tasks where self-review is sufficient
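
The criteria above can be sketched as a small gate function. This is only an illustration: the `Task` fields and the `needs_judge` name are hypothetical, and the source does not say what to do in the gray zone between the two lists (for example 3 files and 300 LOC), so this sketch assumes self-review there.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor; field names are assumptions."""
    files_changed: int = 0
    loc_changed: int = 0
    is_architecture: bool = False
    is_security_critical: bool = False
    is_financial: bool = False
    is_long_horizon_plan: bool = False
    failed_attempts: int = 0
    docs_only: bool = False


def needs_judge(task: Task) -> bool:
    """Return True when the activation criteria call for cross-model review."""
    # High-stakes categories always go to the judge.
    if (task.is_architecture or task.is_security_critical
            or task.is_financial or task.is_long_horizon_plan):
        return True
    # Stuck after 3+ failed attempts: ask for a fresh perspective.
    if task.failed_attempts >= 3:
        return True
    # Documentation updates are on the skip list.
    if task.docs_only:
        return False
    # Large multi-file changes.
    if task.files_changed > 5 or task.loc_changed > 500:
        return True
    # Everything else (assumed): self-review is sufficient.
    return False
```

For example, `needs_judge(Task(files_changed=1, loc_changed=50))` returns `False`, while `needs_judge(Task(is_security_critical=True))` returns `True`.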

The Pattern

Executor (Model A) → Output → Judge (Model B) → Verdict → Action

Verdicts: APPROVE | REVISE (with specific feedback) | REJECT (restart)
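
A minimal sketch of this loop, assuming the executor and judge are plain callables wrapping two different models. `run_with_judge`, `Review`, and the `max_rounds` cap are hypothetical names and choices, not part of the skill itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(Enum):
    APPROVE = "APPROVE"
    REVISE = "REVISE"
    REJECT = "REJECT"


@dataclass
class Review:
    verdict: Verdict
    feedback: str = ""


def run_with_judge(
    task: str,
    executor: Callable[[str, str], str],   # (task, feedback) -> output, Model A
    judge: Callable[[str, str], Review],   # (task, output) -> review, Model B
    max_rounds: int = 3,                   # assumed cap, not from the source
) -> str:
    """Run Executor -> Output -> Judge -> Verdict -> Action until approved."""
    feedback = ""
    for _ in range(max_rounds):
        output = executor(task, feedback)
        review = judge(task, output)
        if review.verdict is Verdict.APPROVE:
            return output
        if review.verdict is Verdict.REVISE:
            # Carry the judge's specific feedback into the next attempt.
            feedback = review.feedback
        else:
            # REJECT: restart from scratch without the previous feedback.
            feedback = ""
    raise RuntimeError(f"no approved output after {max_rounds} rounds")
```

Capping the rounds keeps a disagreeing executor and judge from looping forever; what to do when the cap is hit (escalate to a human, switch judges) is left open here.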

Model Pairing

Use a different provider than the executor to avoid shared blind spots:

  • Executor: Claude → Judge: Kimi, Grok, or Gemini Pro
  • Executor: Kimi/Gemini → Judge: Claude Opus
  • Principle: Different provider, similar capability tier
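
One way to apply the pairing principle is shown below. It is a sketch under stated assumptions: the `Model` record and its numeric `tier` are hypothetical, since the source does not define how capability tiers are measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    provider: str
    tier: int  # hypothetical capability tier; higher means more capable


def pick_judge(executor: Model, candidates: list[Model]) -> Model:
    """Pick a judge from a different provider, closest in capability tier."""
    eligible = [m for m in candidates if m.provider != executor.provider]
    if not eligible:
        raise ValueError("no candidate from a different provider")
    # Among other providers, prefer the tier closest to the executor's.
    return min(eligible, key=lambda m: abs(m.tier - executor.tier))
```

Filtering on provider first and only then on tier mirrors the priority in the principle: a same-provider judge is excluded even when its tier matches exactly.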

Judge Prompt Templates

Plan/Architecture Review

See references/judge-prompts.md for full templates covering:

  • Plan completeness, feasibility, risk, testing strategy
  • Architecture review with scoring (0-10 per dimension)
  • Code review checklist (correctness, design, safety, maintainability)
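
The actual templates live in references/judge-prompts.md and are not reproduced here. As a rough illustration of what a scored review could look like once parsed, the sketch below checks a judge response against the code-review dimensions and the 0-10 scale named above. The `ScoredReview` shape and `validate_review` helper are assumptions, not the skill's format.

```python
from dataclasses import dataclass, field

# Dimensions taken from the code review checklist above.
DIMENSIONS = ("correctness", "design", "safety", "maintainability")
VERDICTS = {"APPROVE", "REVISE", "REJECT"}


@dataclass
class ScoredReview:
    """Hypothetical parsed judge output."""
    verdict: str
    scores: dict[str, int]
    action_items: list[str] = field(default_factory=list)


def validate_review(review: ScoredReview) -> list[str]:
    """Return a list of problems; an empty list means the review is usable."""
    problems = []
    if review.verdict not in VERDICTS:
        problems.append(f"unknown verdict: {review.verdict}")
    for dim in DIMENSIONS:
        score = review.scores.get(dim)
        if score is None:
            problems.append(f"missing score: {dim}")
        elif not 0 <= score <= 10:
            problems.append(f"score out of range: {dim}")
    # Specific items are required even when the verdict is APPROVE.
    if not review.action_items:
        problems.append("no specific action items")
    return problems
```

A review that fails validation can be sent back to the judge with the problem list rather than being acted on.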

Integration Points

  • With adversarial review: This IS the formalized version of "spawn a separate model to review"
  • With planning-protocol: Judge reviews the plan before the Execute phase
  • With coding workflows: Code → cross-model review → fix findings → test → build → push

Quick Decision

Simple task?           → Self-review
Complex / high stakes? → LLM-as-Judge
Stuck after retries?   → LLM-as-Judge (fresh perspective)
Financial/security?    → LLM-as-Judge (mandatory)

Gotchas

  • Same provider defeats the purpose — Claude Opus judging Claude Sonnet shares the same training distribution. Use a different provider (Grok judging Claude, Gemini judging GPT, etc.).
  • Vague judge output is useless — If the judge says "looks good" without specifics, the prompt is too weak. Always require the judge to produce scored dimensions + specific actionable items, even if approving.
  • Judge scope creep — Judges sometimes rewrite the entire plan instead of reviewing it. Constrain the verdict to APPROVE / REVISE / REJECT with specific feedback, not a replacement solution.
  • Approval rate drift — If the judge approves >80% of submissions, the model pairing is too similar or the prompts are too lenient. Target 60-70% approval rate.
  • Don't judge trivial tasks — A 50-line CSS fix doesn't need cross-model review. Use the activation criteria in this skill strictly.
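
The approval-rate check can be sketched as follows, assuming verdicts are logged as plain strings. The function names and status labels are hypothetical; the 80% ceiling and the 60-70% target come from the list above.

```python
def approval_rate(verdicts: list[str]) -> float:
    """Fraction of logged verdicts that are APPROVE."""
    if not verdicts:
        raise ValueError("no verdicts recorded")
    return sum(v == "APPROVE" for v in verdicts) / len(verdicts)


def drift_status(
    verdicts: list[str],
    max_rate: float = 0.80,                    # above this, the judge is too lenient
    target: tuple[float, float] = (0.60, 0.70),
) -> str:
    """Classify the judge's approval rate against the stated thresholds."""
    rate = approval_rate(verdicts)
    if rate > max_rate:
        return "too lenient"
    if target[0] <= rate <= target[1]:
        return "on target"
    return "outside target"
```

A "too lenient" result suggests revisiting the model pairing or tightening the judge prompt. Rates computed from a handful of verdicts are noisy, so this is better run over a longer log.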

Version History

1 version total

  • v1.2.0 (current)
    2026-03-30 15:16, security scan: safe
