← 返回
未分类

Llm Judge

Use when comparing two or more code implementations against a spec or requirements doc. Triggers on "which repo is better", "compare these implementations",...
用于将两个或多个代码实现与规格或需求文档进行比较。触发条件包括“哪个仓库更好”“比较这些实现”等。
anderskev anderskev 来源
未分类 clawhub v1.0.4 3 版本 99841 Key: 无需
★ 0
Stars
📥 628
下载
💾 1
安装
3
版本
#latest

概述

LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

Usage

llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]

Arguments

ArgumentRequiredDescription
---------------------------------
specYesPath to spec/requirements document
reposYes2+ paths to repositories to compare
--labelsNoComma-separated labels (default: directory names)
--weightsNoOverride weights, e.g. functionality:40,security:30
--branchNoBranch to compare against main (default: main)

Workflow

  1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
  2. Validate the spec file, each repo path, and the minimum repo count.
  3. Read the spec document into memory.
  4. Load this skill and the supporting reference files.
  5. Gather facts per repository (one Phase 1 unit per repo) — facts only, no scoring.
  6. Validate the repo-agent JSON results before proceeding.
  7. Score each dimension (one Phase 2 unit per dimension).
  8. Aggregate scores, compute weighted totals, rank repos, and write the report.
  9. Display the markdown summary and verify the JSON report.

Hard gates

Sequenced workflow: do not start the next phase until the current gate passes. Each pass condition must be checkable (file on disk, non-empty content, or json.load succeeds)—not “I reviewed internally.”

GatePass conditionUnblocks
--------------------------------
A — Inputsspec_path is a readable file and non-empty; len(repo_paths) ≥ 2; each path contains .git.Phase 1 repo agents
B — Phase 1 factsFor each repo agent output: stdin/stdout parses as JSON; required keys/shape match references/fact-schema.md.Phase 2 judge agents
C — Phase 2 scoresFive judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for every repo label.Aggregation
D — Report file.beagle/llm-judge-report.json exists; python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" exits 0.Markdown summary to the user
E — ConsistencySummary table and verdict use the same labels, weights, and per-dimension scores as the JSON report.Mark task complete

Parallelism is allowed within a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.

Command Workflow

Step 1: Parse Arguments

Parse $ARGUMENTS to extract:

  • spec_path: first positional argument
  • repo_paths: remaining positional arguments (must be 2+)
  • labels: from --labels or derived from directory names
  • weights: from --weights or defaults
  • branch: from --branch or main

Default Weights:

{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}

Step 2: Validate Inputs

[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }

for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done

[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }

Step 3: Read Spec Document

SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -z "$SPEC_CONTENT" ] && { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }

Step 4: Load the Skill

Load this llm-judge skill and its reference files into context.

Step 5: Phase 1 - Gather Facts Per Repo

If the agent supports subagents, dispatch one Phase 1 repo agent per repository in parallel; otherwise run the same fact-gathering steps sequentially, one repo at a time — the output is identical either way. Give each unit this brief:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $LABEL at $REPO_PATH

**Spec Document:**
$SPEC_CONTENT

**Instructions:**
1. Load the **llm-judge** skill's references/repo-agent.md for detailed instructions
2. Follow references/fact-schema.md for the output format
3. Load the **llm-artifacts-detection** skill ([../../../beagle-core/skills/llm-artifacts-detection/SKILL.md](../../../beagle-core/skills/llm-artifacts-detection/SKILL.md), if available) for dead-code/overengineering analysis

Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.

Do NOT score or judge. Only gather facts.

Collect all repo outputs into ALL_FACTS.

Step 6: Validate Phase 1 Results

echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }

Step 7: Phase 2 - Score Per Dimension

If the agent supports subagents, dispatch one judge agent per dimension (five total) in parallel; otherwise score each dimension sequentially — identical output. Give each unit this brief:

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:**
1. Load the **llm-judge** skill's references/judge-agents.md for detailed instructions
2. Follow references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.

Step 8: Aggregate Scores

for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]

    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)

ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)

Step 9: Generate Verdict

Name the winner, explain why they won, and note any close calls or trade-offs.

Step 10: Write JSON Report

mkdir -p .beagle

Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.

Step 11: Display Summary

Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.

Step 12: Verification

python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"

Output Shape

The generated report should include:

  • repo labels and paths
  • per-dimension scores and justifications
  • weighted totals and ranking
  • a verdict explaining the winner

Reference Files

FilePurpose
---------------
references/fact-schema.mdJSON schema for Phase 1 facts
references/scoring-rubrics.mdDetailed rubrics for each dimension
references/repo-agent.mdInstructions for Phase 1 agents
references/judge-agents.mdInstructions for Phase 2 judges

Scoring Model

DimensionDefault WeightEvaluates
--------------------------------------
Functionality30%Spec compliance, test pass rate
Security25%Vulnerabilities, security patterns
Test Quality20%Coverage, DRY, mock boundaries
Overengineering15%Unnecessary complexity
Dead Code10%Unused code, TODOs

Scoring Scale

ScoreMeaning
----------------
5Excellent - Exceeds expectations
4Good - Meets requirements, minor issues
3Average - Functional but notable gaps
2Below Average - Significant issues
1Poor - Fails basic requirements

Phase 1: Gathering Facts Per Repo

For each repository (in parallel via subagents if supported, otherwise sequentially), run a fact-gathering unit with:

You are a Phase 1 Repo Agent for the LLM Judge evaluation.

**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT

**Instructions:** Follow the **llm-judge** skill's references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load the **llm-artifacts-detection** skill ([../../../beagle-core/skills/llm-artifacts-detection/SKILL.md](../../../beagle-core/skills/llm-artifacts-detection/SKILL.md), if available) for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.

Collect all repo-agent outputs into ALL_FACTS.

Phase 2: Scoring Per Dimension

After all Phase 1 facts are collected, score the five dimensions (in parallel via subagents if supported, otherwise sequentially), one unit per dimension:

You are the $DIMENSION Judge for the LLM Judge evaluation.

**Spec Document:**
$SPEC_CONTENT

**Facts from all repos:**
$ALL_FACTS_JSON

**Instructions:** Follow the **llm-judge** skill's references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.

Aggregation

  1. Collect the five judge outputs.
  2. Compute each repo's weighted total with the configured weights.
  3. Rank repos by weighted total in descending order.
  4. Generate a verdict that explains the result and any close calls.
  5. Write .beagle/llm-judge-report.json.

Output

Display a markdown summary with scores, ranking, verdict, and detailed justifications.

Verification

Before completing (maps to Hard gates D and E):

  1. Gate D: .beagle/llm-judge-report.json exists and json.load succeeds.
  2. Gate E / completeness: Every repo label has scores for every dimension; each weighted_total equals the sum over dimensions of (score × weight / 100) using the configured weights; markdown summary matches the JSON report.

Rules

  • Always validate inputs before proceeding
  • Complete all Phase 1 fact-gathering before any Phase 2 scoring (parallel within a phase if subagents are supported, otherwise sequential)
  • Run one Phase 2 unit per dimension
  • Every score must have a justification
  • Write the JSON report before displaying the summary

版本历史

共 3 个版本

  • v1.0.4 当前
    2026-06-01 20:46 安全 安全
  • v1.0.3
    2026-05-03 04:44 安全 安全
  • v1.0.0
    2026-03-30 15:16 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

dev-programming

CodeConductor.ai

larsonreever
AI驱动平台,提供快速全栈开发、智能体、工作流自动化及低代码AI集成的可扩展产品创建。
★ 72 📥 181,907
dev-programming

Github

steipete
使用 `gh` CLI 与 GitHub 交互,通过 `gh issue`、`gh pr`、`gh run` 和 `gh api` 管理议题、PR、CI 运行及高级查询。
★ 678 📥 327,606
ai-agent

Deepagents Implementation

anderskev
使用 Deep Agents 实现代理,适用于创建代理、配置后端、定义子代理、添加中间件或设置...
★ 0 📥 689