Compare code implementations across multiple repositories using structured evaluation.
llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |
Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.

Sequenced workflow: do not start the next phase until the current gate passes. Each pass condition must be checkable (a file on disk, non-empty content, or json.load succeeding), not “I reviewed internally.”
| Gate | Pass condition | Unblocks |
|---|---|---|
| A — Inputs | spec_path is a readable file and non-empty; len(repo_paths) ≥ 2; each path contains .git. | Phase 1 repo agents |
| B — Phase 1 facts | For each repo agent output: stdin/stdout parses as JSON; required keys/shape match references/fact-schema.md. | Phase 2 judge agents |
| C — Phase 2 scores | Five judge outputs (one per dimension) each parse as JSON; each includes a score (and justification) for every repo label. | Aggregation |
| D — Report file | .beagle/llm-judge-report.json exists; python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" exits 0. | Markdown summary to the user |
| E — Consistency | Summary table and verdict use the same labels, weights, and per-dimension scores as the JSON report. | Mark task complete |
Parallelism is allowed within a phase (all Phase 1 tasks together; all Phase 2 tasks together), but Phase 2 must not start until Gate B passes, and the user-visible summary must not precede Gate D.
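The gate ordering above can be sketched as a small runner that evaluates checks in sequence and stops at the first failure. This is only an illustration, not part of the skill: `run_gates` and the example checks are hypothetical, and the real pass conditions are the ones listed in the gate table.

```python
from typing import Callable, List, Tuple

# A gate is a name plus a check that returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order; return the names that passed before the first failure.

    Hypothetical helper: each check should test an observable condition
    (file exists, content non-empty, JSON parses), as the gate table requires.
    """
    passed = []
    for name, check in gates:
        if not check():
            break  # later phases stay blocked until this gate passes
        passed.append(name)
    return passed


if __name__ == "__main__":
    repo_paths = ["repo-a", "repo-b"]  # illustrative values only
    gates = [
        ("A", lambda: len(repo_paths) >= 2),
        ("B", lambda: False),  # pretend a Phase 1 output failed to parse
        ("C", lambda: True),   # never evaluated, because B blocks it
    ]
    print(run_gates(gates))
```

Stopping at the first failure, rather than collecting every result, mirrors the rule that a later phase must not start while an earlier gate is open.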
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main

Default Weights:
{
"functionality": 30,
"security": 25,
"tests": 20,
"overengineering": 15,
"dead_code": 10
}
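The parsing rules and default weights above could be implemented roughly as follows. This is a minimal sketch under assumptions: `parse_arguments` and `parse_weights` are hypothetical names, and the source does not say whether overridden weights must still sum to 100, so the sketch merges overrides into the defaults without renormalizing.

```python
import os
from typing import Dict, List

DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(spec: str) -> Dict[str, int]:
    """Merge overrides like 'functionality:40,security:30' into the defaults."""
    weights = dict(DEFAULT_WEIGHTS)
    for item in filter(None, spec.split(",")):
        name, _, value = item.partition(":")
        name = name.strip()
        if name not in DEFAULT_WEIGHTS:
            raise ValueError(f"Unknown dimension: {name}")
        weights[name] = int(value)
    return weights


def parse_arguments(argv: List[str]) -> dict:
    """Split argv into positionals (spec, then repos) and --key=value options."""
    options = {}
    positionals = []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            options[key] = value
        else:
            positionals.append(arg)
    if len(positionals) < 3:
        raise ValueError("Need a spec and at least 2 repositories")
    repo_paths = positionals[1:]
    if "labels" in options:
        labels = options["labels"].split(",")
    else:
        # Default labels are the repository directory names.
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    if len(labels) != len(repo_paths):
        raise ValueError("Need exactly one label per repository")
    return {
        "spec_path": positionals[0],
        "repo_paths": repo_paths,
        "labels": labels,
        "weights": parse_weights(options.get("weights", "")),
        "branch": options.get("branch", "main"),
    }
```

Rejecting unknown dimension names early is a design choice of the sketch; it keeps a typo in --weights from silently leaving a default in place.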
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
for repo in "${REPO_PATHS[@]}"; do
  # -e rather than -d: in linked worktrees and submodules, .git is a file
  [ -e "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
[ ${#REPO_PATHS[@]} -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
Load this llm-judge skill and its reference files into context.
If the agent supports subagents, dispatch one Phase 1 repo agent per repository in parallel; otherwise run the same fact-gathering steps sequentially, one repo at a time — the output is identical either way. Give each unit this brief:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:**
1. Load the **llm-judge** skill's references/repo-agent.md for detailed instructions
2. Follow references/fact-schema.md for the output format
3. Load the **llm-artifacts-detection** skill ([../../../beagle-core/skills/llm-artifacts-detection/SKILL.md](../../../beagle-core/skills/llm-artifacts-detection/SKILL.md), if available) for dead-code/overengineering analysis
Explore the repository and gather facts. Return ONLY valid JSON following the fact schema.
Do NOT score or judge. Only gather facts.
Collect all repo outputs into ALL_FACTS.
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
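Gate B also asks that the required keys match references/fact-schema.md, which a bare JSON parse does not verify. The sketch below shows one way to combine both checks. `check_facts` is a hypothetical helper, and `required_keys` is a placeholder parameter because the actual key set lives in the schema file and is not reproduced here.

```python
import json
from typing import Iterable, List


def check_facts(raw: str, required_keys: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means the output passes.

    required_keys is a placeholder: the real key set is defined in
    references/fact-schema.md.
    """
    try:
        facts = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(facts, dict):
        return ["top-level value must be a JSON object"]
    return [f"missing key: {key}" for key in required_keys if key not in facts]
```

Returning a list of problems instead of a boolean makes it easier to report which repo agent output failed and why.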
If the agent supports subagents, dispatch one judge agent per dimension (five total) in parallel; otherwise score each dimension sequentially — identical output. Give each unit this brief:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:**
1. Load the **llm-judge** skill's references/judge-agents.md for detailed instructions
2. Follow references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
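Gate C requires five parseable judge outputs, each with a score and justification for every repo label. A possible check is sketched below. The `scores` → label → `score` nesting follows the shape the aggregation step reads; the `justification` key name, the use of the weight keys as dimension identifiers, and the helper name `check_judge_outputs` are assumptions.

```python
import json
from typing import Dict, List

# Assumed dimension identifiers, taken from the default weight keys.
DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]


def check_judge_outputs(raw_outputs: Dict[str, str], labels: List[str]) -> List[str]:
    """Return a list of Gate C problems; an empty list means the gate passes."""
    problems = []
    for dimension in DIMENSIONS:
        raw = raw_outputs.get(dimension)
        if raw is None:
            problems.append(f"{dimension}: no judge output")
            continue
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            problems.append(f"{dimension}: invalid JSON")
            continue
        per_repo = parsed.get("scores") if isinstance(parsed, dict) else None
        if not isinstance(per_repo, dict):
            per_repo = {}
        for label in labels:
            entry = per_repo.get(label)
            if not isinstance(entry, dict) or "score" not in entry:
                problems.append(f"{dimension}: no score for {label}")
            elif not entry.get("justification"):
                # "justification" is an assumed key name
                problems.append(f"{dimension}: no justification for {label}")
    return problems
```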
scores = {}
for repo_label in labels:
    # Collect this repo's per-dimension result from each judge output.
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]['scores'][repo_label]
    # Weights are percentages, so divide by 100 to stay on the 1-5 scale.
    weighted_total = sum(
        scores[repo_label][dim]['score'] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]['weighted_total'] = round(weighted_total, 2)
# Highest weighted total first.
ranking = sorted(labels, key=lambda l: scores[l]['weighted_total'], reverse=True)
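As a worked example of the weighted total, the self-contained sketch below applies the default weights to two invented repos. The labels `alpha` and `beta` and all of their scores are made up for illustration; only the formula and the weights come from this document.

```python
from typing import Dict

WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(dim_scores: Dict[str, int], weights: Dict[str, int]) -> float:
    """Sum of score * weight / 100 over all dimensions, rounded to 2 places."""
    return round(sum(dim_scores[d] * weights[d] / 100 for d in weights), 2)


# Invented scores, for illustration only.
sample = {
    "alpha": {"functionality": 4, "security": 3, "tests": 5, "overengineering": 4, "dead_code": 2},
    "beta": {"functionality": 3, "security": 4, "tests": 3, "overengineering": 4, "dead_code": 4},
}

totals = {label: weighted_total(s, WEIGHTS) for label, s in sample.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

# alpha: 1.20 + 0.75 + 1.00 + 0.60 + 0.20 = 3.75
# beta:  0.90 + 1.00 + 0.60 + 0.60 + 0.40 = 3.50
print(totals, ranking)
```

Because the default weights sum to 100 and each score lies between 1 and 5, the weighted total also lies between 1 and 5.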
Name the winner, explain why they won, and note any close calls or trade-offs.
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
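Writing the report might look like the sketch below. The field list comes from the sentence above, but the exact key names, the `"1.0"` version string, and the helper name `write_report` are assumptions, since the report layout is not spelled out here.

```python
import json
import os
from datetime import datetime, timezone


def write_report(path, repos, weights, scores, ranking, verdict):
    """Write the report as JSON and return the dict that was written.

    Key names and the version string are assumed, not taken from the skill.
    """
    report = {
        "version": "1.0",  # placeholder value
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    # Create the parent directory (for example .beagle) if it is missing.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
    return report
```

A call such as `write_report(".beagle/llm-judge-report.json", repos, weights, scores, ranking, verdict)` would produce the file that Gate D checks.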
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
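One way to render the scores table is sketched below. The column layout and the helper name `render_scores_table` are assumptions; the skill only states that the summary contains a scores table, ranking, verdict, and justifications.

```python
from typing import Dict, List


def render_scores_table(scores: Dict[str, dict], ranking: List[str], dimensions: List[str]) -> str:
    """Build a markdown table with one row per repo, in ranking order."""
    header = ["Repo"] + dimensions + ["Weighted total"]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "---|" * len(header),
    ]
    for label in ranking:
        cells = [label]
        cells += [str(scores[label][dim]["score"]) for dim in dimensions]
        cells.append(f"{scores[label]['weighted_total']:.2f}")
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Building the table from the same `scores` and `ranking` objects that go into the JSON report keeps the summary consistent with the report by construction.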
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
The generated report should include version, timestamp, repo metadata, weights, scores, ranking, and verdict.
| File | Purpose |
|---|---|
| references/fact-schema.md | JSON schema for Phase 1 facts |
| references/scoring-rubrics.md | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
For each repository (in parallel via subagents if supported, otherwise sequentially), run a fact-gathering unit with:
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
**Your Repo:** $REPO_LABEL at $REPO_PATH
**Spec Document:**
$SPEC_CONTENT
**Instructions:** Follow the **llm-judge** skill's references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load the **llm-artifacts-detection** skill ([../../../beagle-core/skills/llm-artifacts-detection/SKILL.md](../../../beagle-core/skills/llm-artifacts-detection/SKILL.md), if available) for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.
After all Phase 1 facts are collected, score the five dimensions (in parallel via subagents if supported, otherwise sequentially), one unit per dimension:
You are the $DIMENSION Judge for the LLM Judge evaluation.
**Spec Document:**
$SPEC_CONTENT
**Facts from all repos:**
$ALL_FACTS_JSON
**Instructions:** Follow the **llm-judge** skill's references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.
Write the aggregated results to .beagle/llm-judge-report.json, then display a markdown summary with scores, ranking, verdict, and detailed justifications.
Before completing (maps to Hard gates D and E):
- .beagle/llm-judge-report.json exists and json.load succeeds.
- weighted_total equals the sum over dimensions of (score × weight / 100) using the configured weights; markdown summary matches the JSON report.
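The arithmetic part of this final check can be automated. The sketch below recomputes each weighted total from a loaded report and confirms the ranking is in descending order. It assumes top-level `weights`, `scores`, and `ranking` keys, which are not confirmed by the skill, and `check_report` is a hypothetical name. Comparing the markdown summary against the JSON is left out.

```python
from typing import List


def check_report(report: dict) -> List[str]:
    """Return consistency problems found in a loaded report dict.

    Assumes report["weights"], report["scores"], and report["ranking"] exist.
    """
    problems = []
    weights = report["weights"]
    for label, entry in report["scores"].items():
        expected = round(
            sum(entry[dim]["score"] * weights[dim] / 100 for dim in weights), 2
        )
        if abs(expected - entry["weighted_total"]) > 1e-9:
            problems.append(
                f"{label}: weighted_total {entry['weighted_total']} != {expected}"
            )
    # Ranked totals must never increase; ties are allowed.
    totals = [report["scores"][label]["weighted_total"] for label in report["ranking"]]
    if any(a < b for a, b in zip(totals, totals[1:])):
        problems.append("ranking is not sorted by weighted_total")
    return problems
```

Checking for a non-increasing sequence instead of comparing against a freshly sorted list avoids false alarms when two repos tie.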