Overview

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality, much as a student grading their own exam would. The solution is to separate the two roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

- Architecture templates for Generator-Evaluator agent pairs
- Evaluator prompt templates calibrated with few-shot scoring examples
- Iteration loop design for 5-15 round refinement cycles
- Playwright integration patterns for real browser-based evaluation
- Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌─────────────────────────────────────┐
 │           Evaluator                 │
 │  • Reads spec (NOT the Generator's  │
 │    reasoning or self-assessment)    │
 │  • Operates artifact via Playwright │
 │    (click, fill form, navigate)     │
 │  • Scores on rubric (0-100)         │
 │  • Writes structured feedback       │
 └────────────────┬────────────────────┘
                  │ score + feedback
                  ↓
         ┌─────────────────┐
         │ Score ≥ target? │
         │   YES → Done    │
         │   NO → Loop     │
         └────────┬────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: The Evaluator works from the original spec and the artifact itself, not from the Generator's reasoning or self-assessment. It evaluates independently, as if it were a real user encountering the product for the first time.

When to Apply

| Scenario | Apply DoubleAgent? |
|----------|--------------------|
| AI-generated frontend UI with interactions | ✅ Yes |
| Multi-step workflow code (forms, flows) | ✅ Yes |
| API endpoint implementation + validation | ✅ Yes |
| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |
| Single-function refactoring | ⚠️ Optional |
| Simple config changes | ❌ Not needed |

Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

- Concrete (measurable outcomes, not vague goals)
- Observable (evaluable through interaction or inspection)
- Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.
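
The three properties above can be made checkable in code. The following is a minimal sketch, not the template from references/architecture.md: the `Requirement` and `Spec` classes, their field names, and the example requirements are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Requirement:
    """One checkable requirement (hypothetical structure, not the skill's template)."""
    req_id: str
    description: str  # concrete: states a measurable outcome
    check: str        # observable: how an evaluator can verify it


@dataclass(frozen=True)
class Spec:
    """Versioned contract handed to both agents unchanged."""
    version: str
    goal: str
    requirements: tuple[Requirement, ...] = field(default_factory=tuple)

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the spec is usable."""
        problems = []
        if not self.version:
            problems.append("spec has no version")
        if not self.requirements:
            problems.append("spec has no requirements")
        seen = set()
        for req in self.requirements:
            if req.req_id in seen:
                problems.append(f"duplicate requirement id: {req.req_id}")
            seen.add(req.req_id)
            if not req.check.strip():
                problems.append(f"{req.req_id} has no observable check")
        return problems


# Illustrative spec; the goal and requirements are made up for this example.
spec = Spec(
    version="1.0.0",
    goal="Signup form for a newsletter",
    requirements=(
        Requirement("R1", "Submitting a valid email shows a confirmation message",
                    "Fill the email field, click submit, read the visible text"),
        Requirement("R2", "Submitting an empty form shows an inline error",
                    "Click submit with no input, read the visible text"),
    ),
)
print(spec.validate())  # an empty list: no problems found
```

Freezing the dataclasses is a design choice here: neither agent can mutate the contract mid-run, so a changed spec has to arrive as a new version.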

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

- Do NOT ask the Generator to self-evaluate
- Do NOT include evaluation criteria in the Generator's prompt
- Provide: spec + iteration history + previous evaluator feedback
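
One way to enforce these rules is to build the Generator's prompt in a single function that only ever receives the spec and the history. This is a sketch under assumptions: `build_generator_prompt`, the section labels, and the wording of the task line are hypothetical, and the history entries are assumed to carry `round`, `score`, and `feedback` keys as in the Step 4 loop.

```python
def build_generator_prompt(spec_text: str, history: list[dict]) -> str:
    """Assemble the Generator's input: the spec, then prior rounds with feedback.

    The rubric is never passed in, and the task line does not ask for
    self-assessment.
    """
    parts = ["SPEC", spec_text.strip()]
    if history:
        parts.append("PREVIOUS ROUNDS")
        for entry in history:
            parts.append(
                f"Round {entry['round']} scored {entry['score']}/100. "
                f"Evaluator feedback: {entry['feedback']}"
            )
    parts.append("TASK")
    parts.append("Produce the next version of the artifact so that it satisfies the spec.")
    return "\n\n".join(parts)


history = [{"round": 0, "score": 55, "feedback": "Submit button has no error state."}]
print(build_generator_prompt("Build a newsletter signup form.", history))
```

Because the function signature has no rubric parameter, leaking evaluation criteria into the Generator's prompt requires a deliberate code change rather than a prompt-editing slip.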

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

- Load references/evaluator-prompts.md for calibrated prompt templates
- Use Playwright MCP for UI/web artifacts (real browser interaction)
- Use structured JSON output for scores to enable automated loop control
- Calibrate with few-shot examples BEFORE running (prevents grade inflation)
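
Structured JSON output only helps loop control if malformed replies are rejected before they reach the loop. The sketch below assumes a reply shape of `{"score": <int>, "feedback": <str>}`; that shape, the `Evaluation` class, and `parse_evaluation` are assumptions for illustration, not the format defined in references/evaluator-prompts.md.

```python
import json
from dataclasses import dataclass


@dataclass
class Evaluation:
    score: int
    feedback: str


def parse_evaluation(raw: str) -> Evaluation:
    """Parse and validate the Evaluator's JSON reply; raise ValueError if malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"evaluator reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("evaluator reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
        raise ValueError(f"score must be an integer in 0-100, got {score!r}")
    feedback = data.get("feedback")
    if not isinstance(feedback, str) or not feedback.strip():
        raise ValueError("feedback must be a non-empty string")
    return Evaluation(score=score, feedback=feedback)


result = parse_evaluation('{"score": 72, "feedback": "Submit button does nothing."}')
print(result.score, result.feedback)
```

Raising on a bad reply lets the caller retry the Evaluator instead of silently treating a parse failure as a low score.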

Step 4: Design the Iteration Loop

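MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

# generator, evaluator, and spec are supplied by the caller.
history = []  # one entry per round: round number, score, feedback

for round_num in range(MAX_ROUNDS):
    # The Generator sees the spec plus every earlier score and feedback.
    output = generator.run(spec, history)
    # The Evaluator sees the spec and the artifact (Playwright-based for UI).
    evaluation = evaluator.run(spec, output)

    history.append({"round": round_num, "score": evaluation.score, "feedback": evaluation.feedback})

    if evaluation.score >= PASS_THRESHOLD:
        break  # target reached

    if evaluation.score_trend == "plateauing":
        generator.switch_approach()  # Complete strategy reset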

See scripts/iteration_loop.py for a complete implementation template.
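
The loop checks a score trend without saying how it is derived. One possible derivation from the round history is sketched below. The `score_trend` function, its labels other than "plateauing", and the `window` and `min_gain` defaults are assumptions for illustration; scripts/iteration_loop.py may do this differently.

```python
def score_trend(history: list[dict], window: int = 3, min_gain: int = 3) -> str:
    """Classify recent progress as 'improving', 'plateauing', or 'insufficient_data'.

    window and min_gain are illustrative defaults, not values from the skill.
    """
    scores = [entry["score"] for entry in history]
    if len(scores) < window + 1:
        return "insufficient_data"
    # Compare the best of the last `window` rounds with the best score before them.
    recent_best = max(scores[-window:])
    earlier_best = max(scores[:-window])
    if recent_best - earlier_best >= min_gain:
        return "improving"
    return "plateauing"


rounds = [{"score": s} for s in (40, 55, 56, 55, 56)]
print(score_trend(rounds))  # plateauing
```

Comparing best scores rather than consecutive scores keeps a single noisy round from triggering a full strategy reset.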

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

- 1 example at ~30/100 (clearly bad)
- 1 example at ~60/100 (mediocre)
- 1 example at ~85/100 (good)
- 1 example at ~95/100 (excellent)

If any score deviates by more than 15 points from its expected value, adjust the Evaluator's prompt or rubric weights before the real run.
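
The 15-point check is simple to automate. The sketch below is not scripts/calibrate_evaluator.py: `check_calibration` is a hypothetical helper, and `fake_evaluator` is a stand-in for a real Evaluator call with made-up scores.

```python
from typing import Callable

TOLERANCE = 15  # maximum allowed gap between expected and actual score


def check_calibration(
    score_fn: Callable[[str], int],
    examples: dict[str, int],
    tolerance: int = TOLERANCE,
) -> list[str]:
    """Score each known example and report those that miss by more than tolerance."""
    failures = []
    for name, expected in examples.items():
        actual = score_fn(name)
        if abs(actual - expected) > tolerance:
            failures.append(f"{name}: expected ~{expected}, got {actual}")
    return failures


# Stand-in for a real Evaluator call; this one inflates weak work.
def fake_evaluator(example_name: str) -> int:
    return {"bad": 62, "mediocre": 70, "good": 88, "excellent": 96}[example_name]


expected_scores = {"bad": 30, "mediocre": 60, "good": 85, "excellent": 95}
for line in check_calibration(fake_evaluator, expected_scores):
    print(line)  # bad: expected ~30, got 62
```

In this made-up run only the clearly bad example fails, which is the typical signature of grade inflation: strong work is scored correctly while weak work is scored too generously.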

Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |
|-----------|--------|-----------------|
| Functional completeness | 30% | Does each spec requirement work end-to-end? |
| Interaction quality | 25% | Click/form/navigation behavior as a real user |
| Edge case handling | 20% | Error states, empty data, boundary inputs |
| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |
| Originality / craft | 10% | Avoids generic/template outputs when spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".
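
The rubric reduces to a weighted sum of per-dimension scores. The sketch below uses the weights from the table; the dictionary keys, the `weighted_score` helper, and the example dimension scores are hypothetical.

```python
RUBRIC_WEIGHTS = {
    "functional_completeness": 0.30,
    "interaction_quality": 0.25,
    "edge_case_handling": 0.20,
    "code_design_quality": 0.15,
    "originality_craft": 0.10,
}


def weighted_score(
    dimension_scores: dict[str, float],
    weights: dict[str, float] = RUBRIC_WEIGHTS,
) -> float:
    """Combine per-dimension scores (each 0-100) into one 0-100 total."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("rubric weights must sum to 1.0")
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return round(sum(weights[d] * dimension_scores[d] for d in weights), 2)


example = {
    "functional_completeness": 90,
    "interaction_quality": 80,
    "edge_case_handling": 60,
    "code_design_quality": 70,
    "originality_craft": 50,
}
print(weighted_score(example))  # 74.5
```

Checking that the weights sum to 1.0 catches the common mistake of raising one weight for a new domain without lowering another.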

Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

1. Navigate to the deployed artifact URL
2. Execute each spec requirement as a user action sequence
3. Observe actual behavior (DOM state, network requests, visual output)
4. Record pass/fail per requirement with screenshots
5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

- playwright_navigate → open URL
- playwright_click → interact with elements
- playwright_fill → fill form inputs
- playwright_screenshot → capture evidence
- playwright_get_visible_text → verify content
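
A spec requirement can be turned into an ordered list of these tool calls before anything touches the browser. The sketch below only builds that plan and does not call Playwright. The tool names come from the list above; the argument names (`url`, `selector`, `value`, `name`), the `ToolCall` class, and `plan_form_check` are assumptions, so check them against the tool schema of the Playwright MCP server in use.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str                                  # one of the tool names listed above
    args: dict = field(default_factory=dict)   # argument names are assumptions


def plan_form_check(url: str, fields: dict[str, str], submit_selector: str) -> list[ToolCall]:
    """Build the ordered tool calls for one 'fill a form and submit' requirement."""
    calls = [ToolCall("playwright_navigate", {"url": url})]
    for selector, value in fields.items():
        calls.append(ToolCall("playwright_fill", {"selector": selector, "value": value}))
    calls.append(ToolCall("playwright_click", {"selector": submit_selector}))
    # Capture evidence first, then read the text the requirement is judged on.
    calls.append(ToolCall("playwright_screenshot", {"name": "after-submit"}))
    calls.append(ToolCall("playwright_get_visible_text"))
    return calls


# Hypothetical URL and selectors for illustration only.
plan = plan_form_check("http://localhost:3000/signup", {"#email": "a@example.com"}, "#submit")
for call in plan:
    print(call.tool, call.args)
```

Keeping the plan as plain data makes each requirement's action sequence reviewable and repeatable across rounds, so score changes reflect the artifact rather than a different test path.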

Reference Files

- references/architecture.md: Detailed architecture patterns, spec templates, and design rationale
- references/evaluator-prompts.md: Ready-to-use Evaluator prompt templates for different artifact types

Scripts

- scripts/iteration_loop.py: Complete iteration loop implementation template
- scripts/calibrate_evaluator.py: Evaluator calibration utility

Version History

1 version in total

v1.0.0 (current)

2026-05-07 12:28
