← 返回
未分类 中文

DoubleAgent — Generator-Evaluator Dual Agent Pattern

This skill should be used when designing, implementing, or improving any AI system that requires quality assurance through separation of generation and evalu...
此技能用于设计、实现或改进需通过生成与评估分离来保证质量的AI系统。
zmy1006-sudo zmy1006-sudo 来源
未分类 clawhub v1.0.0 1 版本 100000 Key: 无需
★ 0
Stars
📥 356
下载
💾 0
安装
1
版本
#latest

概述

DoubleAgent Skill

Purpose

The DoubleAgent pattern solves a fundamental problem in AI-generated software: AI self-evaluation bias.

When a single AI agent both generates and evaluates its own output, it systematically overestimates quality — the same cognitive conflict that occurs when a student grades their own exam. The solution is to forcibly separate the two cognitive roles into independent agents with different prompts, goals, and evaluation criteria.

This skill provides:

  1. Architecture templates for Generator-Evaluator agent pairs
  2. Evaluator prompt templates calibrated with few-shot scoring examples
  3. Iteration loop design for 5-15 round refinement cycles
  4. Playwright integration patterns for real browser-based evaluation
  5. Scoring rubric design to prevent score drift and grade inflation

Core Architecture

User Goal / Spec
      ↓
 ┌─────────────┐
 │  Generator  │ ← Produces output (code, UI, content, data)
 └──────┬──────┘
        │ output artifact
        ↓
 ┌────────────────────────────────────┐
 │           Evaluator                │
 │  • Reads spec (NOT generator output)│
 │  • Operates artifact via Playwright │
 │    (click, fill form, navigate)     │
 │  • Scores on rubric (0-100)         │
 │  • Writes structured feedback       │
 └────────────────┬───────────────────┘
                  │ score + feedback
                  ↓
         ┌────────────────┐
         │ Score ≥ target? │
         │   YES → Done    │
         │   NO → Loop     │
         └────────┬────────┘
                  │
                  └──→ Generator (next iteration)

Key principle: The Evaluator reads the original spec, not the Generator's output. It evaluates independently, as if it were a real user encountering the product for the first time.


When to Apply

| Scenario | Apply DoubleAgent? |

|----------|--------------------|

| AI-generated frontend UI with interactions | ✅ Yes |

| Multi-step workflow code (forms, flows) | ✅ Yes |

| API endpoint implementation + validation | ✅ Yes |

| Content generation (reports, copy, docs) | ✅ Yes (text-based evaluator) |

| Single-function refactoring | ⚠️ Optional |

| Simple config changes | ❌ Not needed |


Implementation Steps

Step 1: Define the Spec Contract

Write a clear spec that both agents will reference independently. The spec must be:

  • Concrete (measurable outcomes, not vague goals)
  • Observable (evaluable through interaction or inspection)
  • Versioned (so both agents work from the same contract)

See references/architecture.md for spec template.

Step 2: Configure the Generator Agent

Assign the Generator a single role: produce output that satisfies the spec.

  • Do NOT ask the Generator to self-evaluate
  • Do NOT include evaluation criteria in the Generator's prompt
  • Provide: spec + iteration history + previous evaluator feedback

Step 3: Configure the Evaluator Agent

Assign the Evaluator a single role: independently verify the spec is satisfied.

  • Load references/evaluator-prompts.md for calibrated prompt templates
  • Use Playwright MCP for UI/web artifacts (real browser interaction)
  • Use structured JSON output for scores to enable automated loop control
  • Calibrate with few-shot examples BEFORE running (prevents grade inflation)

Step 4: Design the Iteration Loop

MAX_ROUNDS = 15
PASS_THRESHOLD = 80  # out of 100

for round in range(MAX_ROUNDS):
    output = generator.run(spec, history)
    evaluation = evaluator.run(spec, output)  # Playwright-based
    
    history.append({"round": round, "score": evaluation.score, "feedback": evaluation.feedback})
    
    if evaluation.score >= PASS_THRESHOLD:
        break
    
    if evaluation.score_trend == "plateauing":
        generator.switch_approach()  # Complete strategy reset

See scripts/iteration_loop.py for a complete implementation template.

Step 5: Calibrate the Evaluator

To prevent score drift, run the Evaluator on 3-5 known examples FIRST:

  • 1 example at ~30/100 (clearly bad)
  • 1 example at ~60/100 (mediocre)
  • 1 example at ~85/100 (good)
  • 1 example at ~95/100 (excellent)

If scores deviate >15 points from expected, adjust the Evaluator's prompt or rubric weights before the real run.


Scoring Rubric Design

Effective rubrics for software systems:

| Dimension | Weight | What to Measure |

|-----------|--------|-----------------|

| Functional completeness | 30% | Does each spec requirement work end-to-end? |

| Interaction quality | 25% | Click/form/navigation behavior as a real user |

| Edge case handling | 20% | Error states, empty data, boundary inputs |

| Code/design quality | 15% | Consistency, readability, no obvious anti-patterns |

| Originality / craft | 10% | Avoids generic/template outputs when spec requires uniqueness |

Adjust weights based on the domain. For content systems, increase "originality". For data pipelines, increase "edge case handling".


Playwright Integration (for UI artifacts)

When evaluating web/H5/mini-program outputs, the Evaluator should:

  1. Navigate to the deployed artifact URL
  2. Execute each spec requirement as a user action sequence
  3. Observe actual behavior (DOM state, network requests, visual output)
  4. Record pass/fail per requirement with screenshots
  5. Report structured JSON with score breakdown

Playwright MCP tool calls to use:

  • playwright_navigate → open URL
  • playwright_click → interact with elements
  • playwright_fill → fill form inputs
  • playwright_screenshot → capture evidence
  • playwright_get_visible_text → verify content

Reference Files

  • references/architecture.md — Detailed architecture patterns, spec templates, and design rationale
  • references/evaluator-prompts.md — Ready-to-use Evaluator prompt templates for different artifact types

Scripts

  • scripts/iteration_loop.py — Complete iteration loop implementation template
  • scripts/calibrate_evaluator.py — Evaluator calibration utility

版本历史

共 1 个版本

  • v1.0.0 当前
    2026-05-07 12:28 安全 安全

安全检测

腾讯云安全 (Keen)

安全,无风险
查看报告

腾讯云安全 (Sanbu)

安全,无风险
查看报告

🔗 相关推荐

ai-agent

self-improving agent

pskoett
记录自身发现以实现自我改进的技能
★ 4,168 📥 941,359
design-media

DESIGN.md — AI时代设计规范技能

zmy1006-sudo
DESIGN.md 设计与品牌规范技能。用户需要创建项目设计规范、为AI编写品牌规范或生成符合品牌风格的页面时激活。 触发词:DESIGN.md、设计规范、品牌规范、设计系统、生成设计文档、写DESIGN.md、设计语言。 用于:创建/更新
★ 0 📥 1,562
ai-agent

Agent Browser

rez0
用于 AI 代理的浏览器自动化 CLI。当用户需要与网站交互(包括浏览页面、填写表单、点击按钮、截图等)时使用。
★ 868 📥 348,188