A skill for designing custom multi-dimensional evaluation frameworks for AI systems. Walks the user from "I have a system to evaluate" to "I have a calibrated, group-organized scorecard with canonical/proxy duality and explicit failure modes."
The central premise: a single composite score destroys the information you need to debug which dimension actually drove the outcome. This skill produces frameworks that force the reader to look at multiple numbers, with rules for when each measurement is reliable.
After Stage 4, ask: "Want to score additional cases or adjust the rubric?" — this is the calibration loop.
Activate when the user:
Don't activate when:
Goal: extract enough about the user's evaluation domain to design groups and dimensions.
Turn 1 — concrete instances, not abstract criteria. Ask:
> "Give me 1-2 concrete instances of systems you want to evaluate (or have already evaluated). What's the question that comparison should answer? — e.g., 'is system V2 more grounded than V1?' / 'does adding a Critic agent reduce sycophancy?'"
This grounds the design in real comparisons rather than generic axes.
Turn 2 — calibration cases. Ask:
> "Of the systems you've already run, which 2-3 do you have strong intuitions about — i.e., 'I expect X to score higher than Y because Z'? Those are your calibration cases."
If the user has no calibration cases yet, the framework can't be calibrated. Either:
Turn 3 — data availability. Ask:
> "For each calibration case, what data do you have? — structured records (jsonl, database)? narrative logs (markdown, reports)? both? Same schema across cases or different?"
This determines canonical/proxy split for Stage 3.
Turn 4 — capability layers (optional). If the system is complex, ask:
> "If you had to split the evaluation into 3 layers, what would they be? Examples: evidence-quality / process-dynamics / structural-form. Or: retrieval-quality / ranking-quality / adaptation-quality."
The user's natural splits become the groups. If the user can't articulate layers, default to a 3-group structure: (1) evidence/grounding, (2) process/dynamics, (3) structural/architecture. Or use the 4-family alternative shown in memory-bench-taxonomy.md.
By end of Stage 1 you should know:
Author the group structure + dimensions per group.
Step 1: Surface the 12-axis MADEF reference to the user. Ask which axes feel relevant.
Don't force the user to use all 12 — most domains use 5-8 of the MADEF axes plus 0-3 domain-specific additions. The MADEF table at the bottom of madef-axes.md shows likely keep/modify/drop patterns for common domains (single-LLM reasoning, tool-using agents, RAG, multi-step coding).
Step 2: Show the memory-bench-designer's 4-family taxonomy as alternative shape.
This makes the point that group structure is domain-driven. memory-bench has 4 groups (capability families) because memory has those layers. Deliberation has 3 groups (evidence/process/structure) because deliberation has those layers. Don't blindly copy — let the user's domain shape it.
Step 3: Walk the design worksheet. Use axes-design-worksheet.md to fill in:
Aim for 8-12 total dimensions. More than 12 is unmanageable; fewer than 4 is not meaningfully multi-dimensional.
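The dimension-count bounds above can be checked mechanically once the worksheet is filled in. The following is a minimal sketch, not part of the skill itself: the function name `check_dimension_count`, the dict layout, and the placeholder dimension names are all assumptions made for illustration. It treats 4-12 as acceptable and leaves the 8-12 target as a judgment call for the designer.

```python
def check_dimension_count(groups, low=4, high=12):
    """Count dimensions across groups and flag counts outside the stated bounds.

    groups: dict mapping group name -> list of dimension names (assumed layout).
    Returns (total, verdict) where verdict is "too_few", "too_many", or "ok".
    """
    total = sum(len(dims) for dims in groups.values())
    if total < low:
        return total, "too_few"
    if total > high:
        return total, "too_many"
    return total, "ok"


# Hypothetical draft: group names follow the default 3-group structure,
# dimension names are placeholders.
draft = {
    "evidence/grounding": ["A1", "A2", "A3"],
    "process/dynamics": ["B1", "B2", "B3"],
    "structural/architecture": ["C1", "C2"],
}
print(check_dimension_count(draft))
```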
For each dimension designed in Stage 2, fill in the operational rubric using canonical-vs-proxy-decision.md:
⚠ A dimension without all five fields is not yet operational; it is only a sketch.
Apply group-design-principles.md M1-M5 meta-principles:
Apply the framework to the calibration cases the user named in Stage 1.
For each case, populate scorecard.md.tmpl with group-wise scores.
Critical: report group means separately, never a composite. A failing system with one group at 0.9 and another at 0.2 is not the same as a system with all groups at 0.55.
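To make the point about composites concrete, here is a small sketch that reports one mean per group and then shows how a naive composite hides the difference between the two systems described above. The function names, the score layout, and the dimension labels are hypothetical; the numbers are the 0.9 / 0.2 / 0.55 values from the paragraph above.

```python
from statistics import mean


def group_means(scores):
    """Return one mean per group; deliberately computes no overall composite.

    scores: dict mapping group name -> {dimension name: score in [0, 1]}.
    """
    return {group: round(mean(dims.values()), 3) for group, dims in scores.items()}


def naive_composite(means):
    """The single number this skill advises against reporting."""
    return round(mean(means.values()), 3)


# Hypothetical scorecards: same composite, very different profiles.
uneven = {"grounding": {"A1": 0.9, "A2": 0.9}, "dynamics": {"B1": 0.2, "B2": 0.2}}
flat = {"grounding": {"A1": 0.55, "A2": 0.55}, "dynamics": {"B1": 0.55, "B2": 0.55}}

for name, card in (("uneven", uneven), ("flat", flat)):
    means = group_means(card)
    print(name, means, "composite:", naive_composite(means))
```

Both scorecards collapse to the same composite, while the group means show that the first system is strong on one group and weak on the other.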
Verify ordinal predictions: do the calibration cases score in the predicted order? If not:
iteration_log.md (see group-design-principles.md M5)
The framework freezes (becomes versioned) when the calibration ordinals hold and at least 2-3 real adjustments have been logged.
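The ordinal check and the freeze condition can be sketched as follows. This is an illustration under stated assumptions, not the skill's own implementation: the function names, the case names, and the scores are hypothetical, ties are treated as violations, and the minimum of 2 logged adjustments is taken from the lower end of the "2-3" stated above.

```python
def ordinal_violations(predicted_order, group_scores):
    """Return (higher, lower) pairs whose predicted order the scores contradict.

    predicted_order: case names, highest expected score first.
    group_scores: {case name: score} for ONE group (never a composite).
    A tie counts as a violation (assumption: a prediction must be strict).
    """
    violations = []
    for i, higher in enumerate(predicted_order):
        for lower in predicted_order[i + 1:]:
            if group_scores[higher] <= group_scores[lower]:
                violations.append((higher, lower))
    return violations


def can_freeze(violations_by_group, adjustments_logged, min_adjustments=2):
    """Freeze only if every group's ordinals hold and enough adjustments are logged."""
    ordinals_hold = all(not v for v in violations_by_group.values())
    return ordinals_hold and adjustments_logged >= min_adjustments


# Hypothetical calibration cases and scores.
predicted = ["case_x", "case_y", "case_z"]
by_group = {
    "grounding": ordinal_violations(predicted, {"case_x": 0.8, "case_y": 0.6, "case_z": 0.3}),
    "dynamics": ordinal_violations(predicted, {"case_x": 0.5, "case_y": 0.7, "case_z": 0.4}),
}
print(by_group)
print(can_freeze(by_group, adjustments_logged=3))
```

Checking each group separately keeps a failed prediction attributable to a specific group, which is where the rubric adjustment would then be made.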
User: "I have 4 multi-agent debate experiments. The 4th one added claims+verifications infra. I want to evaluate which experiment is doing the most rigorous deliberation."
Stage 1 reveals:
Stage 2 lands on the 12-axis MADEF taxonomy in madef-axes.md, with 3 groups (Grounding / Dynamics / Architecture).
Stage 3 fills in canonical/proxy for each axis. Most legacy experiments need proxy on A1, A3, B1, B2; V4 has canonical on all.
Stage 4 produces 4 scorecards. The ordinals confirm V4 is highest on Group A (Grounding) but the picture is more nuanced on Group B (V3 outscores V4 on dynamics due to more agents and a unique cross-agent finding). The framework surfaces which dimensions move with the architecture change, which is what the user needed.
Full walkthrough: examples/deliberation-system-eval.md.
MIT