A/B-test whether a skill actually helps or just adds complexity.
Runs inside the current agent runtime (Cursor, OpenClaw, Claude Code). No extra API key required.
Copy this checklist and track progress:
Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design eval plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + boundary + adversarial)
- [ ] Step 4: Dispatch baseline to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill to Sub-Agent B (include full skill)
- [ ] Step 6: Score both runs (rule + result + optional LLM judge)
- [ ] Step 7: Attribute differences and generate report
Steps 1-3 and 6-7: You (orchestrator) do these.
Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.
Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.
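The isolation requirement can be sketched as follows. This is a minimal illustration only: the `Dispatch` type, the `build_dispatches` helper, and the prompt layout are hypothetical, and the real prompt templates live in DISPATCH_PROTOCOL.md. It shows one way to give each sub-agent its own session_id and to guard against skill content leaking into the baseline prompt.

```python
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class Dispatch:
    """One sub-agent dispatch (hypothetical structure, for illustration)."""
    arm: str         # "baseline" or "with_skill"
    session_id: str  # unique per sub-agent so the two runs never share context
    prompt: str


def build_dispatches(task: str, skill_text: str) -> tuple[Dispatch, Dispatch]:
    """Build the baseline and with-skill dispatches for a single task."""
    if not skill_text.strip():
        raise ValueError("skill_text must not be empty")

    # Sub-Agent A sees the task only: no skill content.
    baseline = Dispatch("baseline", uuid4().hex, f"Task:\n{task}")

    # Sub-Agent B sees the full skill followed by the same task.
    with_skill = Dispatch(
        "with_skill", uuid4().hex, f"Skill:\n{skill_text}\n\nTask:\n{task}"
    )

    # Guard: fail loudly if the skill text ended up in the baseline prompt,
    # for example because the task itself quotes the skill.
    if skill_text in baseline.prompt:
        raise ValueError("baseline prompt contains skill content")
    return baseline, with_skill
```

Keeping both prompts identical apart from the skill text means any score difference can be attributed to the skill rather than to a change in task wording.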
Key rules:
session_id for each sub-agent
Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
Inconclusive only after real attempted execution.
For local runs outside an agent:
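A weighted 100-point score and a simple A/B verdict might look like the sketch below. The dimension names, weights, threshold, and verdict labels are all placeholders invented for illustration; the actual values are defined in SCORING_REFERENCE.md. The only properties taken from the text are that there are 6 dimensions, the scale is 100 points, and a result should not be called inconclusive unless both arms actually ran.

```python
from statistics import mean

# Placeholder dimension names and weights (assumed, not from SCORING_REFERENCE.md).
# The weights sum to 100 so a perfect run scores exactly 100 points.
WEIGHTS = {
    "dim_1": 25,
    "dim_2": 20,
    "dim_3": 20,
    "dim_4": 15,
    "dim_5": 10,
    "dim_6": 10,
}


def total_score(ratings: dict[str, float]) -> float:
    """Combine per-dimension ratings in [0, 1] into a 0-100 score."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the 6 dimensions")
    for name, value in ratings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"rating for {name} must be in [0, 1]")
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


def verdict(baseline: list[float], with_skill: list[float],
            threshold: float = 5.0) -> str:
    """Compare mean scores of the two arms (threshold is an assumed value)."""
    # No executed runs means there is nothing to judge yet, so this is an
    # error rather than an "inconclusive" result.
    if not baseline or not with_skill:
        raise ValueError("both arms need at least one executed run")
    delta = mean(with_skill) - mean(baseline)
    if abs(delta) < threshold:
        return "inconclusive"
    return "skill_helps" if delta > 0 else "skill_hurts"
```

Raising an error for an empty arm, instead of returning "inconclusive", keeps a failed or skipped dispatch from being reported as a real evaluation outcome.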
skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db
Add --llm-judge (optionally with --judge-model) for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
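For scripted local runs, the command above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of skillprobe: it uses only the flags shown above, takes its defaults from the example invocation, and treats the skill path as a placeholder. It leaves out --judge-model because its argument is not specified here.

```python
import shlex


def build_command(skill_path: str, tasks: int = 30, repeats: int = 2,
                  db: str = "outputs/evaluations.db",
                  llm_judge: bool = False) -> list[str]:
    """Return the argv list for a local `skillprobe evaluate` run."""
    cmd = [
        "skillprobe", "evaluate", skill_path,
        "--tasks", str(tasks),
        "--repeats", str(repeats),
        "--db", db,
    ]
    if llm_judge:
        cmd.append("--llm-judge")
    return cmd


if __name__ == "__main__":
    # "path/to/skill" is a placeholder; substitute the real skill directory.
    # To execute instead of printing: subprocess.run(cmd, check=True)
    print(shlex.join(build_command("path/to/skill", llm_judge=True)))
```

Passing an argv list rather than a single shell string avoids quoting problems when the skill path contains spaces.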
Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry is collected.