Overview

SkillProbe

A/B evaluate whether a skill actually helps, or just adds complexity.

Runs inside the current agent runtime (Cursor, OpenClaw, Claude Code). No extra API key required.

7-Step Workflow

Copy this checklist and track progress:

Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design eval plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + boundary + adversarial)
- [ ] Step 4: Dispatch baseline to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill to Sub-Agent B (include full skill)
- [ ] Step 6: Score both runs (rule + result + optional LLM judge)
- [ ] Step 7: Attribute differences and generate report

Steps 1-3 and 6-7: You (orchestrator) do these.

Steps 4-5: Dispatch to isolated sub-agents. NEVER execute tasks yourself.

Steps 1-3: Prepare (Orchestrator)

Profile: Read the target skill's SKILL.md. Extract problem domain, trigger conditions, capabilities, boundaries.
Design plan: Choose task categories (QA, retrieval, coding, analysis, etc.), count, difficulty distribution (easy 30% / medium 40% / hard 20% / edge 10%).
Generate tasks: Create diverse, self-contained test prompts. Do NOT mention the skill name or A/B experiment in task prompts.
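
As a rough illustration of the difficulty distribution above, the sketch below splits a task count across the four levels. The percentages come from the plan; the function name `allocate_tasks` and the largest-remainder rounding are assumptions made for this example, not part of SkillProbe.

```python
# Difficulty mix from the eval plan, expressed as integer percentages.
DIFFICULTY_MIX = {"easy": 30, "medium": 40, "hard": 20, "edge": 10}


def allocate_tasks(total: int, mix: dict[str, int] = DIFFICULTY_MIX) -> dict[str, int]:
    """Split `total` tasks across difficulty levels (hypothetical helper).

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if sum(mix.values()) != 100:
        raise ValueError("percentages must sum to 100")

    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {level: total * pct // 100 for level, pct in mix.items()}
    remainders = {level: total * pct % 100 for level, pct in mix.items()}

    # Hand out any leftover tasks to the levels with the largest remainders.
    leftover = total - sum(counts.values())
    ranked = sorted(mix, key=lambda level: remainders[level], reverse=True)
    for level in ranked[:leftover]:
        counts[level] += 1
    return counts


print(allocate_tasks(30))
```

With 30 tasks the split is exact (9 easy, 12 medium, 6 hard, 3 edge); for counts that do not divide evenly, the rounding rule decides which levels get the extra tasks.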

Steps 4-5: Dispatch (Three-Role Isolation)

Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.

Key rules:

Sub-Agent A (baseline): receives ONLY task prompts, zero skill content
Sub-Agent B (with-skill): receives task prompts + full skill content
Different session_id for each sub-agent
Orchestrator never answers any test task
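
The isolation rules above can be sketched as a small payload builder. This is only an illustration: the `DispatchRequest` type, the `build_dispatches` function, and the way skill content is prepended to the prompt are assumptions, and the real prompt templates live in DISPATCH_PROTOCOL.md.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchRequest:
    """One task sent to one sub-agent (hypothetical structure)."""
    role: str        # "baseline" or "with_skill"
    session_id: str  # distinct per sub-agent
    prompt: str


def build_dispatches(tasks: list[str], skill_content: str) -> list[DispatchRequest]:
    """Build paired requests: Sub-Agent A gets tasks only, Sub-Agent B gets skill + task."""
    # Each sub-agent gets its own session_id so no context is shared.
    baseline_session = f"baseline-{uuid.uuid4()}"
    with_skill_session = f"with-skill-{uuid.uuid4()}"

    requests = []
    for task in tasks:
        # Baseline: the task prompt alone, zero skill content.
        requests.append(DispatchRequest("baseline", baseline_session, task))
        # With-skill: full skill content plus the same task prompt.
        # The concatenation format here is an assumption for illustration.
        requests.append(
            DispatchRequest("with_skill", with_skill_session, f"{skill_content}\n\n{task}")
        )
    return requests


def check_isolation(requests: list[DispatchRequest], skill_content: str) -> None:
    """Raise if any baseline prompt leaks skill content or sessions collide."""
    baseline = [r for r in requests if r.role == "baseline"]
    with_skill = [r for r in requests if r.role == "with_skill"]
    if any(skill_content in r.prompt for r in baseline):
        raise ValueError("baseline prompt contains skill content")
    if {r.session_id for r in baseline} & {r.session_id for r in with_skill}:
        raise ValueError("sub-agents share a session_id")
```

Running `check_isolation` before dispatch is one way to catch an accidental leak of skill content into the baseline run.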

Steps 6-7: Score and Report (Orchestrator)

Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
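
To make the 6-dimension, 100-point scoring concrete, here is a minimal sketch of a weighted total and the with-skill versus baseline delta. The dimension names (`dim_1` to `dim_6`) and the weights are placeholders invented for this example; the actual dimensions, weights, and thresholds are defined in SCORING_REFERENCE.md.

```python
# Placeholder dimensions and integer weights (percent). These are NOT the
# real SkillProbe values; see SCORING_REFERENCE.md for those.
PLACEHOLDER_WEIGHTS = {
    "dim_1": 25, "dim_2": 20, "dim_3": 20,
    "dim_4": 15, "dim_5": 10, "dim_6": 10,
}


def weighted_total(scores: dict[str, float], weights: dict[str, int] = PLACEHOLDER_WEIGHTS) -> float:
    """Combine per-dimension scores (each 0-100) into one 100-point total."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if set(scores) != set(weights):
        raise ValueError("scores must cover exactly the weighted dimensions")
    if any(not 0 <= value <= 100 for value in scores.values()):
        raise ValueError("each dimension score must be between 0 and 100")
    return sum(weights[dim] * scores[dim] for dim in weights) / 100


def skill_delta(baseline: dict[str, float], with_skill: dict[str, float]) -> float:
    """Positive means the with-skill run scored higher than the baseline."""
    return weighted_total(with_skill) - weighted_total(baseline)
```

A single delta is only a starting point: per the principles in this document, the report should also attribute the difference to specific behaviors rather than stop at the number.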

Principles

Three-role isolation: Orchestrator designs and scores. Sub-agents execute. Never mix.
Real execution only: No hypothetical or simulated outputs.
Evidence-backed scoring: Rules and results first; LLM judge optional.
Attribution over numbers: Explain WHY, not just how much.
Finish before claiming uncertainty: Report a result as inconclusive only after a real execution attempt.

Standalone CLI (Optional)

For local runs outside an agent:

skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db

Add --llm-judge [--judge-model ] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
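
For scripted runs, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this section; the helper name `build_evaluate_command` and the example skill path are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_evaluate_command(
    skill_path: str,
    tasks: int = 30,
    repeats: int = 2,
    db: str = "outputs/evaluations.db",
    llm_judge: bool = False,
) -> list[str]:
    """Build the argv for `skillprobe evaluate` using only documented flags."""
    argv = [
        "skillprobe", "evaluate", skill_path,
        "--tasks", str(tasks),
        "--repeats", str(repeats),
        "--db", db,
    ]
    if llm_judge:
        # Enables pairwise judge scoring, as described above.
        argv.append("--llm-judge")
    return argv


# "path/to/skill" is a placeholder path for illustration.
print(shlex.join(build_evaluate_command("path/to/skill", llm_judge=True)))
```

Passing the list form to `subprocess.run` avoids shell quoting issues if the skill path contains spaces.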

Reference Files

DISPATCH_PROTOCOL.md: Three-role architecture, sub-agent prompt templates, dispatch constraints, evidence requirements
SCORING_REFERENCE.md: Scoring layers, 6-dimension weights, derived metrics, recommendation thresholds, report format

Security & Privacy

Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry.

Version History

1 version in total

v1.0.0 (current)

2026-03-30 07:09 Safe

Security Scan

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
