
SkillProbe

A/B-evaluates the real impact of any AI agent skill through three-role isolation (an orchestrator plus two sub-agents). Generates skill profiles, synthetic test tasks,...
Author: luarassassin
Source: clawhub · Version: v1.0.0 · API key: not required

Overview

SkillProbe

A/B evaluate whether a skill actually helps, or just adds complexity.

Runs inside the current agent runtime (Cursor, OpenClaw, or Claude Code). No extra API key is required.

7-Step Workflow

Copy this checklist and track progress:

Evaluation Progress:
- [ ] Step 1: Profile the skill (read SKILL.md, extract domain/triggers/boundaries)
- [ ] Step 2: Design eval plan (task categories, count, difficulty mix)
- [ ] Step 3: Generate test tasks (normal + boundary + adversarial)
- [ ] Step 4: Dispatch baseline to Sub-Agent A (no skill content!)
- [ ] Step 5: Dispatch with-skill to Sub-Agent B (include full skill)
- [ ] Step 6: Score both runs (rule + result + optional LLM judge)
- [ ] Step 7: Attribute differences and generate report

Steps 1-3 and 6-7: You, the orchestrator, do these yourself.

Steps 4-5: Dispatch these to isolated sub-agents. NEVER execute the test tasks yourself.

Steps 1-3: Prepare (Orchestrator)

  1. Profile: Read the target skill's SKILL.md. Extract problem domain, trigger conditions, capabilities, boundaries.
  2. Design plan: Choose task categories (QA, retrieval, coding, analysis, etc.), count, difficulty distribution (easy 30% / medium 40% / hard 20% / edge 10%).
  3. Generate tasks: Create diverse, self-contained test prompts. Do NOT mention the skill name or A/B experiment in task prompts.
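
The difficulty mix in step 2 can be turned into concrete task counts. The following is a minimal sketch, not SkillProbe's own implementation: the function name `allocate_tasks` and the largest-remainder rounding are assumptions made for illustration, and only the 30/40/20/10 percentages come from the text above.

```python
# Difficulty mix from step 2 of the workflow, in percent.
DIFFICULTY_MIX = {"easy": 30, "medium": 40, "hard": 20, "edge": 10}


def allocate_tasks(total: int, mix: dict[str, int] = DIFFICULTY_MIX) -> dict[str, int]:
    """Split `total` tasks across difficulty buckets (hypothetical helper)."""
    if total < 0:
        raise ValueError("total must be non-negative")
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    remainders = {name: total * pct % 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Give the leftover tasks to the buckets with the largest remainders.
    ranked = sorted(mix, key=lambda name: remainders[name], reverse=True)
    for name in ranked[:leftover]:
        counts[name] += 1
    return counts


print(allocate_tasks(30))  # {'easy': 9, 'medium': 12, 'hard': 6, 'edge': 3}
```

With totals that do not divide evenly, the rounding keeps the counts summing to the requested total instead of silently dropping tasks.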

Steps 4-5: Dispatch (Three-Role Isolation)

Create two separate sub-agent sessions. See DISPATCH_PROTOCOL.md for exact prompt templates and constraints.

Key rules:

  • Sub-Agent A (baseline): receives ONLY task prompts, zero skill content
  • Sub-Agent B (with-skill): receives task prompts + full skill content
  • Different session_id for each sub-agent
  • Orchestrator never answers any test task
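
The key rules above can be sketched as a small payload builder. This is an illustration under stated assumptions: the `Dispatch` type, the `build_dispatches` function, and the way skill content is prepended are all hypothetical, and the real prompt templates live in DISPATCH_PROTOCOL.md. The sketch only shows the isolation invariants: the baseline payload carries no skill content, each sub-agent gets its own session_id, and a prompt that names the skill or the A/B setup is rejected.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Dispatch:
    """One payload for one sub-agent (hypothetical structure)."""
    session_id: str
    role: str  # "baseline" or "with_skill"
    prompt: str


def build_dispatches(task_prompt: str, skill_content: str,
                     skill_name: str) -> tuple[Dispatch, Dispatch]:
    # Task prompts must not mention the skill name or the A/B experiment.
    lowered = task_prompt.lower()
    if skill_name.lower() in lowered or "a/b" in lowered:
        raise ValueError("task prompt leaks the skill name or the A/B setup")
    # Sub-Agent A: task prompt only, zero skill content.
    baseline = Dispatch(uuid.uuid4().hex, "baseline", task_prompt)
    # Sub-Agent B: full skill content plus the same task prompt.
    with_skill = Dispatch(uuid.uuid4().hex, "with_skill",
                          f"{skill_content}\n\n{task_prompt}")
    return baseline, with_skill
```

The orchestrator would hand each `Dispatch` to a separate sub-agent session and only collect the outputs, never answering the task itself.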

Steps 6-7: Score and Report (Orchestrator)

Collect outputs from both sub-agents. Score across 6 dimensions (100-point scale). See SCORING_REFERENCE.md for scoring layers, dimension weights, thresholds, and output format.
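
A weighted 100-point score over six dimensions can be computed as sketched below. The dimension names (`dim_1` to `dim_6`) and the weights are placeholders invented for this example; the actual dimensions, weights, and thresholds are defined in SCORING_REFERENCE.md and are not reproduced here.

```python
# Placeholder weights in percentage points; NOT the real SkillProbe weights.
EXAMPLE_WEIGHTS = {"dim_1": 25, "dim_2": 20, "dim_3": 20,
                   "dim_4": 15, "dim_5": 10, "dim_6": 10}


def weighted_total(scores: dict[str, float],
                   weights: dict[str, int] = EXAMPLE_WEIGHTS) -> float:
    """Combine per-dimension scores (0-100 each) into one 0-100 total."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if any(not 0 <= value <= 100 for value in scores.values()):
        raise ValueError("each dimension score must be within 0-100")
    return sum(scores[dim] * weights[dim] for dim in weights) / 100


def skill_delta(baseline: dict[str, float], with_skill: dict[str, float]) -> float:
    """Positive when the with-skill run scored higher than the baseline."""
    return weighted_total(with_skill) - weighted_total(baseline)
```

The delta alone is only the "how much"; the per-dimension differences are what the attribution step would then explain.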

Principles

  1. Three-role isolation: Orchestrator designs and scores. Sub-agents execute. Never mix.
  2. Real execution only: No hypothetical or simulated outputs.
  3. Evidence-backed scoring: Rules and results first; LLM judge optional.
  4. Attribution over numbers: Explain WHY, not just how much.
  5. Finish before claiming uncertainty: Inconclusive only after real attempted execution.

Standalone CLI (Optional)

For local runs outside an agent:

skillprobe evaluate <skill-path> --tasks 30 --repeats 2 --db outputs/evaluations.db

Add --llm-judge [--judge-model ] for pairwise judge scoring. The CLI uses whatever LLM provider the local runtime is configured with.
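
For scripted runs, the command line can be assembled programmatically. The sketch below uses only the flags shown above (`--tasks`, `--repeats`, `--db`, `--llm-judge`); the helper name `build_command` and the example skill path are assumptions, and `--judge-model` is left out because its argument is not specified in this document.

```python
import shlex


def build_command(skill_path: str, tasks: int = 30, repeats: int = 2,
                  db: str = "outputs/evaluations.db",
                  llm_judge: bool = False) -> list[str]:
    """Assemble the argv list for a `skillprobe evaluate` run (hypothetical helper)."""
    cmd = ["skillprobe", "evaluate", skill_path,
           "--tasks", str(tasks), "--repeats", str(repeats), "--db", db]
    if llm_judge:
        cmd.append("--llm-judge")  # enables pairwise judge scoring
    return cmd


# "skills/my-skill" is an example path, not one from the SkillProbe docs.
print(shlex.join(build_command("skills/my-skill")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` when the CLI is installed locally.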

Reference Files

  • DISPATCH_PROTOCOL.md: Three-role architecture, sub-agent prompt templates, dispatch constraints, evidence requirements
  • SCORING_REFERENCE.md: Scoring layers, 6-dimension weights, derived metrics, recommendation thresholds, report format

Security & Privacy

Skill content and task prompts are sent only to the configured LLM provider. All evaluation data is stored locally. No telemetry is collected.

Version History

1 version in total

  • v1.0.0 (current)
    2026-03-30 07:09, marked safe

Security Scan

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
