
Skill Eval

Skill evaluation framework. Use when: testing trigger rate, quality compare (with/without skill), or model comparison. Runs via sessions_spawn + sessions_his...
Author: xiaoxing9
Category: Uncategorized · Source: clawhub · Version: v1.1.1 (1 version) · Key: not required
Stars: 0 · Downloads: 606 · Installs: 0
Tags: #benchmark #eval #latest #testing

Overview

openclaw-eval-skill

Evaluation framework for any OpenClaw skill. No claude CLI dependency — all agent execution runs through sessions_spawn + sessions_history.

Scope: Works with CLI tool skills, conversational skills, and API integration skills.


Runtime Actions Disclosure

This skill performs the following actions during evaluation:

| Action | Purpose | When |
| --- | --- | --- |
| Read ~/.openclaw/openclaw.json | Find skill directories (extraDirs) | Path resolution |
| Write to eval-workspace/ | Store evaluation results | Every eval run |
| Call sessions_spawn | Run test queries in isolated sessions | Trigger & quality tests |
| Call sessions_history | Collect conversation data for analysis | After each spawn |
| Persist cleanup="keep" sessions | Required for trigger detection | Trigger rate tests |

NOT performed automatically: Gateway restart, config modification, skill installation. These require manual user action (see "Bundled Test Skill" section).


Quick Eval

Just say:

evaluate weather

The agent will:

  1. Run scripts/resolve_paths.py weather to find all paths
  2. Execute trigger rate + quality compare with detected evals
  3. Output results to eval-workspace/weather/iter-N/

Options:

  • evaluate weather trigger — trigger rate only
  • evaluate weather quality — quality compare only
  • evaluate github --mode all — explicit mode

What gets auto-detected:

  • Skill path: from OpenClaw built-in skills or registered extraDirs
  • Evals: from evals/{skill-name}/ or fallback to evals/example-*.json
  • Output: next available iter-N directory
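The "next available iter-N directory" rule can be sketched as follows. This is a minimal illustration, not the resolver's actual code: the function name `next_iter_dir` is hypothetical, and it assumes "next available" means one past the highest existing `iter-N` number rather than the first gap.

```python
import re
from pathlib import Path


def next_iter_dir(skill_root: Path) -> Path:
    """Return the path of the next unused iter-N directory under skill_root.

    Assumption: numbering starts at 1 and continues from the highest
    existing iter-N, so gaps in the sequence are not reused.
    """
    pattern = re.compile(r"^iter-(\d+)$")
    used = []
    if skill_root.is_dir():
        for child in skill_root.iterdir():
            match = pattern.match(child.name)
            if child.is_dir() and match:
                used.append(int(match.group(1)))
    return skill_root / f"iter-{max(used, default=0) + 1}"
```

For example, `next_iter_dir(Path("eval-workspace/weather"))` returns `eval-workspace/weather/iter-1` when no iteration directory exists yet.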

First step for agent: Run the resolver to get paths:

python scripts/resolve_paths.py {skill-name} --mode {trigger|quality|all}

Use the JSON output to fill in paths for the workflows below.
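A hedged sketch of calling the resolver from Python and parsing its output. The command line is the one documented above; the helper names `resolver_command` and `resolve_paths` are hypothetical. It also assumes the script prints a single JSON object to stdout and nothing else. The keys of that JSON are not documented here, so the sketch does not rely on any.

```python
import json
import subprocess
import sys

VALID_MODES = ("trigger", "quality", "all")


def resolver_command(skill_name: str, mode: str = "all") -> list:
    """Build the argv for scripts/resolve_paths.py as documented above."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return [sys.executable, "scripts/resolve_paths.py", skill_name, "--mode", mode]


def resolve_paths(skill_name: str, mode: str = "all") -> dict:
    """Run the resolver and parse its stdout as JSON.

    Assumption: stdout contains only the JSON document.
    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    result = subprocess.run(
        resolver_command(skill_name, mode),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

Run it from the skill's root directory so the relative path `scripts/resolve_paths.py` resolves.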


Bundled Test Skill: fake-tool

A test skill (test-skills/fake-tool/) is included for validating trigger rate detection. It simulates a fictional "Zephyr API" that models cannot know from training.

Manual setup required: The agent will NOT automatically install fake-tool or restart your gateway. If you want to test with fake-tool:

  1. Copy fake-tool to your skills directory:

```bash
cp -r test-skills/fake-tool ~/.openclaw/workspace/skills/
```

  2. Restart OpenClaw gateway (from terminal):

```bash
openclaw gateway restart
```

  3. Verify registration:

```bash
python scripts/resolve_paths.py fake-tool
```

If step 3 returns a valid path, fake-tool is ready. If "not found", check that your ~/.openclaw/openclaw.json includes the skills directory in skills.load.extraDirs.
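When the resolver reports "not found", a manual check along these lines may help narrow it down. This is a sketch under several assumptions, not the resolver's logic: it assumes openclaw.json parses as strict JSON (a config with comments or trailing commas would need a different parser), that `skills.load.extraDirs` is a list of path strings, and that a skill lives at `<extraDir>/<skill-name>`. The function names are hypothetical. The symlink check mirrors the "symlinks rejected" constraint listed under Key Constraints.

```python
import json
from pathlib import Path
from typing import Optional


def load_extra_dirs(config_path: Path) -> list:
    """Read skills.load.extraDirs from the config, expanding ~ in each entry."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    dirs = config.get("skills", {}).get("load", {}).get("extraDirs", [])
    return [Path(d).expanduser() for d in dirs]


def find_skill_dir(config_path: Path, skill_name: str) -> Optional[Path]:
    """Return the first real (non-symlink) directory named skill_name, if any."""
    for base in load_extra_dirs(config_path):
        candidate = base / skill_name
        # Assumed layout: <extraDir>/<skill-name>; symlinked skills are skipped.
        if candidate.is_dir() and not candidate.is_symlink():
            return candidate
    return None
```

A call such as `find_skill_dir(Path.home() / ".openclaw" / "openclaw.json", "fake-tool")` returns `None` when the directory is missing, symlinked, or not under any registered extraDirs entry.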


Evaluation Scenarios

Tier 1: Core (Always Run)

| Scenario | What It Tests | Output |
| --- | --- | --- |
| Trigger Rate | Does description trigger SKILL.md reads at the right times? Includes positive (should trigger) AND negative (should NOT trigger) cases. | recall, specificity, precision, F1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | quality_score, assertion pass rate |
| Description Diagnosis | Why did triggers fail? Analyzes both false negatives AND false positives. | gap analysis, recommendations |
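The four Trigger Rate outputs follow the standard confusion-matrix definitions. The sketch below shows how they could be computed from (expected, triggered) pairs. The function name and input shape are illustrative assumptions, not the interface of the bundled analysis scripts, and returning 0.0 for an empty denominator is a choice made here for simplicity.

```python
def trigger_metrics(results):
    """Compute recall, specificity, precision and F1.

    results: iterable of (expected, triggered) boolean pairs, one per query.
    """
    tp = fp = tn = fn = 0
    for expected, triggered in results:
        if expected and triggered:
            tp += 1  # should trigger, did trigger
        elif expected and not triggered:
            fn += 1  # should trigger, did not (false negative)
        elif not expected and triggered:
            fp += 1  # should not trigger, did (false positive)
        else:
            tn += 1  # should not trigger, did not

    def ratio(numerator, denominator):
        # Assumption: report 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
```

Recall reflects missed positive cases, while specificity and precision reflect triggers on negative cases, which is why the eval set needs both kinds of query.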

Tier 2: Optional (Run When Needed)

| Scenario | What It Tests | When to Use |
| --- | --- | --- |
| Model Comparison | Quality + speed across haiku/sonnet/opus | Before deployment: which model is enough? |
| Efficiency Profile | Response time + retry patterns | When skill feels slow: is agent walking wrong paths? |

Tier 3: Future (Roadmap)

| Scenario | What It Tests | Status |
| --- | --- | --- |
| Cross-skill Conflict | Two skills with overlapping descriptions | Planned |
| Error Recovery | Does agent recover when CLI fails? | Planned |

How This Skill Works

Two-layer architecture:

Layer 1: Agent (main OpenClaw session) — YOU ARE HERE
  → Reads evals.json
  → Calls sessions_spawn to run subagents
  → Calls sessions_history to collect results
  → Writes raw data to workspace/

Layer 2: Python analysis scripts (run via exec)
  → Read the raw data from workspace/
  → Compute statistics
  → Generate reports

Python scripts (analyze_*.py) are data processors — they cannot call sessions_spawn. The agent drives the workflow.


Usage

Follow USAGE.md for all workflows.

Quick reference:

| Workflow | What It Tests | USAGE.md Section |
| --- | --- | --- |
| Trigger Rate | Does description trigger SKILL.md reads at the right times? | Workflow 1 |
| Quality Compare | Does skill improve output vs no-skill baseline? | Workflow 2 |
| Model Comparison | Quality + Speed across haiku/sonnet/opus | Workflow 3 |
| Latency Profile | Response time p50/p90 | Workflow 4 |

Each workflow follows the same pattern:

  1. Agent spawns subagents using sessions_spawn
  2. Agent collects histories using sessions_history
  3. Agent writes raw data to workspace/{skill}/iter-{n}/raw/
  4. Agent runs analysis script via exec

Core Principles

  1. Never modify the evaluated skill — observe only, give recommendations
  2. Keep eval records in workspace: output goes to eval-workspace/<skill-name>/iteration-N/
  3. Keep full records — save full_history.json (including tool_use + tool_result)
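Keeping tool_use entries matters because trigger detection depends on seeing whether the subagent read the skill's SKILL.md. The sketch below shows one way such a check could work. The history shape is an assumption: it treats the saved history as nested JSON in which tool calls appear as objects with `"type": "tool_use"` and an `"input"` field. The real sessions_history output may differ, and the function names are hypothetical.

```python
def iter_tool_uses(node):
    """Yield every dict whose "type" is "tool_use", at any nesting depth."""
    if isinstance(node, dict):
        if node.get("type") == "tool_use":
            yield node
        for value in node.values():
            yield from iter_tool_uses(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_tool_uses(item)


def iter_strings(node):
    """Yield every string value found inside a nested dict/list structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def skill_triggered(history, skill_name):
    """True if any tool call input mentions <skill_name>/SKILL.md."""
    needle = f"{skill_name}/SKILL.md"
    return any(
        needle in text
        for call in iter_tool_uses(history)
        for text in iter_strings(call.get("input", {}))
    )
```

Searching only tool call inputs, rather than the whole conversation, avoids counting a run where the model merely mentions the file name in prose.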

agents/ Reference

| File | Purpose | When to Use |
| --- | --- | --- |
| grader.md | Check assertions, record behavior anomalies, give priority recommendations | Required for every Quality Compare eval |
| comparator.md | Blind A/B comparison without assertions | When unbiased comparison is needed |
| analyzer.md | Analyze cross-eval patterns after all evals complete | Post-analysis |

Directory Structure

eval-workspace/<skill-name>/
├── evals.json                    ← Eval definition (shared across iterations)
└── iteration-1/
    ├── raw/
    │   ├── histories/            ← Trigger test session histories
    │   └── transcripts/          ← Quality compare transcripts
    ├── trigger_results.json      ← analyze_triggers output
    ├── quality_results.json      ← analyze_quality output
    └── diagnostics/
        └── RECOMMENDATIONS.md

evals.json Format

Quality Compare (prompt + assertions):

{
  "skill_name": "my-skill",
  "evals": [
    {
      "id": 1,
      "name": "onboarding-fresh",
      "prompt": "Check the weather in Tokyo",
      "context": "Clean machine, no prior setup. For grader only.",
      "expected_output": "Install → configure → verify profile",
      "assertions": [
        {
          "id": "a1-1",
          "description": "Install command executed",
          "type": "output_contains",
          "value": "pip install"
        },
        {
          "id": "a1-2",
          "description": "Profile verified after setup",
          "type": "output_contains",
          "value": "profile current",
          "priority": true
        }
      ]
    }
  ]
}

Trigger Rate (query + expected):

{
  "id": 1,
  "name": "direct-weather",
  "query": "What's the weather in Singapore?",
  "expected": true,
  "category": "positive"
}

Assertion Types

| Type | Detection |
| --- | --- |
| output_contains | Value appears in conversation or tool output |
| output_not_contains | Value does not appear |
| output_count_max | Occurrences ≤ max |
| tool_called | Specific tool called at least once |
| tool_not_called | Specific tool not called |
| conversation_contains | Value appears anywhere in conversation |
| conversation_contains_any | At least one value appears |

Priority assertions ("priority": true): any failure → overall=FAIL.

Gap assertions ("note": "Best practice..."): failure = skill design gap.
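A simplified sketch of how the text-based assertion types and the priority rule could be evaluated against a transcript string. It is illustrative only: grading in this skill is done by the grader agent, the `"max"` field name for output_count_max is an assumption, the tool and conversation assertion types are omitted, and treating non-priority failures as lowering the pass rate without forcing FAIL is a choice made for this example.

```python
def check_assertion(assertion, transcript):
    """Evaluate one text-based assertion against the transcript string."""
    kind = assertion["type"]
    value = assertion.get("value", "")
    if kind == "output_contains":
        return value in transcript
    if kind == "output_not_contains":
        return value not in transcript
    if kind == "output_count_max":
        # Assumption: the limit is stored under a "max" key.
        return transcript.count(value) <= assertion["max"]
    raise ValueError(f"unsupported assertion type: {kind}")


def grade(assertions, transcript):
    """Return per-assertion results, the pass rate, and the overall verdict."""
    results = {a["id"]: check_assertion(a, transcript) for a in assertions}
    # Any failed assertion marked "priority": true forces overall FAIL.
    priority_failed = any(
        a.get("priority") and not results[a["id"]] for a in assertions
    )
    passed = sum(results.values())
    return {
        "results": results,
        "pass_rate": passed / len(results) if results else 0.0,
        "overall": "FAIL" if priority_failed else "PASS",
    }
```

Applied to the two assertions in the Quality Compare example, a transcript that contains "pip install" but not "profile current" yields a 0.5 pass rate and an overall FAIL, because the failed assertion a1-2 is marked as priority.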


Issue Priority (grader output)

🔴 P0 Critical  — Core functionality broken
🟠 P1 High      — Significantly impacts usability
🟡 P2 Medium    — Room for improvement
🟢 P3 Low       — Minor polish

Behavior Anomaly Tracking

Grader records these signals beyond assertions:

| Field | Trigger |
| --- | --- |
| path_corrections | Wrong path then self-corrected |
| retry_count | Same command executed multiple times |
| missing_file_reads | Attempted to read non-existent files |
| skipped_steps | Steps required by skill were not executed |
| hallucinations | Fabricated non-existent commands/APIs |
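As an illustration of the retry_count signal, the sketch below counts repeated executions in a list of command strings. It assumes a retry means the same command string, after trimming whitespace, run more than once. The grader agent may apply a looser judgment, for example by treating near-identical commands as retries.

```python
from collections import Counter


def retry_count(commands):
    """Count repeat executions: every run of a command beyond its first."""
    counts = Counter(cmd.strip() for cmd in commands)
    return sum(n - 1 for n in counts.values() if n > 1)
```

A command run three times contributes two retries, so the total reflects wasted executions rather than the number of distinct commands that were repeated.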

Key Constraints

  • sandbox="inherit" — subagents inherit skill registration environment
  • cleanup="keep" — history must be retained for trigger detection
  • Skill must be in a real directory under skills.load.extraDirs (symlinks rejected)

Version History

1 version in total

  • v1.1.1 (current)
    2026-05-02 05:23, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
