
Improvement Evaluator

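Use when you need to verify that a Skill improvement actually improves AI execution. Runs AI tasks from a predefined task suite (YAML), judges each as pass/fail, and outputs execution_pass_rate. Not for document-structure scoring (use improvement-learner) or candidate scoring (use improvement-discriminator).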
Author: lanyasheng
Uncategorized · clawhub · v1.1.1 · 1 version · 100000 · Key: not required

Overview

Measures whether a Skill actually makes AI perform better on real tasks, not just whether the SKILL.md document looks well-structured.

Why Execution Testing Matters

Structural scoring (word count, section presence, formatting) correlates poorly with actual AI task performance. Internal benchmarks showed R²=0.00 between document-structure scores and execution pass rates across 40+ skill evaluations. A perfectly formatted SKILL.md can still produce failing task outputs if the instructions mislead the model or omit critical constraints.

Tradeoff: Execution testing is slower and more expensive than structural checks because it invokes the AI model once per task. A 7-task suite at pass@1 costs roughly 7 API calls per candidate plus 7 for the baseline. This is acceptable because structural scoring alone gives no signal about whether the skill actually works. To offset cost, the evaluator can cache baseline results for 7 days (see --baseline-cache-dir) and defaults to --pass-k 1 (single attempt) to keep runs lean.
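The cost arithmetic above can be sketched as a small helper. This is an illustration only, not part of the evaluator: the function name is hypothetical, and it assumes that --pass-k multiplies attempts for both candidate and baseline runs and that a cached baseline costs no calls. Calls made by an LLM rubric judge are not counted.

```python
def estimate_model_calls(num_tasks: int, pass_k: int = 1,
                         baseline_cached: bool = False) -> int:
    """Upper bound on task executions for evaluating one candidate.

    Assumption: each task is attempted at most pass_k times, for both the
    candidate and the baseline. A cached baseline contributes zero calls.
    """
    candidate_calls = num_tasks * pass_k
    baseline_calls = 0 if baseline_cached else num_tasks * pass_k
    return candidate_calls + baseline_calls


print(estimate_model_calls(7))                        # 14 (7 candidate + 7 baseline)
print(estimate_model_calls(7, baseline_cached=True))  # 7
```

With a warm baseline cache, each extra candidate on the same suite costs about half as much, which is why caching matters most when comparing several candidates.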

When to Use

  • Verify that a SKILL.md change improves AI task execution, not just document structure
  • Run a task suite against a candidate SKILL.md and compare pass rate with baseline
  • Get execution_pass_rate as a concrete quality metric for gating decisions
  • Validate that a newly written task suite produces a sane baseline (>20% pass rate)
  • Compare two versions of a skill on the same task suite to detect regressions
  • Feed execution deltas into the improvement-gate for accept/reject decisions
  • Debug low scores by inspecting per-task pass/fail details in the output artifact
  • Run standalone evaluations during skill development without a full pipeline

When NOT to Use

  • Checking SKILL.md structure quality only (use improvement-learner instead)
  • Scoring candidates with semantic rubrics before execution (use improvement-discriminator)
  • Running the full generate-score-evaluate-execute-gate pipeline (use improvement-orchestrator)
  • Measuring document formatting, section counts, or word-level metrics

Task Suite Format

A task suite is a YAML file that defines what tasks to run and how to judge them. Each suite targets a specific skill and contains 5-10 tasks covering the skill's core behaviors. The schema is versioned at "1.0".

# task_suite.yaml -- minimal complete example
skill_id: "target-skill-name"
version: "1.0"
tasks:
  - id: "task-keyword-check"
    description: "Verify output mentions required concepts"
    prompt: "Given these scores {accuracy: 0.9}, what quality tier?"
    judge:
      type: "contains"
      expected: ["POWERFUL"]
    timeout_seconds: 30

  - id: "task-semantic-quality"
    description: "Rubric-scored analysis quality"
    prompt: "Accuracy dropped 0.9 to 0.8 but coverage rose. Accept?"
    judge:
      type: "llm-rubric"
      rubric: "Must mention trade-off analysis and give a recommendation"
      pass_threshold: 0.7
    timeout_seconds: 120

Validation rules enforced at load time:

  • skill_id must be non-empty.
  • version must equal "1.0".
  • Every task needs a unique id, a non-empty prompt, and a judge block.
  • Judge type must be one of contains, pytest, or llm-rubric.
  • For contains: expected must be a non-empty list of strings.
  • For pytest: test_file must start with fixtures/ (path-traversal guard).
  • For llm-rubric: rubric must be non-empty.

See references/task-format.md and references/writing-tasks-guide.md for detailed patterns and anti-patterns.
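The validation rules listed above could be checked along these lines. This is a minimal sketch, not the evaluator's actual loader: validate_suite and its error messages are hypothetical, and the function takes an already-parsed suite (a dict) so that it does not depend on any particular YAML library.

```python
ALLOWED_JUDGES = {"contains", "pytest", "llm-rubric"}


def validate_suite(suite: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the suite is valid."""
    errors = []
    if not suite.get("skill_id"):
        errors.append("skill_id must be non-empty")
    if suite.get("version") != "1.0":
        errors.append('version must equal "1.0"')

    seen_ids = set()
    for index, task in enumerate(suite.get("tasks") or []):
        task_id = task.get("id")
        label = task_id or f"tasks[{index}]"
        if not task_id:
            errors.append(f"{label}: id is required")
        elif task_id in seen_ids:
            errors.append(f"{label}: duplicate id")
        seen_ids.add(task_id)

        if not task.get("prompt"):
            errors.append(f"{label}: prompt must be non-empty")

        judge = task.get("judge")
        if not isinstance(judge, dict):
            errors.append(f"{label}: judge block is required")
            continue
        kind = judge.get("type")
        if kind not in ALLOWED_JUDGES:
            errors.append(f"{label}: unknown judge type {kind!r}")
        elif kind == "contains":
            expected = judge.get("expected")
            if not (isinstance(expected, list) and expected
                    and all(isinstance(item, str) for item in expected)):
                errors.append(f"{label}: expected must be a non-empty list of strings")
        elif kind == "pytest":
            # Only the documented prefix rule is checked here.
            if not str(judge.get("test_file", "")).startswith("fixtures/"):
                errors.append(f"{label}: test_file must start with fixtures/")
        elif kind == "llm-rubric" and not judge.get("rubric"):
            errors.append(f"{label}: rubric must be non-empty")
    return errors


sample = {
    "skill_id": "target-skill-name",
    "version": "1.0",
    "tasks": [{"id": "task-keyword-check", "prompt": "What quality tier?",
               "judge": {"type": "contains", "expected": ["POWERFUL"]}}],
}
print(validate_suite(sample))  # []
```

Collecting all violations instead of raising on the first one makes it easier to fix a new suite in a single pass.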

Judge Types

The evaluator supports three judge types. Choose based on determinism needs and output complexity.

| Judge | Mechanism | Best For |
|-------|-----------|----------|
| ContainsJudge | Checks all expected keywords appear (case-insensitive) | Deterministic presence checks, format validation |
| PytestJudge | Runs pytest on AI output via AI_OUTPUT_FILE env var | Structured output, JSON schema validation |
| LLMRubricJudge | LLM scores output against a rubric (0.0-1.0) | Semantic quality, open-ended evaluation |

Because deterministic judges (Contains, Pytest) are fast and free while LLM judges cost an API call per evaluation, prefer deterministic judges when the pass condition can be expressed as keyword presence or structured format. Reserve LLMRubricJudge for tasks where semantic quality matters and no deterministic proxy exists.
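To make the deterministic case concrete, here is a rough sketch of a contains-style check as the table describes it: every expected keyword must appear, ignoring case. The function name is hypothetical and the real ContainsJudge may differ in details such as whitespace handling.

```python
def contains_judge(output: str, expected: list[str]) -> bool:
    """Pass only if every expected keyword occurs in the output, ignoring case."""
    haystack = output.casefold()
    return all(keyword.casefold() in haystack for keyword in expected)


answer = "Add input Validation, sanitize the payload, and improve error handling."
print(contains_judge(answer, ["validation", "sanitiz", "error handling"]))  # True
print(contains_judge(answer, ["validation", "rate limiting"]))              # False
```

Because matching is by substring, a stem such as "sanitiz" accepts both "sanitize" and "sanitization".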

Judge configuration examples:

# ContainsJudge -- all keywords must appear (case-insensitive)
judge:
  type: "contains"
  expected: ["validation", "sanitiz", "error handling"]

# PytestJudge -- test file receives AI output path via AI_OUTPUT_FILE
judge:
  type: "pytest"
  test_file: "fixtures/test_output_format.py"

# LLMRubricJudge -- score 0.0-1.0, pass if >= threshold
judge:
  type: "llm-rubric"
  rubric: |
    Score 0.0-1.0:
    - 0.8+: Correct analysis with actionable recommendation
    - 0.5-0.8: Partial analysis, missing specifics
    - <0.5: Generic or incorrect
  pass_threshold: 0.7

LLMRubricJudge supports --mock mode for local testing without API calls. In mock mode the judge returns a fixed passing score so you can verify the pipeline wiring without incurring cost.

Evaluate a candidate skill in pipeline mode:

$ python3 scripts/evaluate.py \
    --input ranking.json \
    --candidate-id c1 \
    --task-suite tasks.yaml \
    --state-root /tmp/eval-state

→ {"execution_pass_rate": 0.80, "baseline_pass_rate": 0.70, "delta": 0.10, "verdict": "pass"}

Running the evaluator without a task suite file:

→ Preflight fails with "Task suite not found" -- the evaluator requires a valid task_suite.yaml.

Running with a broken task suite (baseline pass rate < 20%):

→ Aborts with verdict="error" and reason="baseline pass rate X < 0.2". Fix the suite first.

CLI Reference

Two operating modes: pipeline mode (with ranking artifact from discriminator) and standalone mode (direct evaluation during development).

# Pipeline mode -- requires ranking artifact from discriminator stage
python3 scripts/evaluate.py \
  --input ranking-artifact.json \
  --candidate-id cand-01-docs \
  --task-suite task_suites/target-skill/task_suite.yaml \
  --state-root /tmp/eval-state \
  --pass-k 1 \
  --baseline-cache-dir /tmp/baseline-cache \
  --eval-threshold 6.0 \
  --output /tmp/eval-result.json

# Standalone mode -- evaluate a skill directly without pipeline artifacts
python3 scripts/evaluate.py \
  --standalone \
  --task-suite task_suites/deslop/task_suite.yaml \
  --skill-path ./skills/deslop \
  --state-root /tmp/eval-state \
  --mock

| Flag | Required | Default | Purpose |
|------|----------|---------|---------|
| --input | pipeline | -- | Path to ranking artifact JSON from discriminator |
| --candidate-id | pipeline | -- | ID of candidate to evaluate |
| --standalone | standalone | false | Run without ranking artifact |
| --task-suite | always | -- | Path to task suite YAML |
| --state-root | always | -- | Directory for evaluation state and output |
| --skill-path | standalone | -- | Path to SKILL.md or skill directory |
| --pass-k | no | 1 | Attempts per task (passes if any attempt succeeds) |
| --baseline-cache-dir | no | none | Cache baseline results (7-day TTL) |
| --eval-threshold | no | 6.0 | Minimum discriminator score to proceed |
| --mock | no | false | Use mock execution, no claude CLI needed |
| --output | no | auto | Override output path (default: /evaluations/.json) |
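The --pass-k semantics (a task passes if any attempt succeeds) and the resulting pass rate can be sketched as follows. Both function names are hypothetical, and stopping at the first successful attempt is an assumption rather than documented behavior.

```python
from typing import Callable


def task_passes(run_attempt: Callable[[], bool], pass_k: int = 1) -> bool:
    """True if any of up to pass_k attempts succeeds; stops at the first success."""
    return any(run_attempt() for _ in range(pass_k))


def pass_rate(task_results: list[bool]) -> float:
    """Fraction of tasks that passed (0.0 for an empty suite)."""
    return sum(task_results) / len(task_results) if task_results else 0.0


attempts = iter([False, True])  # first attempt fails, second succeeds
print(task_passes(lambda: next(attempts), pass_k=2))  # True
print(pass_rate([True, False, True, True]))           # 0.75
```

Raising --pass-k makes flaky tasks more forgiving, at the price of up to pass_k model calls per task.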

Output Artifacts

The evaluator writes a JSON artifact to /evaluations/.json (or the path specified by --output). Downstream consumers are the improvement-gate and improvement-orchestrator.

| Field | Type | Description |
|-------|------|-------------|
| execution_pass_rate | float | Candidate pass rate (0.0-1.0) |
| baseline_pass_rate | float | Original SKILL.md pass rate (0.0-1.0) |
| delta | float | candidate - baseline; positive means improvement, zero means no change |
| verdict | string | pass, fail, skipped, or error |
| candidate_results | array | Per-task breakdown with task_id, passed, score, duration_ms |
| baseline_results | array | Same structure for baseline run |
| truth_anchor | string | Absolute path to this artifact for audit trail |
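When debugging a low score, the per-task arrays are the first place to look. The sketch below uses only field names from the table above. The helper names and the sample artifact are invented for illustration and are not real evaluator output.

```python
def failed_tasks(artifact: dict) -> list[str]:
    """IDs of tasks the candidate failed."""
    return [r["task_id"] for r in artifact.get("candidate_results", [])
            if not r.get("passed")]


def regressions(artifact: dict) -> list[str]:
    """IDs of tasks that passed at baseline but fail with the candidate."""
    baseline_passed = {r["task_id"] for r in artifact.get("baseline_results", [])
                       if r.get("passed")}
    return [task_id for task_id in failed_tasks(artifact)
            if task_id in baseline_passed]


# Invented sample with the documented fields (score and duration_ms omitted).
sample_artifact = {
    "candidate_results": [
        {"task_id": "task-keyword-check", "passed": True},
        {"task_id": "task-semantic-quality", "passed": False},
    ],
    "baseline_results": [
        {"task_id": "task-keyword-check", "passed": False},
        {"task_id": "task-semantic-quality", "passed": True},
    ],
}
print(failed_tasks(sample_artifact))  # ['task-semantic-quality']
print(regressions(sample_artifact))   # ['task-semantic-quality']
```

A candidate can hold the overall pass rate steady while trading one task for another, so checking regressions separately from delta can catch changes the aggregate hides.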

Verdict logic: pass when delta >= 0 (candidate is at least as good as baseline); fail when delta < 0. skipped when the candidate's discriminator score is below --eval-threshold. error when the baseline pass rate is below 20% (broken task suite).
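The verdict rules can be summarized in a few lines. This is a sketch under assumptions: the function name is hypothetical, the order in which the conditions are checked is a guess, and standalone runs are modeled as having no discriminator score.

```python
from typing import Optional

BASELINE_FLOOR = 0.2  # below this baseline pass rate the suite is treated as broken


def decide_verdict(candidate_rate: float, baseline_rate: float,
                   discriminator_score: Optional[float] = None,
                   eval_threshold: float = 6.0) -> str:
    """Map pass rates (and an optional discriminator score) to a verdict string."""
    if discriminator_score is not None and discriminator_score < eval_threshold:
        return "skipped"
    if baseline_rate < BASELINE_FLOOR:
        return "error"
    # Equivalent to delta >= 0, without floating-point subtraction.
    return "pass" if candidate_rate >= baseline_rate else "fail"


print(decide_verdict(0.80, 0.70))                           # pass
print(decide_verdict(0.60, 0.70))                           # fail
print(decide_verdict(0.80, 0.10))                           # error
print(decide_verdict(0.80, 0.70, discriminator_score=5.0))  # skipped
```

Since a delta of exactly zero yields pass, a gate that wants strict improvement would need to inspect delta directly.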

Related Skills

  • improvement-discriminator -- Runs semantic scoring before this stage. Produces the ranking artifact that this evaluator consumes. Use discriminator when you need LLM panel review scores, not execution-based pass rates.
  • improvement-gate -- Consumes this evaluator's output artifact. Applies a 6-layer mechanical gate (Schema, Compile, Lint, Regression, Review, HumanReview) to decide whether to accept or reject the change.
  • improvement-orchestrator -- Coordinates the full pipeline: generate, discriminate, evaluate, execute, gate. Use orchestrator when you want the end-to-end flow rather than running individual stages.
  • improvement-learner -- Structural quality scoring (6-dimension). Use learner when you only care about document quality metrics, not execution effectiveness.

Version History

1 version in total

  • v1.1.1 (current)
    2026-05-07 15:18, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
