Multi-Skill-Eval | Integrated Skill Evaluation System

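An integrated, multi-method skill evaluation system. It combines static analysis (skill-assessment), rubric-based quality scoring (skill-evaluator), and autonomous benchmarking (skill-eval). Use it to evaluate, compare, audit, or improve OpenClaw skills. Coverage includes documentation completeness, code quality, 25-criterion rubric scoring, and multi-model benchmarking. Trigger words (Chinese): 评估技...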
wangzairong
Uncategorized · clawhub · v1.0.2 · 1 version · 100000 · Key: not required
★ 1 star · 📥 364 downloads · 💾 0 installs
#assessment #benchmark #evaluation #latest #openclaw #quality #skill

Overview

Multi-Skill-Eval v1.0.0

Integrated Multi-Method Skill Evaluation System

Combines three evaluation approaches into one unified system:

  1. Skill Assessment — lightweight static analysis (fast, automated)
  2. Skill Evaluator — 25-criterion rubric scoring (ISO 25010, OpenSSF, Shneiderman)
  3. Skill-Eval — autonomous benchmark evaluation with skill card generation

🚀 Quick Start

# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill

# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick

# Full evaluation with a detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full

# Compare two skills
multi-skill-eval --compare skill-a skill-b

# Batch-evaluate all local skills
multi-skill-eval --all

# Run the benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2

Three Evaluation Methods

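Method 1: Static Analysis (fast, about 30 seconds)

Lightweight automated checks covering four dimensions:

python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json    # machine-readable output

Checks performed:

  • Documentation completeness (SKILL.md, description quality, examples)
  • Code quality and security signals (script syntax, error handling)
  • Configuration friendliness (documented environment variables, clear defaults)
  • Maintainability signals (versioning, recent updates)

Output: a 0-100 score plus a list of issues grouped by severity.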
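
As a rough illustration of how a 0-100 score and a severity-grouped issue list might be assembled from the four dimensions above, here is a minimal sketch. The dimension keys, the equal weighting, the issue dictionary shape, and both function names are assumptions made for this example; they are not taken from static-analyze.py.

```python
from collections import defaultdict

# Hypothetical keys standing in for the four dimensions listed above.
DIMENSIONS = ("documentation", "code_quality", "configuration", "maintenance")


def combined_score(dimension_scores):
    """Average four 0-100 dimension scores (equal weights are an assumption)."""
    total = sum(dimension_scores[name] for name in DIMENSIONS)
    return round(total / len(DIMENSIONS))


def group_by_severity(issues):
    """Return {severity: [messages]} from a list of issue dictionaries."""
    grouped = defaultdict(list)
    for issue in issues:
        grouped[issue["severity"]].append(issue["message"])
    return dict(grouped)
```

A report writer could then print the groups in severity order, most serious first.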


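Method 2: Rubric Scoring (detailed, about 10 minutes)

25 criteria across 8 categories, combining automated checks with manual review.

Run the automated structural checks:

python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose

Then score manually using references/rubric.md.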

The 25 Criteria (8 Categories)

| # | Category | Framework | Criteria |
|---|----------|-----------|----------|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability (AI Agent) | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability (Human) | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |

Scoring: Each criterion 0–4. Total 100 max.

| Score | Verdict | Action |
|-------|---------|--------|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |

Rubric Score Sheet

Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md.
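
Once the score sheet is filled in, the total and verdict could be tallied along the following lines. This is a sketch only: `rubric_total` and `verdict` are hypothetical helper names, and the logic simply restates the 0–4 per-criterion scale and the verdict bands given above.

```python
def rubric_total(scores):
    """Sum 25 criterion scores, each from 0 to 4, for a maximum of 100."""
    if len(scores) != 25:
        raise ValueError("expected exactly 25 criterion scores")
    if any(not 0 <= s <= 4 for s in scores):
        raise ValueError("each criterion must be scored from 0 to 4")
    return sum(scores)


def verdict(total):
    """Map a 0-100 total to the verdict bands from the scoring table."""
    if total >= 90:
        return "Excellent"
    if total >= 80:
        return "Good"
    if total >= 70:
        return "Acceptable"
    if total >= 60:
        return "Needs Work"
    return "Not Ready"
```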

P0 Issues (blocks publishing):

  • Missing SKILL.md or invalid frontmatter
  • Hardcoded credentials or secrets
  • Phantom tooling (referenced scripts not in package)
  • No description or description < 50 chars

P1 Issues (should fix):

  • No usage examples
  • No error handling in scripts
  • Missing dependency documentation
  • Unclear trigger conditions
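
Some of the P0 checks lend themselves to automation. The sketch below covers three of them (missing SKILL.md, missing frontmatter or a short description, and likely hardcoded secrets). The regular expression, the single-line `description:` assumption, and the function names are illustrative choices for this example rather than the logic of eval-skill.py; it avoids PyYAML only to stay self-contained.

```python
import re
from pathlib import Path

# Illustrative pattern only; a real scanner would need a much broader rule set.
SECRET_PATTERN = re.compile(
    r"(api[_-]?key|secret|token|password)\s*[:=]\s*['\"][^'\"]{8,}['\"]",
    re.IGNORECASE,
)


def read_description(skill_md_text):
    """Return the single-line `description:` value from a leading `---` block."""
    match = re.match(r"^---\s*\n(.*?)\n---", skill_md_text, re.DOTALL)
    if not match:
        return None  # no frontmatter block found
    for line in match.group(1).splitlines():
        if line.startswith("description:"):
            return line.split(":", 1)[1].strip().strip("'\"")
    return ""  # frontmatter present, but no description key


def find_p0_issues(skill_dir):
    """Collect a subset of the P0 issues listed above for one skill directory."""
    skill_dir = Path(skill_dir)
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["Missing SKILL.md"]
    issues = []
    description = read_description(skill_md.read_text(encoding="utf-8"))
    if description is None:
        issues.append("Invalid or missing frontmatter")
    elif len(description) < 50:
        issues.append("No description or description < 50 chars")
    for path in sorted(skill_dir.rglob("*")):
        if path.is_file() and path.suffix in {".py", ".sh", ".js", ".md"}:
            content = path.read_text(encoding="utf-8", errors="ignore")
            if SECRET_PATTERN.search(content):
                issues.append(f"Possible hardcoded credential in {path.name}")
    return issues
```

Pattern-based secret detection produces false positives and misses, so any hit is a prompt for manual review, not a verdict.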

Method 3: Autonomous Benchmark (deep, about 30 minutes per skill)

Full multi-phase evaluation with multi-model support. Requires AI agent execution.

# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4

> ⚠️ Note: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.

> 📋 Planned: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.

Phase 1: Pre-flight Analysis

  1. Read SKILL.md — understand claims, dependencies, target use cases
  2. Classify skill type:
    • Capability uplift — teaches the agent something it can't do well
    • Encoded preference — sequences steps according to specific process
  3. Dependency check:
    • Required CLI tools, API keys, env vars
    • Mark dependency-gated if credentials missing (skip eval, not fault of skill)
    • Check for phantom tooling (referenced scripts not in package)
  4. Marketing claims check: flag any metrics ("7.8x faster") without evidence
  5. Read knowledge base: knowledge/lessons.md, eval-patterns.md, failures.md
  6. Check prior evaluations: knowledge/skill-profiles/.md
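
The marketing claims check in step 4 can be approximated with a pattern match. The following sketch flags phrases such as "7.8x faster" so a reviewer can ask for evidence; the pattern and the `find_metric_claims` name are assumptions for illustration and will not catch every phrasing.

```python
import re

# Matches figures such as "7.8x faster" or "40% fewer". Illustrative only.
CLAIM_PATTERN = re.compile(
    r"\b\d+(?:\.\d+)?\s*(?:x|×|%)\s+"
    r"(?:faster|slower|cheaper|better|fewer|more|less)\b",
    re.IGNORECASE,
)


def find_metric_claims(text):
    """Return every quantitative claim found in the text, in order."""
    return [match.group(0) for match in CLAIM_PATTERN.finditer(text)]
```

Each returned phrase would then be checked by hand against whatever evidence the skill provides.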

Phase 2: Test Case Design

Design 2-3 test prompts across four categories:

  • Outcome — Did the task complete correctly?
  • Process — Did the agent follow the skill's intended steps?
  • Style — Does output follow skill-claimed conventions?
  • Efficiency — Reasonable time/token usage?

Assertion design (two layers):

Layer 1: Deterministic checks (fast, reproducible)

  • File existence, word counts, keyword presence
  • Format compliance (valid JSON, SQL, markdown)
  • Programmatic verification (run tests, check syntax)

Layer 2: Rubric-based quality assessment (LLM-as-judge)

  • Judge model (NOT execution model) grades output against specific rubric
  • Structured scoring, not pass/fail

Key assertion patterns:

  • Banned-word checks for style-constrained skills (highly discriminating)
  • Methodology/structure assertions for technical domains (baseline already strong on correctness)
  • Output-floor assertions: required sections must appear even in error/fallback paths
  • Bilingual keyword variants for Chinese-language skills (索引/index, 前导通配符/leading wildcard)
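
Layer 1 checks are plain functions over the output text. As a sketch of three of the patterns above (banned words, bilingual keyword variants, and format compliance), assuming hypothetical helper names and using only the standard library:

```python
import json
import re


def contains_banned_words(output, banned):
    """Return the banned words that appear as whole words (case-insensitive)."""
    found = []
    for word in banned:
        if re.search(rf"\b{re.escape(word)}\b", output, re.IGNORECASE):
            found.append(word)
    return found


def mentions_any(output, variants):
    """True if any keyword variant appears, e.g. ("索引", "index")."""
    lowered = output.lower()
    return any(variant.lower() in lowered for variant in variants)


def is_valid_json(output):
    """Format-compliance check: does the output parse as JSON?"""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True
```

Because these checks are deterministic, the same output always yields the same result, which is what makes them suitable as the first assertion layer.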

Phase 3: Execution

For each test case, spawn two subagents:

With-skill:

[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/

Without-skill (baseline):

[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
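
A minimal sketch of how the paired prompts might be rendered from the two templates above. `build_prompts` and its parameter names are hypothetical, and spawning the subagents themselves is out of scope here.

```python
WITH_SKILL = (
    "[Model: {model}]\n"
    "Read the skill at {skill_path}/SKILL.md and follow its instructions.\n"
    "Task: {prompt}\n"
    "Save outputs to: {workspace}/iteration-{n}/{test_name}/with_skill/outputs/"
)

WITHOUT_SKILL = (
    "[Model: {model}]\n"
    "Complete this task using only built-in capabilities. Do NOT read SKILL.md.\n"
    "Task: {prompt}\n"
    "Save outputs to: {workspace}/iteration-{n}/{test_name}/without_skill/outputs/"
)


def build_prompts(model, skill_path, prompt, workspace, n, test_name):
    """Render the with-skill and baseline prompts for one test case."""
    values = {
        "model": model,
        "skill_path": skill_path,
        "prompt": prompt,
        "workspace": workspace,
        "n": n,
        "test_name": test_name,
    }
    return {
        "with_skill": WITH_SKILL.format(**values),
        "without_skill": WITHOUT_SKILL.format(**values),
    }
```

Keeping the model, task, and iteration identical across the pair means the only intended difference between the two runs is access to the skill.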

Multi-model mode: Run same skill across multiple models to check cross-model consistency.

Phase 4: Grading

Programmatic grading for deterministic checks. LLM-based grading for qualitative:

python3 scripts/grade-assertions.py --workspace /path/to/results

Save to grading.json:

{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "..."}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
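
The `summary` object follows directly from the `expectations` list. A small sketch, assuming a hypothetical `summarize` helper and rounding the pass rate to two decimals:

```python
def summarize(expectations):
    """Build the summary block from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for item in expectations if item["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Avoid dividing by zero when no assertions were defined.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }
```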

Phase 5: Benchmark Aggregation

{
  "with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
  "delta": {"pass_rate": "+0.XX", "time": "+Xx"},
  "model_used": "claude-sonnet-4",
  "verdict": "Recommended"
}

Efficiency flags: Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
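
The efficiency flag can be expressed as a simple comparison. In the sketch below, "quality delta ≈ 0" is interpreted as an absolute pass-rate difference of at most 0.05, and cost is measured in average tokens; both are assumptions for illustration, as is the `aggregate` name.

```python
def aggregate(with_skill, without_skill, quality_tolerance=0.05, cost_ratio_limit=2.0):
    """Compare the two runs and flag high-overhead framework inflation."""
    if without_skill["avg_tokens"] <= 0:
        raise ValueError("baseline avg_tokens must be positive")
    delta = with_skill["pass_rate"] - without_skill["pass_rate"]
    cost_ratio = with_skill["avg_tokens"] / without_skill["avg_tokens"]
    return {
        "delta_pass_rate": round(delta, 2),
        "cost_ratio": round(cost_ratio, 2),
        # Flag: roughly equal quality at more than cost_ratio_limit times the cost.
        "high_overhead": abs(delta) <= quality_tolerance and cost_ratio > cost_ratio_limit,
    }
```

For example, a skill that matches the baseline pass rate while tripling token usage would be flagged, whereas one that raises the pass rate substantially at the same cost would not.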

Phase 6: Skill Card Generation

python3 scripts/generate_skill_card.py \
  --workspace /path/to/results \
  --skill-name "My Skill" \
  --skill-slug my-skill \
  --eval-model claude-sonnet-4 \
  --output skill-cards/my-skill-v1.md

Skill Card Contents:

  • Metadata: name, source, eval date, model, engine version
  • Overall score 0-10 (Quality 0-5 + Delta 0-3 + Efficiency 0-2)
  • With-skill vs without-skill comparison table
  • Per-test-case breakdown with assertions, timing, grading
  • Strengths / Weaknesses
  • Recommendation: Recommended / Conditional / Marginal / Not Recommended
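
The overall score is the sum of three bounded components, and the recommendation follows from the bands listed in the Scoring Summary section. A sketch under those two facts; the clamping behavior and the function names are assumptions, and how each component is derived from the benchmark data is not shown here.

```python
def skill_card_score(quality, delta, efficiency):
    """Sum Quality (0-5), Delta (0-3), and Efficiency (0-2) into a 0-10 score."""

    def clamp(value, upper):
        # Out-of-range inputs are clamped; this is an assumption of the sketch.
        return max(0.0, min(float(value), upper))

    return round(clamp(quality, 5) + clamp(delta, 3) + clamp(efficiency, 2), 1)


def recommendation(score):
    """Map a 0-10 overall score to a recommendation label."""
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"
```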

Phase 7: Leaderboard Update

python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html

Self-Evolution Improvement Engine

> ⚠️ Planned — Not Yet Implemented

>

> The self-evolution improvement engine is designed but not yet implemented. The knowledge base (knowledge/improve/) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.

Planned Improvement Process (Phase 7-12)

  1. Read knowledge base:
    • knowledge/improve/lessons.md: proven strategies
    • knowledge/improve/patterns.md: category-specific playbooks
    • knowledge/improve/failures.md: what NOT to try
  2. Diagnose root cause:
    • Skill too vague? (Doesn't specify enough to change model behavior)
    • Skill redundant? (Teaches things model already knows)
    • Skill too heavy? (Adds overhead without proportional quality gain)
    • Missing structure? (No clear output format)
    • Phantom tooling? (References tools that don't exist)
    • Reference manual anti-pattern? (>200 lines of educational content)
    • Library-as-skill anti-pattern? (Contains code instead of instructions)
  3. Select improvement strategy from patterns:
    • Reference Manual Slim-Down: Delete 70%+ redundant content, add MUST/ALWAYS/NEVER mandates
    • Library-to-Instructions: Convert code to behavioral instructions
    • Phantom Tooling Replacement: Replace missing tool references with inline instructions
    • Overhead Routing: Add quick-mode vs full-framework routing
    • Assertion-Aligned Rewrite: Rewrite to pass specific failed assertions
  4. Rewrite SKILL.md with selected strategy:
    • Default: Remove > Add (delete 60-80% first, then add behavioral mandates)
    • Add specific, enforceable conventions
    • Remove redundant content model already handles
    • Save as SKILL-improved.md
  5. Update assertions to match improved skill
  6. Re-evaluate with improved version

Planned Re-Eval (Phase 10-11)

Run same eval against SKILL-improved.md:

  • Score improved by >= 1.5 points → Success
  • Less than 50% of previously-failed assertions fixed → Document limitation, move on
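
Since this phase is planned and not implemented, the following is only a sketch of how the two criteria above might be applied. The order in which they are checked, the `"unspecified"` result for cases the criteria do not cover, and the function name are all assumptions.

```python
def re_eval_outcome(old_score, new_score, previously_failed, now_fixed):
    """Apply the two planned re-eval criteria to one improved skill."""
    if new_score - old_score >= 1.5:
        return "success"
    # With no previously failed assertions, nothing was left to fix.
    fixed_fraction = now_fixed / previously_failed if previously_failed else 1.0
    if fixed_fraction < 0.5:
        return "document limitation"
    # The source does not say what happens between the two thresholds.
    return "unspecified"
```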

Planned Improvement Knowledge Update (Phase 12)

After each improvement batch:

  • Update knowledge/improve/lessons.md with what worked
  • Update knowledge/improve/patterns.md with reusable patterns
  • Update knowledge/improve/failures.md with failed attempts
  • Fold proven patterns back into this SKILL.md

Scoring Summary

| Method | Speed | Coverage | Best For |
|--------|-------|----------|----------|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min | Full + self-evolution (planned) | Production evaluation, skill improvement |

| Overall Score | Verdict |
|---------------|---------|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |

Anti-Patterns to Detect

  • Reference manual anti-pattern: SKILL.md >200 lines of educational content (not behavioral instructions)
  • Library-as-skill anti-pattern: SKILL.md contains Python/JS class definitions instead of instructions
  • Phantom tooling: SKILL.md references scripts/binaries not in the package
  • Phantom tooling framework skills: Evaluate template/output structure separately from real data execution
  • Unsubstantiated claims: Skill claims specific metrics without evidence — do not use self-reported numbers
  • High-overhead framework inflation: Quality delta ≈ 0 but cost delta >2x — penalize efficiency
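
Three of these anti-patterns can be screened for mechanically. The sketch below uses a raw line count as a crude proxy for "educational content", a regular expression for Python/JS class definitions, and a check that every referenced `scripts/...` file exists. The patterns and the `detect_anti_patterns` name are assumptions for illustration, and the results are hints for a reviewer instead of final findings.

```python
import re
from pathlib import Path

# Illustrative patterns; both will produce some false positives and misses.
SCRIPT_REF = re.compile(r"\bscripts/[\w./-]+\.(?:py|sh|js)\b")
CLASS_DEF = re.compile(
    r"^\s*(?:export\s+)?class\s+\w+(?:\s+extends\s+[\w.]+)?\s*[:({]",
    re.MULTILINE,
)


def detect_anti_patterns(skill_dir):
    """Return anti-pattern hints for the SKILL.md in one skill directory."""
    skill_dir = Path(skill_dir)
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    findings = []
    # Line count alone cannot tell educational content from instructions.
    if len(text.splitlines()) > 200:
        findings.append("reference-manual: SKILL.md exceeds 200 lines")
    if CLASS_DEF.search(text):
        findings.append("library-as-skill: class definition found in SKILL.md")
    for ref in sorted(set(SCRIPT_REF.findall(text))):
        if not (skill_dir / ref).is_file():
            findings.append(f"phantom-tooling: {ref} is referenced but missing")
    return findings
```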

Deeper Security Scanning

For thorough security audits, complement with SkillLens:

npx skilllens scan /path/to/skill

Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.


Dependencies

  • Python 3.6+ (for eval-skill.py, static-analyze.py, grade-assertions.py)
  • PyYAML (pip install pyyaml) — frontmatter parsing
  • Node.js (for SkillLens security scanning)

Version History

1 version in total

  • v1.0.2 (current)
    2026-05-07 08:54, security status: Safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
