Combines three evaluation approaches into one unified system:
# Full evaluation (all three methods)
multi-skill-eval ~/.openclaw/skills/my-skill
# Quick static analysis
multi-skill-eval ~/.openclaw/skills/my-skill --method quick
# Full evaluation + detailed report
multi-skill-eval ~/.openclaw/skills/my-skill --method full
# Compare two skills
multi-skill-eval --compare skill-a skill-b
# Batch-evaluate all local skills
multi-skill-eval --all
# Benchmark with a specific model
multi-skill-eval ~/.openclaw/skills/my-skill --method benchmark --model minimax/MiniMax-M2
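Lightweight automated checks covering 4 dimensions:
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill
python3 scripts/static-analyze.py ~/.openclaw/skills/my-skill --json # machine-readable output
Checks:
Output: a 0-100 score plus a list of issues grouped by severity.
25 criteria across 8 categories, combining automated checks with manual review.
Run the automated structural checks:
python3 scripts/eval-skill.py ~/.openclaw/skills/my-skill --json --verbose
Then score manually using references/rubric.md.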
| # | Category | Framework | Criteria |
|---|---|---|---|
| 1 | Functional Suitability | ISO 25010 | Completeness, Correctness, Appropriateness |
| 2 | Reliability | ISO 25010 | Fault Tolerance, Error Reporting, Recoverability |
| 3 | Performance / Context | ISO 25010 + Agent | Token Cost, Execution Efficiency |
| 4 | Usability — AI Agent | Shneiderman, Gerhardt-Powals | Learnability, Consistency, Feedback, Error Prevention |
| 5 | Usability — Human | Tognazzini, Norman | Discoverability, Forgiveness |
| 6 | Security | ISO 25010 + OpenSSF | Credentials, Input Validation, Data Safety |
| 7 | Maintainability | ISO 25010 | Modularity, Modifiability, Testability |
| 8 | Agent-Specific | Novel | Trigger Precision, Progressive Disclosure, Composability, Idempotency, Escape Hatches |
Scoring: each of the 25 criteria is scored 0–4, for a maximum total of 100.
| Score | Verdict | Action |
|---|---|---|
| 90–100 | Excellent | Publish confidently |
| 80–89 | Good | Publishable, note known issues |
| 70–79 | Acceptable | Fix P0s before publishing |
| 60–69 | Needs Work | Fix P0+P1 before publishing |
| <60 | Not Ready | Significant rework needed |
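The verdict bands above can be written as a small lookup. This is a minimal sketch rather than part of the tool: the function name `rubric_verdict` is hypothetical, and it assumes the input is the list of 0–4 scores from the rubric.

```python
def rubric_verdict(criterion_scores):
    """Sum 0-4 criterion scores and map the total to a verdict band.

    Hypothetical helper for illustration; thresholds mirror the verdict table.
    """
    if any(not 0 <= s <= 4 for s in criterion_scores):
        raise ValueError("each criterion must be scored 0-4")
    total = sum(criterion_scores)
    bands = [
        (90, "Excellent"),
        (80, "Good"),
        (70, "Acceptable"),
        (60, "Needs Work"),
    ]
    # Bands are ordered from highest floor to lowest; first match wins.
    for floor, verdict in bands:
        if total >= floor:
            return total, verdict
    return total, "Not Ready"


# Example: 25 criteria, all scored 3, gives a total of 75.
print(rubric_verdict([3] * 25))
```

Keeping the bands in a list makes it easy to adjust thresholds in one place if the rubric changes.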
Copy assets/EVAL-TEMPLATE.md to the skill directory as EVAL.md.
P0 Issues (blocks publishing):
P1 Issues (should fix):
Full multi-phase evaluation with multi-model support. Requires AI agent execution.
# Spawn benchmark via AI agent
multi-skill-eval /path/to/skill --method benchmark --model claude-sonnet-4
> ⚠️ Note: The benchmark method requires an AI agent to orchestrate subagent execution. The CLI coordinates the workflow but actual execution happens through AI agent sessions.
> 📋 Planned: Self-evolution improvement engine (Phase 7+) is planned but not yet implemented.
- SKILL.md: understand claims, dependencies, target use cases
- dependency-gated if credentials missing (skip eval, not fault of skill)
- knowledge/lessons.md, eval-patterns.md, failures.md
- knowledge/skill-profiles/.md

Design 2-3 test prompts across four categories:
Assertion design (two layers):
Layer 1: Deterministic checks (fast, reproducible)
Layer 2: Rubric-based quality assessment (LLM-as-judge)
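As an illustration of Layer 1, deterministic checks can be plain functions over the output text. The sketch below is an assumption about what such checks might look like. The function names and the sample output are hypothetical and are not taken from scripts/grade-assertions.py.

```python
import json
import re


def check_contains(output, needle):
    """Pass if the output contains the expected substring."""
    return needle in output


def check_matches(output, pattern):
    """Pass if the regular expression matches anywhere in the output."""
    return re.search(pattern, output) is not None


def check_valid_json(output):
    """Pass if the whole output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical skill output used only to exercise the checks.
sample = '{"status": "ok", "items": 3}'
results = {
    "is valid JSON": check_valid_json(sample),
    "reports status": check_contains(sample, '"status"'),
    "has an item count": check_matches(sample, r'"items":\s*\d+'),
}
print(results)
```

Checks like these give the same answer on every run, which is why they are a good fit for the fast layer. Anything that needs judgment goes to the rubric-based Layer 2.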
Key assertion patterns:
For each test case, spawn two subagents:
With-skill:
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
Without-skill (baseline):
[Model: <execution_model>]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: <prompt>
Save outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
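The two prompts above differ only in the instruction line and the output directory, so they can be rendered from shared fields. This is a minimal sketch under that assumption. `build_prompts` and its parameter names are hypothetical, and how the prompts are handed to subagents is left out.

```python
WITH_SKILL = """[Model: {model}]
Read the skill at {skill_path}/SKILL.md and follow its instructions.
Task: {prompt}
Save outputs to: {workspace}/iteration-{iteration}/{test_name}/with_skill/outputs/"""

WITHOUT_SKILL = """[Model: {model}]
Complete this task using only built-in capabilities. Do NOT read SKILL.md.
Task: {prompt}
Save outputs to: {workspace}/iteration-{iteration}/{test_name}/without_skill/outputs/"""


def build_prompts(model, skill_path, prompt, workspace, iteration, test_name):
    """Return (with_skill_prompt, baseline_prompt) for one test case."""
    fields = {
        "model": model,
        "skill_path": skill_path,
        "prompt": prompt,
        "workspace": workspace,
        "iteration": iteration,
        "test_name": test_name,
    }
    # str.format ignores unused keys, so the baseline template can share fields.
    return WITH_SKILL.format(**fields), WITHOUT_SKILL.format(**fields)


with_prompt, baseline_prompt = build_prompts(
    model="claude-sonnet-4",
    skill_path="/path/to/skill",
    prompt="Summarize the attached changelog.",  # hypothetical test prompt
    workspace="/path/to/results",
    iteration=1,
    test_name="changelog-summary",
)
print(with_prompt)
print(baseline_prompt)
```

Using the same model and task text in both prompts keeps the comparison fair: the only variable is whether the skill is read.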
Multi-model mode: Run the same skill across multiple models to check cross-model consistency.
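One simple way to read cross-model consistency is the spread between the best and worst pass rates. The sketch below assumes that interpretation. The function name, the 0.15 threshold, and the pass rates are illustrative values, not results or settings from the tool.

```python
def consistency_report(pass_rates, max_spread=0.15):
    """Compare per-model pass rates and report the best-to-worst spread.

    max_spread is an assumed tolerance; tune it to your own needs.
    """
    if not pass_rates:
        raise ValueError("at least one model result is required")
    best = max(pass_rates, key=pass_rates.get)
    worst = min(pass_rates, key=pass_rates.get)
    spread = round(pass_rates[best] - pass_rates[worst], 2)
    return {
        "best": best,
        "worst": worst,
        "spread": spread,
        "consistent": spread <= max_spread,
    }


# Made-up pass rates for two of the models named in this document.
rates = {"claude-sonnet-4": 0.9, "minimax/MiniMax-M2": 0.7}
print(consistency_report(rates))
```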
Use programmatic grading for deterministic checks and LLM-based grading for qualitative ones:
python3 scripts/grade-assertions.py --workspace /path/to/results
Save to grading.json:
{
"expectations": [
{"text": "assertion text", "passed": true, "evidence": "..."}
],
"summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
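The `summary` block can be derived directly from the `expectations` list. The following is a minimal sketch of that calculation. The `summarize` helper and the two sample expectations are hypothetical and do not reproduce what scripts/grade-assertions.py does internally.

```python
import json


def summarize(expectations):
    """Build the summary object from a list of graded expectations."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        # Avoid dividing by zero when nothing was graded.
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }


# Hypothetical expectations in the grading.json shape.
expectations = [
    {"text": "output file exists", "passed": True, "evidence": "found report.md"},
    {"text": "report has a summary section", "passed": False, "evidence": "no 'Summary' heading"},
]
grading = {"expectations": expectations, "summary": summarize(expectations)}
print(json.dumps(grading, indent=2))
```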
{
"with_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
"without_skill": {"pass_rate": 0.X, "avg_time": "Ns", "avg_tokens": N},
"delta": {"pass_rate": "+0.XX", "time": "+Xx"},
"model_used": "claude-sonnet-4",
"verdict": "Recommended"
}
Efficiency flags: Flag skills where quality delta ≈ 0 but cost delta >2x ("high-overhead framework inflation").
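The efficiency flag can be sketched as a comparison of the two benchmark entries. This example makes two assumptions the text does not spell out: "quality delta ≈ 0" is read as an absolute pass-rate difference of at most 0.05, and cost is measured by `avg_tokens`. The function name and the sample numbers are hypothetical.

```python
def efficiency_flag(with_skill, without_skill, quality_eps=0.05, cost_limit=2.0):
    """Flag runs where quality barely moves but token cost more than doubles."""
    quality_delta = round(with_skill["pass_rate"] - without_skill["pass_rate"], 2)
    baseline_tokens = without_skill["avg_tokens"]
    if baseline_tokens <= 0:
        raise ValueError("baseline avg_tokens must be positive")
    cost_ratio = round(with_skill["avg_tokens"] / baseline_tokens, 2)
    flagged = abs(quality_delta) <= quality_eps and cost_ratio > cost_limit
    return {
        "quality_delta": quality_delta,
        "cost_ratio": cost_ratio,
        "flagged": flagged,
    }


# Made-up numbers: same pass rate, three times the tokens.
print(efficiency_flag(
    {"pass_rate": 0.8, "avg_tokens": 9000},
    {"pass_rate": 0.8, "avg_tokens": 3000},
))
```

A flagged skill is not necessarily broken. It simply costs more than twice as much without a measurable quality gain on the tested prompts.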
python3 scripts/generate_skill_card.py \
--workspace /path/to/results \
--skill-name "My Skill" \
--skill-slug my-skill \
--eval-model claude-sonnet-4 \
--output skill-cards/my-skill-v1.md
Skill Card Contents:
python3 scripts/generate_leaderboard.py --cards-dir skill-cards --output leaderboard/index.html
> ⚠️ Planned — Not Yet Implemented
>
> The self-evolution improvement engine is designed but not yet implemented. The knowledge base (knowledge/improve/) contains proven patterns and lessons that inform manual skill improvement, but automatic skill rewriting is not available.
- knowledge/improve/lessons.md: proven strategies
- knowledge/improve/patterns.md: category-specific playbooks
- knowledge/improve/failures.md: what NOT to try
- SKILL-improved.md

Run the same eval against SKILL-improved.md:
After each improvement batch:
- knowledge/improve/lessons.md with what worked
- knowledge/improve/patterns.md with reusable patterns
- knowledge/improve/failures.md with failed attempts

| Method | Speed | Coverage | Best For |
|---|---|---|---|
| Static Analysis | ~30s | 4 dimensions | Quick comparison, batch scan |
| Rubric Scoring | ~10min | 25 criteria | Pre-publish audit, detailed report |
| Benchmark Eval | ~30min | Full + self-evolution (planned) | Production evaluation, skill improvement |
| Overall Score | Verdict |
|---|---|
| 7-10 | Recommended |
| 5-6.9 | Conditional |
| 3-4.9 | Marginal |
| 0-2.9 | Not Recommended |
For thorough security audits, complement with SkillLens:
npx skilllens scan /path/to/skill
Checks: exfiltration, code execution, persistence, privilege bypass, prompt injection.
PyYAML (pip install pyyaml), used for frontmatter parsing