Test, evaluate, and grade agent skills across platforms. Catch regressions, verify triggers, score outputs, and track quality over time — without depending on any specific agent runtime.
Choose one of these entry points:
- Trial Mode: test a skill safely before trusting or adopting it.
- Publish Evaluation Mode: evaluate a skill before submitting it to ClawHub or skills.sh.

Always start by reading the target SKILL.md, then run:
```sh
python3 {baseDir}/scripts/eval_skill.py <skill-path>
```
This package bundles one executable:
- scripts/eval_skill.py for static quality, structure, and release-readiness checks

Deterministic graders, runtime smoke tests, and LLM rubric grading are optional workflows you define around the target skill. They are not extra bundled executables inside this package.
Use this skill when you need evidence that a skill is ready to publish, ship, or compare against a previous version.
Typical use cases:
- eval.yaml suite when regression testing is needed

Use Trial Mode when the question is: "Should I trust or adopt this skill at all?"
Trial Mode should focus on:
- SKILL.md, scripts, and references

Use Publish Evaluation Mode when the question is: "Is this skill ready to publish or update?"
Publish Evaluation Mode should focus on:
Do not use this skill when the task is primarily listing optimization rather than quality validation.
Use a different workflow when you need to:
If the static report cannot prove readiness, say what is still unverified and what test evidence is missing.
Common next steps:
- eval.yaml with deterministic checks

Treat third-party or unfamiliar skills as untrusted until reviewed.
When testing a skill:
When using this skill, the assistant should:
- SKILL.md, scripts, and references before judging quality

Organize the final evaluation report like this:
Keep these limits explicit when reporting results:
- scripts/eval_skill.py; deeper deterministic or LLM grading needs additional user-defined setup
- eval.yaml is an optional quality asset, not a universal release requirement
- agents/openai.yaml is optional and should not be treated as a baseline quality gate

Call out these cases explicitly when they appear:
- SKILL.md

- python3.
- SKILL.md with valid YAML frontmatter.

- Is this skill ready to publish on skills.sh, or does it still have quality gaps?
- Audit this skill before I submit it to ClawHub and show me the release blockers.
- Evaluate this skill and tell me if it would trigger correctly for common user prompts.
- Write a test suite for my skill that covers trigger, output, and style checks.
- Run the skill-test grader against this skill folder and show me the report.
- Compare the current version of this skill against the previous version and flag regressions.
- Audit this skill's trigger description: does it fire when it should and stay silent when it shouldn't?
- Benchmark this skill: run 10 trials and give me pass rates with confidence intervals.
- Find trigger gaps, output issues, and release blockers in this skill before publication.

- SKILL.md to understand intent and structure.

```sh
python3 {baseDir}/scripts/eval_skill.py
```
- eval.yaml test suite for the target skill when repeatable regression testing is needed.

The evaluator scores skills across five dimensions:
| Dimension | What it measures | Weight |
| --- | --- | --- |
| Trigger accuracy | Does the skill fire for correct prompts and stay silent for incorrect ones? | 25% |
| Structural integrity | Are referenced files, links, and paths valid and portable? | 20% |
| Content quality | Does SKILL.md document workflow, prerequisites, commands, and completion criteria? | 25% |
| Baseline platform compatibility | Does the skill meet the minimum structural expectations for OpenClaw-style packaging, with other platforms treated as coarse compatibility checks? | 15% |
| Testability | Can automated graders verify the skill's outcomes and boundaries? | 15% |
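
As a rough illustration of how the five weights combine, the sketch below computes a single overall score. It assumes each dimension is scored on a 0 to 1 scale and that the overall score is a plain weighted sum; neither assumption is confirmed for scripts/eval_skill.py, and the dictionary keys and the `overall_score` name are hypothetical.

```python
# Weights copied from the dimension table; the keys are hypothetical identifiers.
WEIGHTS = {
    "trigger_accuracy": 0.25,
    "structural_integrity": 0.20,
    "content_quality": 0.25,
    "platform_compatibility": 0.15,
    "testability": 0.15,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each assumed to lie in [0, 1]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name} must be between 0 and 1")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
```

Because the weights sum to 100%, a skill that scores 1.0 on every dimension gets an overall score of 1.0 under this reading, and a weak result on a 25% dimension costs more than the same result on a 15% one.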
Produce a structured report containing:
- critical, high, medium, low severity

When the user asks you to create tests for a skill, generate an eval.yaml in the target skill folder:
```yaml
version: "1"
defaults:
  trials: 5
  timeout: 300
  threshold: 0.8
trigger_tests:
  - id: explicit-invoke
    should_trigger: true
    prompt: "Use the $<skill-name> skill to do X"
  - id: implicit-invoke
    should_trigger: true
    prompt: "Do X with Y for quick Z experiments"
  - id: negative-control
    should_trigger: false
    prompt: "Add Y to my existing X project"
outcome_tests:
  - id: file-structure
    type: deterministic
    checks:
      - file_exists: "package.json"
      - file_exists: "src/index.ts"
      - command_ran: "npm install"
  - id: style-compliance
    type: llm_rubric
    rubric: |
      Structure (0-0.5): Does the output match the declared project layout?
      Conventions (0-0.5): Are naming, formatting, and tooling choices consistent?
```
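
One plausible reading of the `trials` and `threshold` defaults is that each test runs `trials` times and passes when the fraction of successful trials reaches `threshold`. The sketch below implements that reading and adds a Wilson score interval, which is one common way to report a confidence interval for a pass rate from a small number of trials. The interpretation is an assumption, and the function names are hypothetical; none of this is part of scripts/eval_skill.py.

```python
import math


def pass_rate(results: list[bool]) -> float:
    """Fraction of trials that succeeded."""
    if not results:
        raise ValueError("at least one trial is required")
    return sum(results) / len(results)


def meets_threshold(results: list[bool], threshold: float = 0.8) -> bool:
    """Assumed rule: a test passes when its pass rate is at least `threshold`."""
    return pass_rate(results) >= threshold


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

With only five trials the interval is wide, so a pass rate that just clears the threshold is weak evidence; raising `trials` narrows it.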
The built-in analyzer verifies without running the skill:
- name, description, trigger phrasing, length
- {baseDir}, no dead links
- SKILL.md validity first, with other platform hints checked only when relevant

```sh
# Full evaluation report (text)
python3 {baseDir}/scripts/eval_skill.py /path/to/skill

# JSON output for automation
python3 {baseDir}/scripts/eval_skill.py /path/to/skill --json

# Evaluate with explicit quality keywords
python3 {baseDir}/scripts/eval_skill.py /path/to/skill --keywords "backup,restore,disaster recovery"

# Generate eval.yaml scaffold for a skill
python3 {baseDir}/scripts/eval_skill.py /path/to/skill --scaffold

# Compare two versions of a skill
python3 {baseDir}/scripts/eval_skill.py /path/to/skill-v2 --baseline /path/to/skill-v1
```
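
For automation, the `--json` output could be consumed by a small gate that blocks a release while critical or high findings remain. The sketch below is a guess at how that might look: the JSON schema is not documented here, so the top-level `findings` list and its `severity` field are assumptions, and `run_evaluator` and `blocking_findings` are hypothetical helper names.

```python
import json
import subprocess

# Severity levels that should block a release, per the report severities.
BLOCKING = {"critical", "high"}


def run_evaluator(script_path: str, skill_path: str) -> dict:
    """Run eval_skill.py with --json and parse its stdout as JSON."""
    proc = subprocess.run(
        ["python3", script_path, skill_path, "--json"],
        capture_output=True,
        text=True,
        check=False,  # inspect the report even if the exit code is non-zero
    )
    return json.loads(proc.stdout)


def blocking_findings(report: dict) -> list[dict]:
    """Return findings whose severity is critical or high.

    ASSUMPTION: the report has a top-level "findings" list whose items
    carry a "severity" string; adjust to the real schema.
    """
    return [f for f in report.get("findings", []) if f.get("severity") in BLOCKING]
```

A CI step could call `run_evaluator` and exit non-zero when `blocking_findings` returns anything, after checking the real field names against an actual `--json` report.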
- critical and high findings have been addressed or explicitly justified.

- critical and high issues first.