
Eval Skills

islinxu
Category: AI Intelligence · Source: clawhub · Version: v0.1.1 · API key: not required

Overview

eval-skills

AI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills.

This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.

When to Use This Skill

  • Before deploying a new skill to production — run eval to verify it meets your quality gate.
  • When choosing between multiple candidate skills — run select to rank them on the same benchmark.
  • When a skill is upgraded — run report diff to detect regressions.
  • In CI/CD — use --exit-on-fail to block merges that degrade skill quality.
  • When bootstrapping a new skill — run create to generate a ready-to-fill skeleton.

Capabilities

1. Find Skills

Search for existing skills by keyword, tag, or adapter type.

eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10

| Option | Description | Default |
| --- | --- | --- |
| -q, --query | Keyword search (matches name, description, tags) | |
| -t, --tag | Filter by tags (intersection: skill must have ALL specified tags) | |
| -a, --adapter | Filter by adapter type (http, subprocess, mcp) | |
| --min-completion | Minimum historical completion rate (0.0 ~ 1.0) | |
| --skills-dir | Directory to scan for skill.json files | ./skills |
| --limit | Maximum number of results | 20 |

Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
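
The tag intersection and the completion-rate ranking can be sketched in Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the `Skill` record, its `completion_rate` field, and `find_skills` are hypothetical names, and query relevance scoring is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    # Hypothetical in-memory record; a real skill.json carries more fields.
    id: str
    tags: set = field(default_factory=set)
    adapter_type: str = "http"
    completion_rate: float = 0.0  # assumed historical completion rate


def find_skills(skills, tags=(), adapter=None, min_completion=0.0, limit=20):
    """Keep skills that carry ALL requested tags, then rank by completion rate."""
    wanted = set(tags)
    matches = [
        s for s in skills
        if wanted <= s.tags  # intersection semantics: every tag must be present
        and (adapter is None or s.adapter_type == adapter)
        and s.completion_rate >= min_completion
    ]
    matches.sort(key=lambda s: s.completion_rate, reverse=True)
    return matches[:limit]


catalog = [
    Skill("search_a", {"retrieval", "api"}, "http", 0.92),
    Skill("search_b", {"retrieval"}, "http", 0.99),  # lacks the "api" tag
    Skill("search_c", {"retrieval", "api"}, "http", 0.85),
]
top = find_skills(catalog, tags=["retrieval", "api"], adapter="http", min_completion=0.8)
print([s.id for s in top])  # ['search_a', 'search_c']
```

Note that `search_b` is dropped despite having the highest completion rate, because it carries only one of the two requested tags.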

2. Create Skills

Generate a skill skeleton from a template to bootstrap development.

eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"

| Option | Description | Default |
| --- | --- | --- |
| --name | Required. Skill name | |
| --from-template | Template type: http_request, python_script, mcp_tool | http_request |
| --output-dir | Output directory | ./skills |
| --description | Human-readable description embedded in skill.json | |

Generated file structure:

skills/my_api_skill/
  skill.json            # Skill metadata (id, schemas, adapter config)
  adapter.config.json   # Adapter-specific configuration
  tests/
    basic.eval.json     # A starter benchmark with one sample task
  skill.py              # (python_script template only) JSON-RPC entrypoint

3. Evaluate Skills

Run benchmark evaluations against one or more skills. This is the core command.

eval-skills eval \
  --skills ./skills/calculator/skill.json ./skills/search/ \
  --benchmark coding-easy \
  --concurrency 4 \
  --timeout 30000 \
  --retries 2 \
  --runs 3 \
  --evaluator exact \
  --format json markdown html \
  --output-dir ./reports \
  --exit-on-fail --min-completion 0.8 \
  --store ./eval-skills.db

| Option | Description | Default |
| --- | --- | --- |
| --skills | Required. Skill file(s) or directory(ies) | |
| --benchmark | Built-in benchmark ID or path to benchmark.json | coding-easy |
| --tasks | Custom tasks JSON file (replaces benchmark) | |
| --concurrency | Number of parallel task executions | 4 |
| --timeout | Per-task timeout in milliseconds | 30000 |
| --retries | Retry count on task failure (with incremental backoff) | 0 |
| --runs | Repeat evaluation N times for consistency scoring | 1 |
| --evaluator | Default scorer type (see Scorer Types below) | exact |
| --format | Output formats: json, markdown, html | json markdown |
| --output-dir | Report output directory | ./reports |
| --exit-on-fail | Exit with code 1 if any skill falls below threshold | disabled |
| --min-completion | Threshold for --exit-on-fail | 0.7 |
| --dry-run | Validate configuration only; do not execute tasks | disabled |
| --benchmarks-dir | Directory containing built-in benchmarks | ./benchmarks |
| --store | SQLite database path for persistent result storage | ./eval-skills.db |
| -c, --config | Path to eval-skills.config.yaml | auto-detected |

Evaluation flow:

  1. Load skills from --skills paths (supports both single skill.json and directories)
  2. Load benchmark tasks from --benchmark or --tasks
  3. Build the cartesian product: skills x tasks x runs
  4. Execute all task items concurrently (controlled by --concurrency, with timeout and retry)
  5. Score each result using the appropriate scorer
  6. Aggregate into SkillCompletionReport per skill
  7. Write reports to --output-dir
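
Steps 3 and 4 of this flow can be approximated with `asyncio`. The sketch below is an assumption-laden illustration rather than the tool's code: `run_task`, `evaluate`, and `fake_invoke` are hypothetical names, the backoff delay is arbitrary, and the adapter call is replaced by a stand-in that adds two integers.

```python
import asyncio
import itertools


async def run_task(skill, task, timeout_s, retries, invoke):
    """Run one (skill, task) pair with a timeout and incremental backoff."""
    for attempt in range(retries + 1):
        try:
            output = await asyncio.wait_for(invoke(skill, task), timeout_s)
            return {"skill": skill, "task": task["id"], "output": output, "status": "ok"}
        except asyncio.TimeoutError:
            status = "timeout"
        except Exception:
            status = "error"
        if attempt < retries:
            await asyncio.sleep(0.01 * (attempt + 1))  # delay grows with each attempt
    return {"skill": skill, "task": task["id"], "output": None, "status": status}


async def evaluate(skills, tasks, runs, concurrency, invoke, timeout_s=30.0, retries=0):
    gate = asyncio.Semaphore(concurrency)  # caps the number of parallel executions

    async def guarded(skill, task, run):
        async with gate:
            result = await run_task(skill, task, timeout_s, retries, invoke)
        return {**result, "run": run}

    # Cartesian product: skills x tasks x runs
    items = itertools.product(skills, tasks, range(runs))
    return await asyncio.gather(*(guarded(s, t, r) for s, t, r in items))


async def fake_invoke(skill, task):
    # Stand-in for a real adapter call: evaluates a tiny "a+b" expression.
    left, right = task["inputData"]["expression"].split("+")
    return str(int(left) + int(right))


tasks = [{"id": "task_001", "inputData": {"expression": "2+3"}}]
results = asyncio.run(
    evaluate(["calc_a", "calc_b"], tasks, runs=3, concurrency=4, invoke=fake_invoke)
)
print(len(results), results[0]["output"])  # 6 5
```

Two skills, one task, and three runs yield six work items; the semaphore bounds how many of them run at once.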

4. Select Skills

Filter and rank skills based on evaluation reports using a multi-dimensional strategy.

eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result.json \
  --strategy ./strategy.yaml \
  --min-completion 0.8 \
  --top-k 5 \
  --output ./selected.json

| Option | Description | Default |
| --- | --- | --- |
| --from | Required. Candidate skills directory or JSON file | |
| --reports | Evaluation reports JSON file | |
| --strategy | SelectStrategy YAML/JSON file | built-in default |
| --min-completion | Override minimum completion rate filter | |
| --top-k | Return only the top K results | all |
| --output | Write selected skills to file | stdout |

Selection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK

Example strategy.yaml:

filters:
  minCompletionRate: 0.8
  maxErrorRate: 0.1
  maxLatencyP95Ms: 5000
  adapterTypes: [http, subprocess]
  requiredTags: [production-ready]
sortBy: compositeScore
order: desc
topK: 5
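
To make the Filter -> Rank -> TopK pipeline concrete, here is a hedged Python sketch that applies a strategy shaped like the YAML above to a list of per-skill report dictionaries. The report field names (`completionRate`, `errorRate`, `latencyP95Ms`, `adapterType`, `tags`, `compositeScore`) are assumptions that mirror the strategy keys; the actual report schema may differ.

```python
def select(reports, strategy):
    """Filter, rank, and truncate per-skill reports according to a strategy dict."""
    filters = strategy.get("filters", {})

    def passes(r):
        return (
            r["completionRate"] >= filters.get("minCompletionRate", 0.0)
            and r["errorRate"] <= filters.get("maxErrorRate", 1.0)
            and r["latencyP95Ms"] <= filters.get("maxLatencyP95Ms", float("inf"))
            and r["adapterType"] in filters.get("adapterTypes", [r["adapterType"]])
            and set(filters.get("requiredTags", [])) <= set(r["tags"])
        )

    kept = [r for r in reports if passes(r)]
    descending = strategy.get("order", "desc") == "desc"
    kept.sort(key=lambda r: r[strategy.get("sortBy", "compositeScore")], reverse=descending)
    top_k = strategy.get("topK")
    return kept if top_k is None else kept[:top_k]


strategy = {
    "filters": {
        "minCompletionRate": 0.8,
        "maxErrorRate": 0.1,
        "maxLatencyP95Ms": 5000,
        "adapterTypes": ["http", "subprocess"],
        "requiredTags": ["production-ready"],
    },
    "sortBy": "compositeScore",
    "order": "desc",
    "topK": 5,
}
reports = [
    {"id": "a", "completionRate": 0.90, "errorRate": 0.05, "latencyP95Ms": 1200,
     "adapterType": "http", "tags": ["production-ready"], "compositeScore": 0.88},
    {"id": "b", "completionRate": 0.95, "errorRate": 0.02, "latencyP95Ms": 7000,  # too slow
     "adapterType": "http", "tags": ["production-ready"], "compositeScore": 0.80},
    {"id": "c", "completionRate": 0.85, "errorRate": 0.08, "latencyP95Ms": 900,
     "adapterType": "subprocess", "tags": ["production-ready", "math"], "compositeScore": 0.91},
    {"id": "d", "completionRate": 0.99, "errorRate": 0.00, "latencyP95Ms": 500,  # missing tag
     "adapterType": "http", "tags": [], "compositeScore": 0.97},
]
print([r["id"] for r in select(reports, strategy)])  # ['c', 'a']
```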

5. Run Pipeline

Execute the full end-to-end pipeline: Find -> Eval -> Select -> Report in a single command.

eval-skills run \
  --query "math" \
  --benchmark coding-easy \
  --skills-dir ./skills \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown \
  --output-dir ./reports

This command automates the entire process:

  1. Find — scans --skills-dir and optionally filters by --query
  2. Eval — evaluates all candidate skills against --benchmark
  3. Select — filters and ranks results using --min-completion, --top-k, and optional --strategy
  4. Report — generates output files in all requested --formats

6. Generate & Compare Reports

Convert report format

eval-skills report convert \
  --input ./reports/eval-result.json \
  --format html \
  --output ./reports/eval-result.html

Supported output formats: markdown, html.

Diff two reports (regression detection)

eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "v1.0" --label-b "v2.0" \
  --output ./reports/diff.md

Generates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score with directional arrows.
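
The delta table can be pictured with a short Python sketch. It assumes each report reduces to a flat dictionary with the four metrics named above; the key names, the arrow symbols, and the "(regression)" marker are illustrative choices, not the tool's actual output format.

```python
def diff_reports(a, b, label_a="A", label_b="B"):
    """Build markdown delta rows for four metrics of two per-skill reports."""
    # Higher is better for these metrics; for the others, lower is better.
    higher_is_better = {"completionRate", "compositeScore"}
    lines = [
        f"| Metric | {label_a} | {label_b} | Delta |",
        "| --- | --- | --- | --- |",
    ]
    for metric in ("completionRate", "errorRate", "latencyP95Ms", "compositeScore"):
        delta = b[metric] - a[metric]
        arrow = "→" if delta == 0 else ("↑" if delta > 0 else "↓")
        improved = delta == 0 or ((delta > 0) == (metric in higher_is_better))
        flag = "" if improved else " (regression)"
        lines.append(f"| {metric} | {a[metric]} | {b[metric]} | {arrow} {delta:+.3f}{flag} |")
    return "\n".join(lines)


v1 = {"completionRate": 0.9, "errorRate": 0.05, "latencyP95Ms": 1200, "compositeScore": 0.85}
v2 = {"completionRate": 0.8, "errorRate": 0.05, "latencyP95Ms": 1000, "compositeScore": 0.82}
table = diff_reports(v1, v2, "v1.0", "v2.0")
print(table)
```

In this example the completion rate and composite score drop and are flagged, while the lower P95 latency counts as an improvement.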

7. Initialize Project

eval-skills init --dir .

Creates the project scaffold:

  • eval-skills.config.yaml — global configuration
  • skills/ — directory for skill definitions
  • benchmarks/ — directory for benchmark files
  • reports/ — directory for evaluation output

8. Manage Configuration

# List all current configuration values
eval-skills config list

# Get a specific value (supports dot notation)
eval-skills config get llm.model

# Set a value (persisted to ~/.eval-skills/config.yaml)
eval-skills config set concurrency 8
eval-skills config set llm.model gpt-4o
eval-skills config set llm.temperature 0

Configuration is resolved in priority order:

  1. CLI flags (highest priority)
  2. eval-skills.config.yaml in current directory
  3. ~/.eval-skills/config.yaml
  4. Built-in defaults (concurrency: 4, timeoutMs: 30000, outputDir: ./reports)
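
The priority order amounts to a layered merge in which later layers override earlier ones. The Python sketch below illustrates that idea only; `resolve_config` is a hypothetical helper, the merge is shallow (nested keys such as llm.model would need a recursive merge), and how the tool itself loads each layer is not shown.

```python
def resolve_config(cli_flags, project_cfg, user_cfg):
    """Merge configuration layers from lowest to highest priority."""
    defaults = {"concurrency": 4, "timeoutMs": 30000, "outputDir": "./reports"}
    resolved = {}
    # Apply the lowest-priority layer first so that later layers win.
    for layer in (defaults, user_cfg, project_cfg, cli_flags):
        resolved.update({k: v for k, v in layer.items() if v is not None})
    return resolved


config = resolve_config(
    cli_flags={"concurrency": 8, "timeoutMs": None},        # None means "flag not given"
    project_cfg={"concurrency": 2, "timeoutMs": 10000},     # ./eval-skills.config.yaml
    user_cfg={"timeoutMs": 5000, "outputDir": "./out"},     # ~/.eval-skills/config.yaml
)
print(config)  # {'concurrency': 8, 'timeoutMs': 10000, 'outputDir': './out'}
```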

Scorer Types

Each task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.

| Type | Aliases | Description | Score Range |
| --- | --- | --- | --- |
| exact_match | exact | Strict equality comparison. Supports caseSensitive option. | 0 or 1 |
| contains | | Checks for the presence of all specified keywords in the output. Partial credit: matched_keywords / total_keywords. | 0.0 ~ 1.0 |
| json_schema | schema | Validates output against a JSON Schema (using Ajv). | 0 or 1 |
| llm_judge | | Sends the output + expected rubric to an LLM (configurable model) for quality rating. | 0.0 ~ 1.0 |
| custom | | Loads a custom scorer from expectedOutput.customScorerPath. | 0.0 ~ 1.0 |
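
As a rough illustration of the two simplest scorers, the Python sketch below implements strict equality and keyword partial credit. The function names are hypothetical, and details such as comparing values as strings or defaulting keyword matching to case-insensitive are assumptions rather than documented behavior.

```python
def score_exact(actual, expected, case_sensitive=True):
    """Return 1.0 on strict equality of the string forms, else 0.0."""
    a, e = str(actual), str(expected)
    if not case_sensitive:
        a, e = a.lower(), e.lower()
    return 1.0 if a == e else 0.0


def score_contains(actual, keywords, case_sensitive=False):
    """Partial credit: matched_keywords / total_keywords."""
    if not keywords:
        return 0.0  # assumption: an empty keyword list earns no credit
    text = str(actual) if case_sensitive else str(actual).lower()
    hits = sum(1 for k in keywords if (k if case_sensitive else k.lower()) in text)
    return hits / len(keywords)


print(score_exact("5", "5"))  # 1.0
print(score_contains("TypeScript is a superset of JavaScript", ["JavaScript", "Microsoft"]))  # 0.5
```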

Evaluation Metrics

Every evaluation produces a SkillCompletionReport with these metrics:

| Metric | Description | Formula |
| --- | --- | --- |
| Completion Rate | Fraction of tasks that passed | pass_count / total_count |
| Partial Score | Mean score across all tasks | mean(task_scores) |
| Error Rate | Fraction of tasks that errored or timed out | (error_count + timeout_count) / total_count |
| Consistency Score | Stability across multiple runs (requires --runs >= 2) | 1 - stddev(per_run_completion_rates) |
| P50 / P95 / P99 Latency | Response time percentiles | Sorted percentile of latencyMs |
| Composite Score | Weighted overall quality score | 0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER) |
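
The formulas can be checked by hand with a small Python sketch. Several details are assumptions because the source does not specify them: a task passes only with a score of 1.0, percentiles use the nearest-rank method, the standard deviation is the population form, and latP95_norm is P95 divided by a 30000 ms ceiling and capped at 1.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results, max_latency_ms=30000):
    total = len(results)
    completion = sum(r["score"] >= 1.0 for r in results) / total  # assumed pass rule
    error_rate = sum(r["status"] in ("error", "timeout") for r in results) / total
    p95 = percentile([r["latencyMs"] for r in results], 95)
    lat_norm = min(p95 / max_latency_ms, 1.0)  # assumed normalization to [0, 1]

    # Completion rate of each run, used for the consistency score.
    by_run = {}
    for r in results:
        by_run.setdefault(r["run"], []).append(r["score"] >= 1.0)
    per_run = [sum(flags) / len(flags) for flags in by_run.values()]
    consistency = 1 - statistics.pstdev(per_run) if len(per_run) >= 2 else None

    return {
        "completionRate": completion,
        "partialScore": statistics.mean(r["score"] for r in results),
        "errorRate": error_rate,
        "consistencyScore": consistency,
        "latencyP95Ms": p95,
        "compositeScore": 0.5 * completion + 0.2 * (1 - lat_norm) + 0.3 * (1 - error_rate),
    }


rows = [  # (run, score, status, latencyMs)
    (0, 1.0, "ok", 100), (0, 1.0, "ok", 200), (0, 1.0, "ok", 300), (0, 0.0, "error", 400),
    (1, 1.0, "ok", 100), (1, 1.0, "ok", 200), (1, 0.0, "ok", 300), (1, 0.0, "timeout", 15000),
]
results = [dict(zip(("run", "score", "status", "latencyMs"), row)) for row in rows]
report = summarize(results)
print(report)
```

With these eight results, five tasks pass (completion rate 0.625), two error or time out (error rate 0.25), and the per-run completion rates are 0.75 and 0.5, which gives a consistency score of 0.875.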

Built-in Benchmarks

| ID | Domain | Tasks | Scoring | Description |
| --- | --- | ---: | --- | --- |
| coding-easy | coding | 20 | mean / exact_match | Math expressions, string reversal, palindrome detection |
| skill-quality | tool-use | 5 | mean / contains | Metadata completeness, description quality, structure checks |
| web-search-basic | web | 8 | mean / contains + schema | Factual queries, keyword verification, structured output validation |
| gaia-v1 | general | | mean | Placeholder for GAIA benchmark Level 1 tasks |
| toolbench-lite | tool-use | | mean | Placeholder for ToolBench single-tool scenarios |

Custom Benchmark

Create a benchmark.json file:

{
  "id": "my-benchmark",
  "name": "My Custom Benchmark",
  "version": "1.0.0",
  "domain": "general",
  "scoringMethod": "mean",
  "maxLatencyMs": 30000,
  "metadata": { "source": "internal", "lastUpdated": "2026-02-28" },
  "tasks": [
    {
      "id": "task_001",
      "description": "Test basic addition",
      "inputData": { "expression": "2+3" },
      "expectedOutput": { "type": "exact", "value": "5" },
      "evaluator": { "type": "exact" },
      "timeoutMs": 10000,
      "tags": ["math"]
    },
    {
      "id": "task_002",
      "description": "Test keyword presence",
      "inputData": { "query": "TypeScript" },
      "expectedOutput": { "type": "contains", "keywords": ["JavaScript", "Microsoft"] },
      "evaluator": { "type": "contains", "caseSensitive": false },
      "timeoutMs": 15000,
      "tags": ["search"]
    }
  ]
}

Pass the file path to --benchmark to run it:

eval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json
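
Before running a hand-written benchmark, a quick structural check can catch typos early. The Python sketch below is a hypothetical helper, not part of eval-skills: it flags duplicate task ids and evaluator types outside the names and aliases listed under Scorer Types. The --dry-run flag remains the authoritative way to validate a configuration.

```python
KNOWN_EVALUATORS = {
    "exact_match", "exact", "contains", "json_schema", "schema", "llm_judge", "custom",
}


def lint_benchmark(benchmark):
    """Return a list of human-readable problems found in a benchmark dict."""
    problems = []
    tasks = benchmark.get("tasks", [])
    if not tasks:
        problems.append("benchmark has no tasks")
    seen = set()
    for task in tasks:
        task_id = task.get("id")
        if task_id in seen:
            problems.append(f"duplicate task id: {task_id}")
        seen.add(task_id)
        evaluator = task.get("evaluator", {}).get("type")
        if evaluator not in KNOWN_EVALUATORS:
            problems.append(f"{task_id}: unknown evaluator type {evaluator!r}")
    return problems


benchmark = {
    "id": "my-benchmark",
    "tasks": [
        {"id": "task_001", "evaluator": {"type": "exact"}},
        {"id": "task_001", "evaluator": {"type": "contians"}},  # duplicate id and a typo
    ],
}
for problem in lint_benchmark(benchmark):
    print(problem)
```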

Adapter Types

Skills communicate through adapters. The adapter type is specified in skill.json via adapterType.

| Adapter | Protocol | How it works | Key config |
| --- | --- | --- | --- |
| http | REST POST | Sends POST { skillId, version, input } to skill.entrypoint. Supports Bearer / API-Key auth via env vars. | baseUrl, authType, authTokenEnvKey |
| subprocess | JSON-RPC 2.0 over stdin/stdout | Spawns skill.entrypoint (e.g. python3 skill.py), writes JSON-RPC request to stdin, reads response from stdout. | command, args |
| mcp | MCP Protocol | (Phase 2) Native Model Context Protocol integration via @modelcontextprotocol/sdk. | |
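
For the subprocess adapter, a skill.py could look roughly like the sketch below. It follows the JSON-RPC 2.0 envelope (jsonrpc, id, result or error), but the method name "invoke", the params shape, and the one-request-per-line framing are assumptions; the skeleton generated by the python_script template should be treated as the reference.

```python
import io
import json


def evaluate_expression(params):
    # Toy skill logic: add the two integers in an "a+b" expression.
    left, right = params["expression"].split("+")
    return {"result": str(int(left) + int(right))}


def handle_line(line):
    """Turn one JSON-RPC 2.0 request line into one response line."""
    request_id = None
    try:
        request = json.loads(line)
        request_id = request.get("id")
        result = evaluate_expression(request.get("params", {}))
        response = {"jsonrpc": "2.0", "id": request_id, "result": result}
    except Exception as exc:
        # -32603 is the JSON-RPC 2.0 "Internal error" code; a fuller
        # implementation would report parse errors separately as -32700.
        response = {"jsonrpc": "2.0", "id": request_id,
                    "error": {"code": -32603, "message": str(exc)}}
    return json.dumps(response)


def serve(stdin, stdout):
    """Answer each non-empty request line read from stdin on stdout."""
    for line in stdin:
        if line.strip():
            stdout.write(handle_line(line) + "\n")
            stdout.flush()


# Demo with in-memory streams; a real skill.py would call serve(sys.stdin, sys.stdout).
fake_in = io.StringIO(
    '{"jsonrpc": "2.0", "id": 1, "method": "invoke", "params": {"expression": "2+3"}}\n'
)
fake_out = io.StringIO()
serve(fake_in, fake_out)
print(fake_out.getvalue().strip())
```

Keeping the request handling in `handle_line` makes the skill testable without spawning a process, which is convenient when debugging with --concurrency 1.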

Workflow Examples

Evaluating a Single Skill

# 1. Create a skill skeleton
eval-skills create --name my_calc --from-template python_script

# 2. Implement your logic in skills/my_calc/skill.py

# 3. Run evaluation against the coding-easy benchmark
eval-skills eval \
  --skills ./skills/my_calc/skill.json \
  --benchmark coding-easy \
  --runs 3 \
  --format json markdown

# 4. Review the report
cat ./reports/eval-result-*.md

Comparing Multiple Candidate Skills

# 1. Discover candidates
eval-skills find --query "weather" --skills-dir ./skills

# 2. Evaluate all candidates on the same benchmark
eval-skills eval \
  --skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \
  --benchmark web-search-basic \
  --runs 3

# 3. Select the best
eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result-*.json \
  --min-completion 0.8 \
  --top-k 2

# 4. Compare two versions
eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "weather_v1" --label-b "weather_v2"

Full Pipeline (One Command)

eval-skills run \
  --skills-dir ./skills \
  --benchmark coding-easy \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown html \
  --output-dir ./reports

CI/CD Quality Gate

# In your CI pipeline — fail the build if completion rate drops below 80%
eval-skills eval \
  --skills ./skills/production_skill \
  --benchmark coding-easy \
  --exit-on-fail \
  --min-completion 0.8 \
  --format json
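
The gate itself is a simple threshold comparison. The Python sketch below shows the decision that --exit-on-fail is described as making, using a hypothetical mapping of skill id to completion rate; how the CLI derives those rates from its reports is not shown here.

```python
def quality_gate(completion_rates, min_completion=0.7):
    """Return (exit_code, failing_ids) for a mapping of skill id to completion rate."""
    failing = sorted(k for k, rate in completion_rates.items() if rate < min_completion)
    return (1 if failing else 0), failing


code, failing = quality_gate(
    {"production_skill": 0.75, "helper_skill": 0.92}, min_completion=0.8
)
print(code, failing)  # 1 ['production_skill']
```

A CI wrapper script would pass `code` to `sys.exit` so that a non-zero status blocks the merge.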

Regression Detection

# Compare today's evaluation against the baseline
eval-skills report diff \
  ./reports/baseline.json ./reports/latest.json \
  --label-a "baseline" --label-b "latest" \
  --output ./reports/regression-check.md

Best Practices

  1. Always use --runs 3 or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs.
  2. Use --exit-on-fail in CI/CD pipelines to enforce quality gates. Set --min-completion to your acceptable threshold (recommended: 0.8 for production skills).
  3. Create domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect real-world inputs your skill will encounter.
  4. Use report diff after every skill upgrade to catch regressions early. Compare the new evaluation against a saved baseline report.
  5. Use --dry-run before long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks.
  6. Persist results with --store to track skill quality over time. The SQLite store enables historical trend queries.
  7. Start with --concurrency 1 when debugging a failing skill, then increase for production benchmarking.
  8. Tag your benchmark tasks to enable per-category analysis (e.g., filter by math, string, edge-case).

Skill JSON Schema

Every skill must provide a skill.json that conforms to this structure:

{
  "id": "my_skill_v1",
  "name": "My Skill",
  "version": "1.0.0",
  "description": "Does something useful",
  "tags": ["utility", "math"],
  "inputSchema": {
    "type": "object",
    "properties": { "query": { "type": "string" } },
    "required": ["query"]
  },
  "outputSchema": {
    "type": "object",
    "properties": { "result": { "type": "string" } }
  },
  "adapterType": "subprocess",
  "entrypoint": "python3 skill.py",
  "metadata": {
    "author": "Your Name",
    "license": "MIT",
    "homepage": "https://github.com/you/my-skill"
  }
}

Validation rules:

  • id: lowercase alphanumeric with _ or -, non-empty
  • version: semver format (X.Y.Z)
  • adapterType: one of http, subprocess, mcp, langchain, custom
  • entrypoint: non-empty string (URL for http, command for subprocess)
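
These four rules translate directly into a small validator. The Python sketch below is one possible reading of them: the regular expressions are assumptions about what "lowercase alphanumeric" and "X.Y.Z" mean precisely, and `validate_skill` is a hypothetical name.

```python
import re

ID_RE = re.compile(r"[a-z0-9_-]+")        # lowercase alphanumeric with _ or -
SEMVER_RE = re.compile(r"\d+\.\d+\.\d+")  # plain X.Y.Z, no pre-release suffix
ADAPTER_TYPES = {"http", "subprocess", "mcp", "langchain", "custom"}


def validate_skill(skill):
    """Return a list of rule violations for a parsed skill.json dict."""
    errors = []
    if not ID_RE.fullmatch(str(skill.get("id", ""))):
        errors.append("id: must be non-empty lowercase alphanumeric with _ or -")
    if not SEMVER_RE.fullmatch(str(skill.get("version", ""))):
        errors.append("version: must be semver (X.Y.Z)")
    if skill.get("adapterType") not in ADAPTER_TYPES:
        errors.append("adapterType: must be one of " + ", ".join(sorted(ADAPTER_TYPES)))
    entrypoint = skill.get("entrypoint")
    if not (isinstance(entrypoint, str) and entrypoint.strip()):
        errors.append("entrypoint: must be a non-empty string")
    return errors


good = {"id": "my_skill_v1", "version": "1.0.0",
        "adapterType": "subprocess", "entrypoint": "python3 skill.py"}
bad = {"id": "My Skill", "version": "1.0", "adapterType": "grpc", "entrypoint": ""}
print(validate_skill(good))      # []
print(len(validate_skill(bad)))  # 4
```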

Global Options

These options are available on all commands:

| Option | Description |
| --- | --- |
| -c, --config | Path to configuration file |
| --json | JSON output format (CI-friendly) |
| --no-color | Disable colored output |
| -v, --verbose | Verbose logging |
| --version | Show version |
| -h, --help | Show help |

Version History

1 version in total

  • v0.1.1 (current)
    2026-03-30 02:54, security status: safe

Security Checks

Tencent Cloud Security (Keen): safe, no risk

Tencent Cloud Security (Sanbu): safe, no risk
