eval-skills

AI Agent Skill unit testing framework — a framework-agnostic toolkit for discovering, scaffolding, selecting, evaluating, and reporting on AI skills.

This skill fills the L1 (Skill Unit Test) gap that LangSmith / DeepEval leave open: while those platforms focus on agent-level and trajectory-level evaluation (L2-L3), eval-skills targets the individual skill level, ensuring each building block meets quality standards before it ever enters an agent pipeline.

When to Use This Skill

Before deploying a new skill to production — run eval to verify it meets your quality gate.
When choosing between multiple candidate skills — run select to rank them on the same benchmark.
When a skill is upgraded — run report diff to detect regressions.
In CI/CD — use --exit-on-fail to block merges that degrade skill quality.
When bootstrapping a new skill — run create to generate a ready-to-fill skeleton.

Capabilities

1. Find Skills

Search for existing skills by keyword, tag, or adapter type.

eval-skills find \
  --query "web search" \
  --tag retrieval api \
  --adapter http \
  --min-completion 0.8 \
  --skills-dir ./skills \
  --limit 10

Option	Description	Default
--------	-------------	---------
`-q, --query`	Keyword search (matches name, description, tags)	—
`-t, --tag`	Filter by tags (intersection: skill must have ALL specified tags)	—
`-a, --adapter`	Filter by adapter type (`http`, `subprocess`, `mcp`)	—
`--min-completion`	Minimum historical completion rate (0.0 ~ 1.0)	—
`--skills-dir`	Directory to scan for `skill.json` files	`./skills`
`--limit`	Maximum number of results	`20`

Results are ranked by search relevance (when --query is provided) or by historical completion rate (descending).
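The filtering and ranking rules above can be sketched in a few lines. This is only an illustration of the described semantics (tag intersection, adapter match, minimum completion rate, descending sort, limit), not the tool's actual implementation; `SkillEntry` and `find_skills` are hypothetical names, and keyword relevance scoring is left out.

```python
from dataclasses import dataclass, field


@dataclass
class SkillEntry:
    # Hypothetical in-memory view of a skill.json plus its historical stats.
    name: str
    tags: set = field(default_factory=set)
    adapter: str = "http"
    completion_rate: float = 0.0


def find_skills(skills, tags=(), adapter=None, min_completion=None, limit=20):
    """Filter skills, then rank by historical completion rate (descending)."""
    required = set(tags)
    matches = [
        s for s in skills
        if required <= s.tags  # intersection: skill must carry ALL tags
        and (adapter is None or s.adapter == adapter)
        and (min_completion is None or s.completion_rate >= min_completion)
    ]
    matches.sort(key=lambda s: s.completion_rate, reverse=True)
    return matches[:limit]


catalog = [
    SkillEntry("web_search", {"retrieval", "api"}, "http", 0.92),
    SkillEntry("doc_lookup", {"retrieval"}, "http", 0.95),
    SkillEntry("news_api", {"retrieval", "api"}, "http", 0.85),
    SkillEntry("slow_api", {"retrieval", "api"}, "http", 0.60),
]
result = find_skills(
    catalog, tags=["retrieval", "api"], adapter="http", min_completion=0.8
)
print([s.name for s in result])  # ['web_search', 'news_api']
```

`doc_lookup` is dropped because it lacks the `api` tag, and `slow_api` because its completion rate is below the threshold.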

2. Create Skills

Generate a skill skeleton from a template to bootstrap development.

eval-skills create \
  --name my_api_skill \
  --from-template http_request \
  --output-dir ./skills \
  --description "Fetches weather data from OpenWeather API"

Option	Description	Default
--------	-------------	---------
`--name`	Required. Skill name	—
`--from-template`	Template type: `http_request`, `python_script`, `mcp_tool`	`http_request`
`--output-dir`	Output directory	`./skills`
`--description`	Human-readable description embedded in `skill.json`	—

Generated file structure:

skills/my_api_skill/
  skill.json            # Skill metadata (id, schemas, adapter config)
  adapter.config.json   # Adapter-specific configuration
  tests/
    basic.eval.json     # A starter benchmark with one sample task
  skill.py              # (python_script template only) JSON-RPC entrypoint

3. Evaluate Skills

Run benchmark evaluations against one or more skills. This is the core command.

eval-skills eval \
  --skills ./skills/calculator/skill.json ./skills/search/ \
  --benchmark coding-easy \
  --concurrency 4 \
  --timeout 30000 \
  --retries 2 \
  --runs 3 \
  --evaluator exact \
  --format json markdown html \
  --output-dir ./reports \
  --exit-on-fail --min-completion 0.8 \
  --store ./eval-skills.db

Option	Description	Default
--------	-------------	---------
`--skills`	Required. Skill file(s) or directory(ies)	—
`--benchmark <id or path>`	Built-in benchmark ID or path to `benchmark.json`	`coding-easy`
`--tasks`	Custom tasks JSON file (replaces benchmark)	—
`--concurrency`	Number of parallel task executions	`4`
`--timeout`	Per-task timeout in milliseconds	`30000`
`--retries`	Retry count on task failure (with incremental backoff)	`0`
`--runs`	Repeat evaluation N times for consistency scoring	`1`
`--evaluator`	Default scorer type (see Scorer Types below)	`exact`
`--format`	Output formats: `json`, `markdown`, `html`	`json markdown`
`--output-dir`	Report output directory	`./reports`
`--exit-on-fail`	Exit with code 1 if any skill falls below threshold	disabled
`--min-completion`	Threshold for `--exit-on-fail`	`0.7`
`--dry-run`	Validate configuration only; do not execute tasks	disabled
`--benchmarks-dir`	Directory containing built-in benchmarks	`./benchmarks`
`--store`	SQLite database path for persistent result storage	`./eval-skills.db`
`-c, --config`	Path to `eval-skills.config.yaml`	auto-detected

Evaluation flow:

1. Load skills from --skills paths (supports both single skill.json and directories)
2. Load benchmark tasks from --benchmark or --tasks
3. Build the cartesian product: skills x tasks x runs
4. Execute all task items concurrently (controlled by --concurrency, with timeout and retry)
5. Score each result using the appropriate scorer
6. Aggregate into SkillCompletionReport per skill
7. Write reports to --output-dir
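As a rough sketch of this flow, the snippet below builds the skills x tasks x runs product, executes the work items on a bounded thread pool, and aggregates a completion rate per skill. It is an illustration under simplifying assumptions: skills are plain Python callables instead of adapters, scoring is strict equality, and timeouts, retries, and report writing are omitted. All names (`run_task`, `evaluate`) are hypothetical.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def run_task(skill, task, run_index):
    # Stand-in for invoking a skill through its adapter.
    output = skill["fn"](task["input"])
    return {
        "skill": skill["id"],
        "task": task["id"],
        "run": run_index,
        "passed": output == task["expected"],
    }


def evaluate(skills, tasks, runs=1, concurrency=4):
    # Cartesian product: every skill runs every task, `runs` times.
    work_items = list(itertools.product(skills, tasks, range(runs)))
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda item: run_task(*item), work_items))

    # Aggregate pass counts into a completion rate per skill.
    totals = {}
    for r in results:
        stats = totals.setdefault(r["skill"], {"passed": 0, "total": 0})
        stats["total"] += 1
        stats["passed"] += int(r["passed"])
    return {sid: s["passed"] / s["total"] for sid, s in totals.items()}


skills = [
    {"id": "adder", "fn": lambda x: x["a"] + x["b"]},
    {"id": "broken_adder", "fn": lambda x: x["a"] - x["b"]},
]
tasks = [
    {"id": "t1", "input": {"a": 2, "b": 3}, "expected": 5},
    {"id": "t2", "input": {"a": 1, "b": 0}, "expected": 1},
]
completion = evaluate(skills, tasks, runs=3, concurrency=4)
print(completion)  # {'adder': 1.0, 'broken_adder': 0.5}
```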

4. Select Skills

Filter and rank skills based on evaluation reports using a multi-dimensional strategy.

eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result.json \
  --strategy ./strategy.yaml \
  --min-completion 0.8 \
  --top-k 5 \
  --output ./selected.json

Option	Description	Default
--------	-------------	---------
`--from`	Required. Candidate skills directory or JSON file	—
`--reports`	Evaluation reports JSON file	—
`--strategy`	SelectStrategy YAML/JSON file	built-in default
`--min-completion`	Override minimum completion rate filter	—
`--top-k`	Return only the top K results	all
`--output`	Write selected skills to file	stdout

Selection pipeline: Filter (by completion rate, error rate, latency, adapter type, required tags) -> Score -> Rank (by compositeScore, completionRate, latency, or tokenCost) -> TopK

Example strategy.yaml:

filters:
  minCompletionRate: 0.8
  maxErrorRate: 0.1
  maxLatencyP95Ms: 5000
  adapterTypes: [http, subprocess]
  requiredTags: [production-ready]
sortBy: compositeScore
order: desc
topK: 5
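The Filter -> Rank -> TopK stages can be pictured as follows. This is a minimal sketch, assuming each report is a flat dict; the report field names (`completionRate`, `errorRate`, `latencyP95Ms`, `adapterType`, `tags`, `compositeScore`) are guesses modeled on the strategy keys, not a documented report schema, and the scoring stage is assumed to have already produced `compositeScore`.

```python
def select_skills(reports, strategy):
    """Apply the strategy's filters, sort, and keep the top K reports."""
    filters = strategy.get("filters", {})
    allowed_adapters = filters.get("adapterTypes")
    required_tags = set(filters.get("requiredTags", []))

    def passes(report):
        return (
            report["completionRate"] >= filters.get("minCompletionRate", 0.0)
            and report["errorRate"] <= filters.get("maxErrorRate", 1.0)
            and report["latencyP95Ms"] <= filters.get("maxLatencyP95Ms", float("inf"))
            and (not allowed_adapters or report["adapterType"] in allowed_adapters)
            and required_tags <= set(report["tags"])
        )

    candidates = [r for r in reports if passes(r)]
    sort_key = strategy.get("sortBy", "compositeScore")
    descending = strategy.get("order", "desc") == "desc"
    candidates.sort(key=lambda r: r[sort_key], reverse=descending)
    top_k = strategy.get("topK")
    return candidates[:top_k] if top_k else candidates


# Same values as the example strategy.yaml, written as a dict.
strategy = {
    "filters": {
        "minCompletionRate": 0.8,
        "maxErrorRate": 0.1,
        "maxLatencyP95Ms": 5000,
        "adapterTypes": ["http", "subprocess"],
        "requiredTags": ["production-ready"],
    },
    "sortBy": "compositeScore",
    "order": "desc",
    "topK": 5,
}
reports = [
    {"id": "a", "completionRate": 0.90, "errorRate": 0.05, "latencyP95Ms": 1200,
     "adapterType": "http", "tags": ["production-ready"], "compositeScore": 0.85},
    {"id": "b", "completionRate": 0.95, "errorRate": 0.02, "latencyP95Ms": 800,
     "adapterType": "subprocess", "tags": ["production-ready"], "compositeScore": 0.91},
    {"id": "c", "completionRate": 0.99, "errorRate": 0.00, "latencyP95Ms": 300,
     "adapterType": "mcp", "tags": ["production-ready"], "compositeScore": 0.97},
    {"id": "d", "completionRate": 0.70, "errorRate": 0.05, "latencyP95Ms": 900,
     "adapterType": "http", "tags": ["production-ready"], "compositeScore": 0.75},
]
selected = [r["id"] for r in select_skills(reports, strategy)]
print(selected)  # ['b', 'a']
```

Skill `c` is excluded by the adapter filter and `d` by the completion-rate filter, even though `c` has the highest composite score.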

5. Run Pipeline

Execute the full end-to-end pipeline: Find -> Eval -> Select -> Report in a single command.

eval-skills run \
  --query "math" \
  --benchmark coding-easy \
  --skills-dir ./skills \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown \
  --output-dir ./reports

This command automates the entire process:

Find — scans --skills-dir and optionally filters by --query
Eval — evaluates all candidate skills against --benchmark
Select — filters and ranks results using --min-completion, --top-k, and optional --strategy
Report — generates output files in all requested --formats

6. Generate & Compare Reports

Convert report format

eval-skills report convert \
  --input ./reports/eval-result.json \
  --format html \
  --output ./reports/eval-result.html

Supported output formats: markdown, html.

Diff two reports (regression detection)

eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "v1.0" --label-b "v2.0" \
  --output ./reports/diff.md

Generates a side-by-side delta table per skill showing changes in completion rate, error rate, P95 latency, and composite score with directional arrows.
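To make the delta computation concrete, here is a small sketch of how one row of such a table could be derived. The row layout, the `diff_row` helper, and the "improved / regressed" verdict are illustrative assumptions; the actual output format of `report diff` may differ.

```python
def diff_row(metric, old, new, higher_is_better=True):
    """Format one metric delta with a directional arrow."""
    delta = new - old
    arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
    improved = delta > 0 if higher_is_better else delta < 0
    verdict = "unchanged" if delta == 0 else ("improved" if improved else "regressed")
    return f"{metric}: {old:.3f} -> {new:.3f} ({delta:+.3f} {arrow}, {verdict})"


baseline = {"completionRate": 0.90, "errorRate": 0.05}
latest = {"completionRate": 0.85, "errorRate": 0.02}
rows = [
    diff_row("completionRate", baseline["completionRate"], latest["completionRate"]),
    # For error rate (and latency), lower values are better.
    diff_row("errorRate", baseline["errorRate"], latest["errorRate"],
             higher_is_better=False),
]
for row in rows:
    print(row)
```

Note that an arrow alone is ambiguous: a downward arrow is a regression for completion rate but an improvement for error rate, which is why the sketch tracks the direction of "better" per metric.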

7. Initialize Project

eval-skills init --dir .

Creates the project scaffold:

eval-skills.config.yaml — global configuration
skills/ — directory for skill definitions
benchmarks/ — directory for benchmark files
reports/ — directory for evaluation output

8. Manage Configuration

# List all current configuration values
eval-skills config list

# Get a specific value (supports dot notation)
eval-skills config get llm.model

# Set a value (persisted to ~/.eval-skills/config.yaml)
eval-skills config set concurrency 8
eval-skills config set llm.model gpt-4o
eval-skills config set llm.temperature 0

Configuration is resolved in priority order:

1. CLI flags (highest priority)
2. eval-skills.config.yaml in current directory
3. ~/.eval-skills/config.yaml
4. Built-in defaults (concurrency: 4, timeoutMs: 30000, outputDir: ./reports)
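One way to picture this resolution order is as a layered merge in which later layers override earlier ones. The sketch below assumes nested sections such as `llm` are merged key by key rather than replaced wholesale; that behavior, and the helper names `deep_merge`, `resolve_config`, and `get_path`, are assumptions for illustration.

```python
def deep_merge(base, override):
    """Return a copy of `base` with `override` applied, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_config(defaults, user_config, project_config, cli_flags):
    # Apply layers from lowest to highest priority.
    resolved = {}
    for layer in (defaults, user_config, project_config, cli_flags):
        resolved = deep_merge(resolved, layer)
    return resolved


def get_path(config, dotted_key):
    """Look up a value using dot notation, e.g. 'llm.model'."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


defaults = {"concurrency": 4, "timeoutMs": 30000, "outputDir": "./reports"}
user_config = {"concurrency": 8, "llm": {"model": "gpt-4o", "temperature": 0}}
project_config = {"llm": {"temperature": 0.2}}
cli_flags = {"concurrency": 1}

config = resolve_config(defaults, user_config, project_config, cli_flags)
print(config["concurrency"])                # 1 (CLI flag wins)
print(get_path(config, "llm.model"))        # gpt-4o (from the user config)
print(get_path(config, "llm.temperature"))  # 0.2 (project config overrides user)
```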

Scorer Types

Each task in a benchmark specifies an evaluator type. The scorer compares the skill's actual output against the expected output.

Type	Aliases	Description	Score Range
------	---------	-------------	-------------
`exact_match`	`exact`	Strict equality comparison. Supports `caseSensitive` option.	0 or 1
`contains`	—	Checks for the presence of all specified keywords in the output. Partial credit: `matched_keywords / total_keywords`.	0.0 ~ 1.0
`json_schema`	`schema`	Validates output against a JSON Schema (using Ajv).	0 or 1
`llm_judge`	—	Sends the output + expected rubric to an LLM (configurable model) for quality rating.	0.0 ~ 1.0
`custom`	—	Loads a custom scorer from `expectedOutput.customScorerPath`.	0.0 ~ 1.0
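The two simplest scorers can be sketched as below. This is an approximation of the behavior described in the table (strict equality with an optional case-sensitivity switch, and `matched_keywords / total_keywords` partial credit), not the shipped scorer code; the function names and the choice to compare values as strings are assumptions.

```python
def score_exact(actual, expected, case_sensitive=True):
    """Return 1.0 on strict equality of the string forms, else 0.0."""
    a, e = str(actual), str(expected)
    if not case_sensitive:
        a, e = a.lower(), e.lower()
    return 1.0 if a == e else 0.0


def score_contains(actual, keywords, case_sensitive=False):
    """Return the fraction of keywords found in the output."""
    if not keywords:
        return 0.0
    text = actual if case_sensitive else actual.lower()
    matched = sum(
        1 for k in keywords
        if (k if case_sensitive else k.lower()) in text
    )
    return matched / len(keywords)


print(score_exact("5", "5"))                              # 1.0
print(score_exact("Hello", "hello"))                      # 0.0
print(score_exact("Hello", "hello", case_sensitive=False))  # 1.0
print(score_contains(
    "TypeScript is a typed superset of JavaScript.",
    ["JavaScript", "Microsoft"],
))                                                        # 0.5
```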

Evaluation Metrics

Every evaluation produces a SkillCompletionReport with these metrics:

Metric	Description	Formula
--------	-------------	---------
Completion Rate	Fraction of tasks that passed	`pass_count / total_count`
Partial Score	Mean score across all tasks	`mean(task_scores)`
Error Rate	Fraction of tasks that errored or timed out	`(error_count + timeout_count) / total_count`
Consistency Score	Stability across multiple runs (requires `--runs >= 2`)	`1 - stddev(per_run_completion_rates)`
P50 / P95 / P99 Latency	Response time percentiles	Sorted percentile of `latencyMs`
Composite Score	Weighted overall quality score	`0.5 * CR + 0.2 * (1 - latP95_norm) + 0.3 * (1 - ER)`
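A worked example may help when reading these formulas. The sketch below computes each metric from a list of per-task statuses and latencies. Several details are assumptions, since the table does not specify them: population standard deviation is used for consistency, percentiles use the nearest-rank method, and the normalized P95 latency is passed in directly because its normalization is not described here.

```python
import statistics


def completion_rate(statuses):
    return statuses.count("pass") / len(statuses)


def error_rate(statuses):
    return (statuses.count("error") + statuses.count("timeout")) / len(statuses)


def consistency_score(per_run_completion_rates):
    # Assumption: population standard deviation across runs.
    return 1 - statistics.pstdev(per_run_completion_rates)


def percentile(latencies_ms, pct):
    # Assumption: nearest-rank percentile over the sorted latencies.
    ordered = sorted(latencies_ms)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def composite_score(cr, er, lat_p95_norm):
    return 0.5 * cr + 0.2 * (1 - lat_p95_norm) + 0.3 * (1 - er)


statuses = ["pass"] * 8 + ["fail", "error"]
cr = completion_rate(statuses)                # 0.8
er = error_rate(statuses)                     # 0.1
latencies = [10 * i for i in range(1, 21)]    # 10, 20, ..., 200 ms
p95 = percentile(latencies, 95)               # 190
score = composite_score(cr, er, lat_p95_norm=0.25)
print(cr, er, p95, round(score, 2))           # 0.8 0.1 190 0.82
```

With these inputs the composite score is 0.5 * 0.8 + 0.2 * 0.75 + 0.3 * 0.9 = 0.82.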

Built-in Benchmarks

ID	Domain	Tasks	Scoring	Description
----	--------	------:	---------	-------------
`coding-easy`	coding	20	mean / exact_match	Math expressions, string reversal, palindrome detection
`skill-quality`	tool-use	5	mean / contains	Metadata completeness, description quality, structure checks
`web-search-basic`	web	8	mean / contains + schema	Factual queries, keyword verification, structured output validation
`gaia-v1`	general	—	mean	Placeholder for GAIA benchmark Level 1 tasks
`toolbench-lite`	tool-use	—	mean	Placeholder for ToolBench single-tool scenarios

Custom Benchmark

Create a benchmark.json file:

{
  "id": "my-benchmark",
  "name": "My Custom Benchmark",
  "version": "1.0.0",
  "domain": "general",
  "scoringMethod": "mean",
  "maxLatencyMs": 30000,
  "metadata": { "source": "internal", "lastUpdated": "2026-02-28" },
  "tasks": [
    {
      "id": "task_001",
      "description": "Test basic addition",
      "inputData": { "expression": "2+3" },
      "expectedOutput": { "type": "exact", "value": "5" },
      "evaluator": { "type": "exact" },
      "timeoutMs": 10000,
      "tags": ["math"]
    },
    {
      "id": "task_002",
      "description": "Test keyword presence",
      "inputData": { "query": "TypeScript" },
      "expectedOutput": { "type": "contains", "keywords": ["JavaScript", "Microsoft"] },
      "evaluator": { "type": "contains", "caseSensitive": false },
      "timeoutMs": 15000,
      "tags": ["search"]
    }
  ]
}

eval-skills eval --skills ./my-skill/ --benchmark ./my-benchmark.json

Adapter Types

Skills communicate through adapters. The adapter type is specified in skill.json via adapterType.

Adapter	Protocol	How it works	Key config
---------	----------	-------------	------------
`http`	REST POST	Sends `POST { skillId, version, input }` to `skill.entrypoint`. Supports Bearer / API-Key auth via env vars.	`baseUrl`, `authType`, `authTokenEnvKey`
`subprocess`	JSON-RPC 2.0 over stdin/stdout	Spawns `skill.entrypoint` (e.g. `python3 skill.py`), writes JSON-RPC request to stdin, reads response from stdout.	`command`, `args`
`mcp`	MCP Protocol	(Phase 2) Native Model Context Protocol integration via `@modelcontextprotocol/sdk`.	—
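For the subprocess adapter, a skill entrypoint has to read a JSON-RPC 2.0 request from stdin and write a response to stdout. The following is a minimal sketch of such a handler, assuming one request per line and a `params` object that carries the task input; the exact method name and payload shape sent by eval-skills are not specified above, so treat them as placeholders. `run_skill` is hypothetical skill logic.

```python
import io
import json


def run_skill(params):
    # Hypothetical skill logic: add two numbers from the request params.
    return {"result": params["a"] + params["b"]}


def handle_request(raw_line):
    """Turn one JSON-RPC 2.0 request line into a response dict."""
    try:
        request = json.loads(raw_line)
    except json.JSONDecodeError:
        return {"jsonrpc": "2.0", "id": None,
                "error": {"code": -32700, "message": "Parse error"}}
    try:
        result = run_skill(request.get("params", {}))
        return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
    except Exception as exc:  # report skill failures as JSON-RPC errors
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32000, "message": str(exc)}}


def serve(stdin, stdout):
    # One request per line in, one response per line out.
    for line in stdin:
        if line.strip():
            stdout.write(json.dumps(handle_request(line)) + "\n")
            stdout.flush()


# Demonstration with in-memory streams instead of real stdin/stdout.
fake_stdin = io.StringIO(
    '{"jsonrpc": "2.0", "id": 1, "method": "invoke", "params": {"a": 2, "b": 3}}\n'
)
fake_stdout = io.StringIO()
serve(fake_stdin, fake_stdout)
response = json.loads(fake_stdout.getvalue())
print(response)  # {'jsonrpc': '2.0', 'id': 1, 'result': {'result': 5}}
```

In a real `skill.py`, the last step would be `serve(sys.stdin, sys.stdout)` after importing `sys`. Keeping diagnostic output on stderr avoids corrupting the response stream.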

Workflow Examples

Evaluating a Single Skill

# 1. Create a skill skeleton
eval-skills create --name my_calc --from-template python_script

# 2. Implement your logic in skills/my_calc/skill.py

# 3. Run evaluation against the coding-easy benchmark
eval-skills eval \
  --skills ./skills/my_calc/skill.json \
  --benchmark coding-easy \
  --runs 3 \
  --format json markdown

# 4. Review the report
cat ./reports/eval-result-*.md

Comparing Multiple Candidate Skills

# 1. Discover candidates
eval-skills find --query "weather" --skills-dir ./skills

# 2. Evaluate all candidates on the same benchmark
eval-skills eval \
  --skills ./skills/weather_v1 ./skills/weather_v2 ./skills/weather_v3 \
  --benchmark web-search-basic \
  --runs 3

# 3. Select the best
eval-skills select \
  --from ./skills \
  --reports ./reports/eval-result-*.json \
  --min-completion 0.8 \
  --top-k 2

# 4. Compare two versions
eval-skills report diff \
  ./reports/v1.json ./reports/v2.json \
  --label-a "weather_v1" --label-b "weather_v2"

Full Pipeline (One Command)

eval-skills run \
  --skills-dir ./skills \
  --benchmark coding-easy \
  --top-k 3 \
  --min-completion 0.7 \
  --format json markdown html \
  --output-dir ./reports

CI/CD Quality Gate

# In your CI pipeline — fail the build if completion rate drops below 80%
eval-skills eval \
  --skills ./skills/production_skill \
  --benchmark coding-easy \
  --exit-on-fail \
  --min-completion 0.8 \
  --format json
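The gate logic itself is simple: if any skill's completion rate falls below the threshold, the process exits with code 1. A sketch of that decision, using a hypothetical `quality_gate` helper and a plain mapping of skill IDs to completion rates instead of the real report format:

```python
def quality_gate(completion_by_skill, min_completion=0.7):
    """Return (exit_code, failing_skill_ids) for the given threshold."""
    failing = sorted(
        skill_id
        for skill_id, rate in completion_by_skill.items()
        if rate < min_completion
    )
    return (1 if failing else 0), failing


exit_code, failing = quality_gate(
    {"calculator": 0.95, "search": 0.75}, min_completion=0.8
)
print(exit_code, failing)  # 1 ['search']
```

A skill that lands exactly on the threshold passes in this sketch, since only rates strictly below it count as failures.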

Regression Detection

# Compare today's evaluation against the baseline
eval-skills report diff \
  ./reports/baseline.json ./reports/latest.json \
  --label-a "baseline" --label-b "latest" \
  --output ./reports/regression-check.md

Best Practices

Always use --runs 3 or more when evaluating for production decisions. Single-run results can be noisy; the consistency score captures stability across runs.

Use --exit-on-fail in CI/CD pipelines to enforce quality gates. Set --min-completion to your acceptable threshold (recommended: 0.8 for production skills).

Create domain-specific custom benchmarks rather than relying solely on built-in ones. Your custom benchmark should reflect real-world inputs your skill will encounter.

Use report diff after every skill upgrade to catch regressions early. Compare the new evaluation against a saved baseline report.

Use --dry-run before long evaluations to validate your configuration (skill paths, benchmark resolution, task count) without actually executing tasks.

Persist results with --store to track skill quality over time. The SQLite store enables historical trend queries.

Start with --concurrency 1 when debugging a failing skill, then increase for production benchmarking.

Tag your benchmark tasks to enable per-category analysis (e.g., filter by math, string, edge-case).

Skill JSON Schema

Every skill must provide a skill.json that conforms to this structure:

{
  "id": "my_skill_v1",
  "name": "My Skill",
  "version": "1.0.0",
  "description": "Does something useful",
  "tags": ["utility", "math"],
  "inputSchema": {
    "type": "object",
    "properties": { "query": { "type": "string" } },
    "required": ["query"]
  },
  "outputSchema": {
    "type": "object",
    "properties": { "result": { "type": "string" } }
  },
  "adapterType": "subprocess",
  "entrypoint": "python3 skill.py",
  "metadata": {
    "author": "Your Name",
    "license": "MIT",
    "homepage": "https://github.com/you/my-skill"
  }
}

Validation rules:

id: lowercase alphanumeric with _ or -, non-empty
version: semver format (X.Y.Z)
adapterType: one of http, subprocess, mcp, langchain, custom
entrypoint: non-empty string (URL for http, command for subprocess)
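These rules translate directly into a small validator. The sketch below is an illustration of the four listed checks only, using a hypothetical `validate_skill` function; the real tool may validate additional fields (for example the input and output schemas) and accept semver forms beyond plain X.Y.Z.

```python
import re

ID_PATTERN = re.compile(r"[a-z0-9_-]+")
SEMVER_PATTERN = re.compile(r"\d+\.\d+\.\d+")
ADAPTER_TYPES = {"http", "subprocess", "mcp", "langchain", "custom"}


def validate_skill(skill):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    if not ID_PATTERN.fullmatch(str(skill.get("id", ""))):
        errors.append("id must be lowercase alphanumeric with _ or -")
    if not SEMVER_PATTERN.fullmatch(str(skill.get("version", ""))):
        errors.append("version must use the X.Y.Z format")
    if skill.get("adapterType") not in ADAPTER_TYPES:
        errors.append("adapterType must be one of: " + ", ".join(sorted(ADAPTER_TYPES)))
    entrypoint = skill.get("entrypoint")
    if not isinstance(entrypoint, str) or not entrypoint.strip():
        errors.append("entrypoint must be a non-empty string")
    return errors


good = {
    "id": "my_skill_v1",
    "version": "1.0.0",
    "adapterType": "subprocess",
    "entrypoint": "python3 skill.py",
}
bad = {"id": "My Skill", "version": "1.0", "adapterType": "grpc", "entrypoint": ""}

print(validate_skill(good))      # []
print(len(validate_skill(bad)))  # 4
```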

Global Options

These options are available on all commands:

Option	Description
--------	-------------
`-c, --config`	Path to configuration file
`--json`	JSON output format (CI-friendly)
`--no-color`	Disable colored output
`-v, --verbose`	Verbose logging
`--version`	Show version
`-h, --help`	Show help

Eval Skills

Overview