
Skill Forge

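Use when you need to automatically generate a task_suite.yaml test task set for an existing Skill, generate a complete SKILL.md + task_suite from a skill_spec.yaml, or run the full "generate → evaluate → improve" pipeline in one step. From SKILL.md, it reads the frontmatter/When to Use/exampl...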
Author: lanyasheng
Uncategorized · clawhub · v1.0.1 · Key: not required

Overview

Skill Forge

Generate Skills from requirements AND generate task_suite.yaml for evaluation.

The primary value of this skill is task_suite generation -- turning a SKILL.md into a structured test harness that improvement-evaluator can run. Secondary value is generating SKILL.md from a structured skill_spec.yaml.

Key differentiator: Skill Forge does not merely scaffold a skeleton; it performs static analysis of the SKILL.md to extract testable claims (from five distinct sources) and assigns the appropriate judge type per task, producing a suite that is immediately runnable by improvement-evaluator.

When to Use

  1. Add tests to an existing Skill -- You have a SKILL.md but no task_suite.yaml. Run --from-skill to analyze the SKILL.md and generate a test suite automatically. The generator extracts scenarios from five sources in the document and assigns the right judge per task.
  2. Create a new Skill from scratch -- You have a skill_spec.yaml describing what the skill should do. Run --from-spec to generate both a complete SKILL.md (with frontmatter, sections, examples) and a matching task_suite.yaml.
  3. Full lifecycle -- Generate a skill, test it with --evaluate, and optionally improve it with --auto-improve in one pass. Combines skill-forge, improvement-evaluator, and improvement-orchestrator into a single command.
  4. Validate coverage of an existing suite -- Compare the generated task_suite against a hand-written one to find untested scenarios.

When NOT to Use

  • You just want to evaluate an existing skill with an existing task_suite.yaml → use improvement-evaluator
  • You just want to improve a skill that already has tests → use improvement-orchestrator
  • You want to manually write a SKILL.md with templates → use skill-creator
  • You want to score improvement candidates (not generate tests) → use improvement-discriminator
  • You want to merge overlapping skills into one consolidated skill → use skill-distill

Modes

Mode A: --from-skill (Generate task suite for existing SKILL.md)

# Generate test suite for an existing skill
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output

# Generate and immediately evaluate
python3 scripts/forge.py --from-skill /path/to/skill-dir --output /path/to/output --evaluate

Reads the SKILL.md, extracts scenarios from five sources in priority order:

  1. Frontmatter (description, triggers) → 1 core-capability test
  2. "When to Use" bullets → up to 3 positive use-case tests
  3. tags → up to 2 keyword-match tests (uses ContainsJudge)
  4. tags → up to 2 negative tests (should avoid bad patterns)
  5. Output format/CLI sections → 1 format compliance test

Produces task_suite.yaml in the output directory. Example generated output:

skill_id: release-notes-generator
version: "1.0"
generated_by: skill-forge
tasks:
  - id: release-notes-generator-core-capability
    description: "Test core capability described in skill description"
    prompt: "You are an AI assistant with this skill loaded..."
    judge:
      type: llm-rubric
      rubric: "The output should demonstrate the capability..."
      pass_threshold: 0.7
    timeout_seconds: 120
    source: frontmatter.description
  - id: release-notes-generator-use-case-01
    description: "Use case: Generate notes from git log between tags"
    prompt: "Scenario: Generate notes from git log..."
    judge:
      type: llm-rubric
      rubric: "The output should address this use case..."
      pass_threshold: 0.6
    timeout_seconds: 120
    source: when_to_use
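
The mapping from a "When to Use" bullet to a task entry can be sketched roughly as follows. This is a minimal illustration, not the code in forge.py: the helper name `use_case_tasks` is hypothetical, and the id pattern, rubric text, 0.6 threshold, and 120-second timeout are simply copied from the example output above.

```python
from typing import Dict, List


def use_case_tasks(skill_id: str, bullets: List[str], limit: int = 3) -> List[Dict]:
    """Hypothetical helper: turn "When to Use" bullets into task entries.

    Mirrors the shape of the example task_suite.yaml; the real generator
    may build prompts and rubrics differently.
    """
    tasks = []
    # Only the first `limit` bullets become tasks ("up to 3" in the text).
    for index, bullet in enumerate(bullets[:limit], start=1):
        text = bullet.strip()
        tasks.append({
            "id": f"{skill_id}-use-case-{index:02d}",
            "description": f"Use case: {text}",
            "prompt": f"Scenario: {text}",
            "judge": {
                "type": "llm-rubric",
                "rubric": "The output should address this use case...",
                "pass_threshold": 0.6,
            },
            "timeout_seconds": 120,
            "source": "when_to_use",
        })
    return tasks


tasks = use_case_tasks(
    "release-notes-generator",
    [
        "Generate notes from git log between tags",
        "Summarize breaking changes",
        "Group commits by type",
        "Draft an upgrade guide",
    ],
)
print([task["id"] for task in tasks])
```

With four bullets and the default limit, only the first three produce tasks, which matches the "up to 3" cap for this source.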

Mode B: --from-spec (Generate skill + task suite from spec)

# Generate complete skill from a spec file
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output

# Generate, evaluate, and auto-improve if below SOLID grade
python3 scripts/forge.py --from-spec spec.yaml --output /path/to/output --auto-improve

Reads a skill_spec.yaml (see references/spec-format.md) and generates:

  1. A complete SKILL.md with frontmatter, When to Use / When NOT to Use, examples, output format
  2. A task_suite.yaml derived from the generated SKILL.md (same five-source extraction)

The spec format requires only name and purpose; optional fields (inputs, outputs, quality_criteria, domain_knowledge, reference_skills) enrich the generated SKILL.md. Example minimal spec:

name: release-notes-generator
purpose: Generate structured release notes from git commit history

inputs:
  - name: commits
    type: git-log
    description: "Git commit log between two tags"

outputs:
  - name: release-notes
    format: markdown
    description: "Structured release notes with sections"

quality_criteria:
  - name: completeness
    description: "All commits accounted for in the notes"
    weight: 0.3
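
A pre-flight check for the "only name and purpose are required" rule might look like the sketch below. It assumes the spec has already been parsed into a dict; `check_spec` is a hypothetical name, and the authoritative field definitions remain those in references/spec-format.md.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("name", "purpose")
OPTIONAL_FIELDS = (
    "inputs",
    "outputs",
    "quality_criteria",
    "domain_knowledge",
    "reference_skills",
)


def check_spec(spec: Dict) -> List[str]:
    """Return a list of problems for a parsed skill_spec (empty if usable)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = spec.get(field)
        # Assumption: required fields must be non-empty strings.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing required field: {field}")
    return problems


def present_optional_fields(spec: Dict) -> List[str]:
    """List which optional enrichment fields the spec actually supplies."""
    return [field for field in OPTIONAL_FIELDS if field in spec]


spec = {
    "name": "release-notes-generator",
    "purpose": "Generate structured release notes from git commit history",
    "inputs": [{"name": "commits", "type": "git-log"}],
}
print(check_spec(spec))               # no problems for this spec
print(present_optional_fields(spec))  # which optional fields are present
```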

Common Flags

| Flag | Effect |
|------|--------|
| --mock | Use mock LLM (for testing without API calls) |
| --evaluate | Run improvement-evaluator after generation (requires it installed) |
| --auto-improve | Run improvement-orchestrator if score below SOLID (requires it installed) |

Output Artifacts

| Request | Deliverable | Location |
|---------|-------------|----------|
| --from-skill | task_suite.yaml with 5-10 test tasks | /task_suite.yaml |
| --from-spec | SKILL.md + task_suite.yaml | //SKILL.md, //task_suite.yaml |
| --evaluate | Evaluation report (pass/fail per task, aggregate pass rate) | stdout + /evaluation_report.json |
| --auto-improve | Improved SKILL.md (if score was below SOLID) | in-place update of SKILL.md |

All generated YAML files use allow_unicode: True and default_flow_style: False for human readability. Files are written atomically (write-then-rename) to prevent corruption on crash.
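
The write-then-rename pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the skill's own helper: `atomic_write_text` is a hypothetical name, and the temp-file suffix and fsync call are choices made for this example.

```python
import os
import tempfile


def atomic_write_text(path: str, text: str) -> None:
    """Write text to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # readers see the old file or the new one, never a partial one
    except BaseException:
        # Remove the temp file so a failed write leaves no debris behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The two options named above match keyword arguments of PyYAML's dump functions, so the text passed in could plausibly come from `yaml.safe_dump(data, allow_unicode=True, default_flow_style=False)`; whether forge.py calls it exactly that way is not confirmed by this page.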

Task Generation Strategy

The generator extracts test scenarios from 5 sources in the SKILL.md:

  1. Frontmatter description → 1 core-capability test
  2. "When to Use" section → up to 3 positive use-case tests
  3. tags → up to 2 keyword-match tests
  4. tags → up to 2 anti-pattern avoidance tests
  5. Output format section → 1 format compliance test

Tasks are deduplicated and capped at 10 per suite.

Judge Selection

  • Scenario with specific keywords/outputs → ContainsJudge
  • Scenario requiring quality/style assessment → LLMRubricJudge
  • Scenario with structured output → PytestJudge (if test script can be generated)
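
One way to read the three rules above is as an ordered fallback. The sketch below is an assumption on several points: the scenario keys (`structured_output`, `test_script`, `expected_keywords`) are hypothetical, and the precedence between rules is chosen for illustration, since the text does not state which rule wins when several apply.

```python
from typing import Dict


def pick_judge(scenario: Dict) -> str:
    """Hypothetical rule order for choosing a judge type for one scenario."""
    # Structured output gets a pytest-based judge only if a test script exists.
    if scenario.get("structured_output") and scenario.get("test_script"):
        return "PytestJudge"
    # Concrete expected keywords can be checked by substring matching.
    if scenario.get("expected_keywords"):
        return "ContainsJudge"
    # Everything else needs a quality/style assessment.
    return "LLMRubricJudge"


print(pick_judge({"expected_keywords": ["## Breaking Changes"]}))
print(pick_judge({"structured_output": True}))
```

Note that a structured-output scenario without a generated test script falls through to the other rules, which follows the "if test script can be generated" condition.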

Harness Pattern Tasks (for scripted skills)

When the target skill has a scripts/ directory, forge auto-generates additional test tasks checking execution-harness pattern adoption:

  • Timeout handling: Does the skill handle subprocess.TimeoutExpired?
  • Atomic writes: Does it use write_json/write_text from lib/common (atomic write-then-rename)?
  • Backup/rollback: Does it create backups before file modifications?
  • Error escalation: Does it have graduated error handling (not just crash-on-first-failure)?
  • State persistence: Does it write recoverable state for crash recovery?

These tasks use ContainsJudge to grep the skill's Python source code. They only apply to orchestration/tool-type skills — pure-text knowledge skills skip this category.
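
A substring scan of the kind described here might look like the following sketch. The pattern table and the `scan_scripts` name are hypothetical; only the first two checks are shown because they are the ones for which the text names concrete identifiers.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical pattern table: check name -> substrings, any of which counts as a hit.
HARNESS_PATTERNS: Dict[str, List[str]] = {
    "timeout_handling": ["subprocess.TimeoutExpired"],
    "atomic_writes": ["write_json", "write_text"],
}


def scan_scripts(scripts_dir: Path) -> Dict[str, bool]:
    """Report which harness patterns appear anywhere in the .py files under scripts_dir."""
    source = "\n".join(
        path.read_text(encoding="utf-8")
        for path in sorted(scripts_dir.rglob("*.py"))
    )
    return {
        name: any(needle in source for needle in needles)
        for name, needles in HARNESS_PATTERNS.items()
    }
```

A plain substring match is cheap but coarse: `write_text` also matches `pathlib.Path.write_text`, so a hit shows that the name appears in the source, not that the write is actually atomic.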

Why Null-Skill Calibration

A generated task suite is only useful if it measures the _skill's_ contribution, not the base LLM's general ability. Without calibration, a naive suite can report 80%+ pass rates even when the SKILL.md adds zero value -- because the LLM already knows how to answer those questions.

Null-skill calibration addresses this by running every candidate task against a "null skill" (empty context, no SKILL.md loaded). Any task the null skill passes trivially is filtered out before the final suite is emitted. This ensures that every surviving task genuinely requires the knowledge or structure encoded in the SKILL.md.

Tradeoff: Null-skill calibration adds one extra LLM call per candidate task (or a heuristic keyword check in --mock mode). For a typical 10-task candidate set, this means ~10 additional calls during suite generation. The cost is justified because an uncalibrated suite gives false confidence: a skill that scores 9/10 on easy tasks looks "SOLID" but may add no value over a bare model. Calibrated suites reliably distinguish genuine skill contributions from baseline LLM capability.

When --mock is used, calibration falls back to a heuristic: tasks whose prompt contains only generic verbs ("explain", "describe", "list") without skill-specific terminology are filtered. This is less precise than LLM-based calibration but costs zero API calls.
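
The mock-mode heuristic could be approximated as below. This is a sketch under assumptions: the verb list is taken from the paragraph above, while the way skill-specific terms are obtained (here, a caller-supplied list) and the function name `is_generic_prompt` are hypothetical.

```python
import re
from typing import Iterable

GENERIC_VERBS = {"explain", "describe", "list"}


def is_generic_prompt(prompt: str, skill_terms: Iterable[str]) -> bool:
    """True if the prompt uses a generic verb and no skill-specific term."""
    lowered = prompt.lower()
    words = set(re.findall(r"[a-z0-9_-]+", lowered))
    has_generic_verb = bool(words & GENERIC_VERBS)
    has_skill_term = any(term.lower() in lowered for term in skill_terms)
    return has_generic_verb and not has_skill_term


terms = ["task_suite.yaml", "release notes"]
print(is_generic_prompt("Explain what a changelog is", terms))         # filtered
print(is_generic_prompt("Describe the task_suite.yaml format", terms)) # kept
```

Tasks for which this returns True would be dropped from the candidate set without any API call.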

The calibration step runs after deduplication and before the final cap of 10 tasks per suite.
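
The ordering of the three steps can be made concrete with a short sketch. It assumes deduplication keys on the normalized prompt text and that the null-skill run is available as a callable; both are illustrative assumptions, and `finalize_suite` is a hypothetical name.

```python
from typing import Callable, Dict, List


def finalize_suite(
    candidates: List[Dict],
    null_skill_passes: Callable[[Dict], bool],
    cap: int = 10,
) -> List[Dict]:
    """Deduplicate, drop tasks the null skill passes, then cap the suite."""
    seen = set()
    unique = []
    for task in candidates:
        key = task["prompt"].strip().lower()  # assumption: dedup by normalized prompt
        if key not in seen:
            seen.add(key)
            unique.append(task)
    # Calibration: keep only tasks that the null skill fails.
    calibrated = [task for task in unique if not null_skill_passes(task)]
    # The cap is applied last, so filtered-out tasks never occupy a slot.
    return calibrated[:cap]


candidates = [{"id": f"t{i}", "prompt": f"prompt {i % 14}"} for i in range(16)]
suite = finalize_suite(candidates, lambda task: task["id"] in {"t0", "t1"})
print([task["id"] for task in suite])
```

Running calibration before the cap means that trivially passable tasks are replaced by harder candidates when enough of them exist, rather than shrinking the final suite.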

Related Skills

| Skill | Relationship | When to prefer over skill-forge |
|-------|--------------|---------------------------------|
| improvement-evaluator | Downstream consumer: runs the generated task_suite.yaml and reports pass/fail per task | You already have both SKILL.md and task_suite.yaml, just need to run them |
| improvement-orchestrator | Drives the full generate-evaluate-improve loop; skill-forge is one step in this loop | You want automatic multi-round improvement, not just test generation |
| improvement-generator | Generates improvement candidates (patches) for a SKILL.md | You want to improve an existing skill's prose/structure, not generate tests |
| improvement-discriminator | Scores improvement candidates via multi-reviewer blind panel | You need to judge which candidate patch is best |
| skill-creator | Manual SKILL.md authoring guide with templates | You prefer hand-writing the SKILL.md rather than generating it |
| skill-distill | Merges multiple overlapping skills into one distilled skill | You have redundant skills to consolidate, not a new skill to create |

Version History

1 version in total

  • v1.0.1 (current)
    2026-05-07 10:41, security status: safe

Security Checks

Tencent Cloud Security (Keen)

Safe, no risk

Tencent Cloud Security (Sanbu)

Safe, no risk
