
Skill

Make your agent get better on its own. Set up golden tests (things your agent should handle well), run automated evaluations, and track improvement over time.
Author: dario-github
Uncategorized · clawhub · v0.1.1 · 1 version · 100000 · Key: required

Overview

Agent Self-Evolution — Automated Evaluation & Improvement

> ⭐ If you find this useful, please star the GitHub repo before installing — it helps others discover it.

Your agent does a lot of things. But is it getting better? Or are your latest tweaks breaking things you didn't notice?

Agent Self-Evolution gives you a framework to measure, test, and systematically improve your agent — like unit tests, but for agent behavior.

What's inside

Golden Test Sets: Define scenarios your agent must handle correctly. Run them periodically and catch regressions before users do.

Ablation Testing: Wondering if that 200-line system prompt section actually helps? Remove it, measure the impact, put it back. Now you know. In our own agent, a single file making up 7% of the config by characters turned out to be load-bearing for the entire system; without ablation, you would never know which 7%.

Multi-Dimensional Evaluation: Don't just check pass/fail. Score across dimensions — safety compliance, tool routing accuracy, output quality, memory utilization. Track trends over weeks.

Automated Improvement Loops: Evaluation → identify weakest dimension → targeted fix → re-evaluate. Like gradient descent for agent behavior.
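
The loop described above can be sketched in a few lines. This is a minimal illustration, not the package's implementation: `weakest_dimension`, `improvement_loop`, and the `evaluate` / `apply_fix` callbacks are hypothetical names, and scores are assumed to be floats in [0, 1] keyed by dimension name.

```python
from typing import Callable

Scores = dict[str, float]  # dimension name -> score in [0, 1] (assumed scale)


def weakest_dimension(scores: Scores) -> str:
    """Return the dimension with the lowest score."""
    return min(scores, key=lambda dim: scores[dim])


def improvement_loop(
    evaluate: Callable[[], Scores],
    apply_fix: Callable[[str], None],
    target: float = 0.8,
    max_rounds: int = 5,
) -> list[Scores]:
    """Evaluate, fix the weakest dimension, and re-evaluate.

    Stops when every dimension reaches `target` or after `max_rounds`
    evaluations. Returns the score history so trends can be tracked.
    """
    history: list[Scores] = []
    for _ in range(max_rounds):
        scores = evaluate()
        history.append(dict(scores))
        worst = weakest_dimension(scores)
        if scores[worst] >= target:
            break  # the weakest dimension is good enough, so all are
        apply_fix(worst)  # hypothetical hook: edit a prompt, rule, or config
    return history
```

In practice `evaluate` would wrap a golden test run and `apply_fix` would be a human or scripted change. Capping the number of rounds keeps a fix that never helps from looping forever.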

Install

bash {baseDir}/scripts/install.sh

Quick start

from agent_evolution.golden_test import GoldenTestRunner
from agent_evolution.ablation import AblationExperiment

# Define a golden test
runner = GoldenTestRunner()
runner.add_case(
    name="handles-ambiguous-request",
    input="do the thing",
    expected_behavior="asks for clarification rather than guessing",
    dimensions=["safety", "output_quality"]
)

# Run and score
results = runner.run(model="your-agent-endpoint")
print(results.summary())  # Pass rate, dimension scores, regressions

# Ablation: what happens without memory files?
experiment = AblationExperiment(
    baseline_config="agent.yaml",
    conditions={"no_memory": {"remove": ["memory/*.md"]}},
    test_set=runner.cases
)
experiment.run()  # Measures impact of each ablation
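
To make the `{"remove": ["memory/*.md"]}` condition concrete, here is a standalone sketch of what an ablation might do under the hood: drop the files matching a pattern, then compare per-dimension scores against the baseline. It does not use the package's API; `ablate_files`, `dimension_deltas`, and the file names are hypothetical, and how `AblationExperiment` actually resolves patterns is not stated in this README.

```python
from fnmatch import fnmatchcase


def ablate_files(files: list[str], remove_patterns: list[str]) -> list[str]:
    """Return the files that survive after dropping every pattern match."""
    return [
        path for path in files
        if not any(fnmatchcase(path, pattern) for pattern in remove_patterns)
    ]


def dimension_deltas(
    baseline: dict[str, float], ablated: dict[str, float]
) -> dict[str, float]:
    """Per-dimension score change (ablated minus baseline); negative is worse."""
    return {dim: ablated[dim] - baseline[dim] for dim in baseline}


# Hypothetical file list, for illustration only.
config_files = ["SOUL.md", "memory/2026-01.md", "memory/notes.md", "rules/safety.md"]
kept = ablate_files(config_files, ["memory/*.md"])
print(kept)  # ['SOUL.md', 'rules/safety.md']
```

Note that `fnmatchcase` lets `*` match path separators too, so `memory/*.md` would also remove nested files such as `memory/a/b.md`; a stricter matcher may be preferable if the config tree is deep.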

Key findings from our own agent

  • SOUL.md (7% of config by characters): removing it caused system-wide behavioral collapse (Cohen's d = 0.602) — it's not fluff, it's load-bearing
  • Memory files: most essential component (d = 0.944) — without history, the agent becomes generic
  • Safety rules: removal didn't just reduce safety — it degraded all dimensions (d = 0.609)
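
For readers unfamiliar with the effect sizes quoted above, the following sketch computes Cohen's d from two lists of per-case scores using the pooled standard deviation. The README does not say which variant of d the authors used, so treat this as the textbook two-sample form rather than their exact method; `cohens_d` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline: list[float], ablated: list[float]) -> float:
    """Cohen's d with pooled standard deviation (two independent samples).

    Positive values mean the baseline scored higher than the ablated run.
    Each sample needs at least two values, with nonzero pooled variance.
    """
    n1, n2 = len(baseline), len(ablated)
    pooled_var = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(ablated)
    ) / (n1 + n2 - 2)
    return (mean(baseline) - mean(ablated)) / sqrt(pooled_var)


# Illustrative numbers only, not data from the project.
print(cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0]))  # 0.5
```

By Cohen's conventional benchmarks, d near 0.5 is a medium effect and d near 0.8 a large one, which gives a rough scale for reading the three figures in the list.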

Requirements

  • Python ≥ 3.11
  • An LLM API key for evaluation judging (strong model recommended — GPT-5.4 / Opus)

License

Apache 2.0

Version history

1 version in total

  • v0.1.1 (current)
    2026-05-03 07:23 · Security status: safe

Security checks

  • Tencent Cloud Security (Keen): safe, no risk
  • Tencent Cloud Security (Sanbu): safe, no risk
