Real Karpathy self-improvement loop: evaluate → modify → re-evaluate → keep/revert → repeat.
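The loop above can be sketched as a simple hill-climb. This is a minimal illustration, not the code in `scripts/self_improve.py`: the names `self_improve`, `toy_evaluate`, and `toy_propose` are hypothetical, and the acceptance rule (keep a candidate only if its mean score rises) is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def mean(scores: Scores) -> float:
    """Average of all dimension scores."""
    return sum(scores.values()) / len(scores)


def self_improve(
    state: str,
    evaluate: Callable[[str], Scores],
    propose: Callable[[str, Scores], str],
    max_iterations: int = 5,
) -> Tuple[str, Scores, List[str]]:
    """evaluate -> modify -> re-evaluate -> keep/revert -> repeat."""
    best_scores = evaluate(state)
    decisions: List[str] = []
    for _ in range(max_iterations):
        candidate = propose(state, best_scores)
        if candidate == state:
            # Nothing to change this round.
            decisions.append("skipped")
            continue
        candidate_scores = evaluate(candidate)
        # Assumed rule: keep only strict improvements of the mean score.
        if mean(candidate_scores) > mean(best_scores):
            state, best_scores = candidate, candidate_scores
            decisions.append("kept")
        else:
            # Revert: the candidate is discarded and state stays as it was.
            decisions.append("reverted")
    return state, best_scores, decisions


# Toy stand-ins (hypothetical): score a text by which sections it contains.
REQUIRED = ["## Usage", "## Related Skills"]


def toy_evaluate(text: str) -> Scores:
    found = sum(1 for section in REQUIRED if section in text)
    return {"accuracy": found / len(REQUIRED)}


def toy_propose(text: str, scores: Scores) -> str:
    for section in REQUIRED:
        if section not in text:
            return text + "\n" + section
    return text


if __name__ == "__main__":
    final_text, final_scores, decisions = self_improve(
        "# My skill", toy_evaluate, toy_propose, max_iterations=3
    )
    print(decisions, final_scores)
```

Keeping the evaluator and the proposer as plain callables makes the keep/revert decision easy to test in isolation.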
improvement-discriminator, improvement-orchestrator, improvement-executor

| Dimension | Checks | Pure-text default |
|---|---|---|
| accuracy | 15 items: frontmatter(3), symptom-driven desc, When to Use/Not, code examples, Usage, few-shot, no vague language, min length, Related Skills, Output Artifacts, atomicity | — |
| coverage | SKILL.md = 60% base + scripts/references/tests/README bonuses | — |
| reliability | pytest pass=1.0, fail=0.5 | 1.0 (pure-text) |
| efficiency | Line count: ≤200=1.0, ≥1200=0.3 | — |
| security | No api_key/password/sk- in SKILL.md, no os.system()/exec() | — |
| trigger_quality | Description length, triggers field, disambiguation | — |
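Two of the dimensions in the table lend themselves to a short sketch. The table fixes only the endpoints of the efficiency score (≤200 lines scores 1.0, ≥1200 lines scores 0.3), so the linear interpolation in between is an assumption, as are the function names and the exact regular expressions used for the security check.

```python
import re
from typing import List


def efficiency_score(line_count: int) -> float:
    """1.0 at <=200 lines, 0.3 at >=1200 lines; linear in between (assumed)."""
    if line_count <= 200:
        return 1.0
    if line_count >= 1200:
        return 0.3
    fraction = (line_count - 200) / (1200 - 200)
    return 1.0 - fraction * (1.0 - 0.3)


# Patterns taken from the security row; the regex details are assumptions.
# The word boundary on "sk-" avoids flagging words such as "task-oriented".
FORBIDDEN_PATTERNS = [
    r"api_key",
    r"password",
    r"\bsk-",
    r"os\.system\(",
    r"\bexec\(",
]


def security_findings(text: str) -> List[str]:
    """Return every forbidden pattern that occurs in the text."""
    return [p for p in FORBIDDEN_PATTERNS if re.search(p, text)]


if __name__ == "__main__":
    print(efficiency_score(150), efficiency_score(1500))
    print(security_findings("run os.system('ls') here"))
```

A plain substring scan like this is coarse: it will also flag documentation that merely mentions `password`, so a real checker would probably need an allowlist.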

| Layer | Capacity | Behavior |
|---|---|---|
| HOT | ≤100 | Always loaded, frequently accessed patterns |
| WARM | Unlimited | Overflow from HOT, loaded on demand |
| COLD | Archive | >3 months inactive (future) |
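The HOT and WARM layers can be pictured as a bounded cache with an overflow store. The sketch below is an illustration under assumptions: the class name `LayeredMemory` is hypothetical, and the table does not say which entry leaves HOT when it is full, so least-recently-used demotion is assumed here. COLD is left out because the table marks it as future work.

```python
from collections import OrderedDict
from typing import Any, Dict, Optional


class LayeredMemory:
    """HOT holds at most `hot_capacity` entries; overflow moves to WARM."""

    def __init__(self, hot_capacity: int = 100) -> None:
        self.hot_capacity = hot_capacity
        self.hot: "OrderedDict[str, Any]" = OrderedDict()  # always loaded
        self.warm: Dict[str, Any] = {}  # unlimited, consulted on demand

    def put(self, key: str, value: Any) -> None:
        self.warm.pop(key, None)  # a key lives in one layer only
        self.hot[key] = value
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.hot_capacity:
            # Assumed policy: demote the least recently used entry.
            old_key, old_value = self.hot.popitem(last=False)
            self.warm[old_key] = old_value

    def get(self, key: str) -> Optional[Any]:
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.warm:
            # Loading from WARM promotes the entry back into HOT.
            value = self.warm.pop(key)
            self.put(key, value)
            return value
        return None


if __name__ == "__main__":
    memory = LayeredMemory(hot_capacity=2)
    for name in ("a", "b", "c"):
        memory.put(name, name.upper())
    print(list(memory.hot), list(memory.warm))
```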
Correct usage: evaluate the quality of a skill
$ python3 scripts/self_improve.py --skill-path /path/to/skill --max-iterations 1
→ Outputs JSON:
{"final_scores": {"accuracy": 0.83, "coverage": 1.0, "reliability": 1.0, ...}}
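→ An accuracy of 0.83 means SKILL.md is missing some of the checked items (for example, Output Artifacts or Related Skills)
Misreading: reliability=1.0 on a pure-text skill does not mean the skill is high quality
→ A pure-text skill has no scripts/, so reliability defaults to 1.0 (with no code, there is nothing to test)
→ The dimensions that actually carry meaning are accuracy and trigger_quality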
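The reading rules in this example can be applied mechanically. The following is a minimal sketch, assuming the output has the `final_scores` shape shown above; the function name `review_scores`, the `is_pure_text` flag, and the 0.9 threshold are hypothetical choices, not part of the tool.

```python
import json
from typing import List


def review_scores(raw_json: str, is_pure_text: bool, threshold: float = 0.9) -> List[str]:
    """Turn a final_scores payload into short review notes."""
    scores = json.loads(raw_json)["final_scores"]
    notes: List[str] = []
    for name, value in scores.items():
        if name == "reliability" and is_pure_text:
            # A pure-text skill gets reliability by default, not from tests.
            notes.append("reliability: default for a pure-text skill, not evidence of quality")
        elif value < threshold:
            notes.append(f"{name}: {value:.2f} is below {threshold:.2f}")
    return notes


if __name__ == "__main__":
    # Only the three values shown in the example output are used here.
    sample = '{"final_scores": {"accuracy": 0.83, "coverage": 1.0, "reliability": 1.0}}'
    for note in review_scores(sample, is_pure_text=True):
        print(note)
```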
# Evaluate (no changes, scores only)
python3 scripts/self_improve.py --skill-path /path/to/skill --max-iterations 1
# Self-improvement loop (5 iterations)
python3 scripts/self_improve.py \
--skill-path /path/to/skill \
--max-iterations 5 \
--memory-dir /path/to/memory \
--state-root /path/to/state
# Track history
python3 scripts/track_progress.py --skill-path /path/to/skill --output progress.json

| Request | Deliverable |
|---|---|
| Evaluate | JSON with 6-dimension scores (0.0-1.0 each) |
| Self-improve | JSON: iterations, kept/reverted/skipped, final_scores, memory stats |
| Track progress | JSON with historical scores and trend data |