Autonomous benchmark-driven skill optimisation for OpenClaw. Inspired by Andrej Karpathy's autoresearch — the same modify → test → score → keep/discard loop, applied to agent skill quality instead of GPU training.
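The modify → test → score → keep/discard loop can be sketched as a greedy search over candidate edits. This is a minimal illustration, not the skill's actual runner: `run_loop`, `toy_score`, and the `edits` list are hypothetical names, and the scorer is a toy stand-in for the LLM judge.

```python
from typing import Callable, Iterable, Tuple


def run_loop(
    skill_text: str,
    candidates: Iterable[Callable[[str], str]],
    score: Callable[[str], float],
) -> Tuple[str, float]:
    """Greedy keep/discard loop: keep an edit only if it beats the best score so far."""
    best_text, best_score = skill_text, score(skill_text)
    for modify in candidates:
        trial = modify(best_text)       # modify
        trial_score = score(trial)      # test + score
        if trial_score > best_score:    # keep
            best_text, best_score = trial, trial_score
        # otherwise discard: best_text is left unchanged
    return best_text, best_score


def toy_score(text: str) -> float:
    """Hypothetical stand-in for the judge: 5 points each for mentioning units and a source."""
    return 5.0 * ("units" in text) + 5.0 * ("source" in text)


# Hypothetical candidate edits; the second one does not improve the score.
edits = [
    lambda t: t + " Always state units.",
    lambda t: t + " Be as long as possible.",
    lambda t: t + " Cite the data source.",
]

best_text, best_score = run_loop("Report the weather.", edits, toy_score)
print(best_score)  # 10.0
print(best_text)   # Report the weather. Always state units. Cite the data source.
```

The second edit is discarded because it does not raise the score, so the final text contains only the two edits that helped.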
- "optimise my weather skill"
- "run autooptimise on [skill-name]"
- "benchmark my [skill-name] skill"
- "improve my skill overnight"

| File | Purpose |
|---|---|
| benchmark/tasks.json | Test task suite (prompts + expected qualities) |
| benchmark/scorer.md | LLM judge scoring rubric |
| runner/run_experiment.md | Autonomous loop instructions (load this next) |
| runner/experiment_log.md | Auto-created run log (gitignored) |
Load runner/run_experiment.md next; it contains the full loop instructions.

Use the best available LLM judge model (prefer a strong reasoning model). Score each task 0–10 on each criterion in the rubric.

Full rubric: benchmark/scorer.md
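As a rough illustration of how per-task judge scores might be combined into one number for the keep/discard decision, the sketch below averages them and rejects values outside the 0–10 range. Averaging is an assumption here, not something the rubric specifies, and `benchmark_score` and the task IDs are hypothetical.

```python
from typing import Dict


def benchmark_score(task_scores: Dict[str, float]) -> float:
    """Average per-task judge scores, rejecting anything outside 0-10."""
    if not task_scores:
        raise ValueError("no task scores to aggregate")
    for task_id, value in task_scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"score for {task_id!r} is outside 0-10: {value}")
    return sum(task_scores.values()) / len(task_scores)


# Hypothetical scores for the same three tasks before and after a skill edit.
baseline = {"task-1": 6, "task-2": 7, "task-3": 5}
candidate = {"task-1": 8, "task-2": 7, "task-3": 6}

print(benchmark_score(baseline))   # 6.0
print(benchmark_score(candidate))  # 7.0
```

Under this assumption the candidate would be kept, since its mean score is higher than the baseline's.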
Do not modify benchmark/tasks.json or benchmark/scorer.md during a run.

Results are logged to runner/experiment_log.md.