You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.
Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature — you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.
Autoresearch enforces two things that make AI agents effective researchers:
/autoresearch setup — Interactive setup: define the experiment scope, metric, target files, and constraints/autoresearch run — Start the autonomous experiment loop/autoresearch analyze — Analyze results.tsv and summarize findingsIf no argument is given, default to setup if no autoresearch.config.md exists in the project root, otherwise default to run.
/autoresearch setup)Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to autoresearch.config.md in the project root.
1. GOAL: What are you trying to optimize? (e.g., "minimize validation loss", "maximize throughput", "reduce latency")
2. METRIC: What is the single number that determines success?
- How is it measured? (command, script, test output)
- What direction is better? (lower/higher)
3. TARGET FILES: Which file(s) can you modify?
- List explicitly. Everything else is READ-ONLY.
4. RUN COMMAND: What command runs one experiment?
- e.g., `python train.py`, `make benchmark`, `npm test`
5. EXTRACT COMMAND: How do you extract the metric from the run output?
- e.g., `grep "^val_loss:" run.log`, parse JSON output, read a file
6. TIME BUDGET: How long should each experiment run?
- Fixed time budget makes experiments directly comparable.
- Also set a kill timeout (e.g., 2x the budget).
7. CONSTRAINTS:
- Files that must NOT be modified (evaluation, data prep, etc.)
- Packages that must NOT be added
- Resources limits (memory, disk, etc.)
- Any invariants that must hold
8. BRANCH TAG: Name for this experiment session.
- Branch will be: autoresearch/<tag>
- e.g., autoresearch/mar17-lr-sweep
9. BASELINE: Do we need to run a baseline first? (usually yes)
After resolving all questions, write autoresearch.config.md:
# Autoresearch Configuration
## Goal
<what we're optimizing>
## Metric
- **Name**: <metric name>
- **Direction**: <lower|higher> is better
- **Extract command**: <how to get the number from run output>
## Target Files
- <file1> (description of what can be changed)
- <file2> (description of what can be changed)
## Read-Only Files
- <file1> (why it's read-only)
## Run Command
## Time Budget
- **Per experiment**: <duration>
- **Kill timeout**: <duration>
## Constraints
- <constraint 1>
- <constraint 2>
## Branch
autoresearch/<tag>
## Notes
<any additional context from the user>
git checkout -b autoresearch/ from the current branchresults.tsv with header: commit\t\tstatus\tdescription /autoresearch run)Read autoresearch.config.md to load the experiment protocol. Then enter the loop.
results.tsv and recent git log to understand what's been tried# 1. Make ONE focused change to target file(s)
# - Change only one variable at a time
# - Keep the change small and reviewable
# 2. Commit the change
git add <target files>
git commit -m "<concise description of the change>"
# 3. Run the experiment
<run_command> > run.log 2>&1
# 4. Extract the metric
<extract_command>
# 5. Handle crashes
# If the run crashed or timed out:
# - Read the error from run.log
# - Record as crash in results.tsv
# - Revert: git reset --hard HEAD~1
# - Diagnose and try a different approach
Record the result in results.tsv (tab-separated, do NOT commit this file):
<commit_hash>\t<metric_value>\t<status>\t<description>
Where status is one of:
keep — metric improved, commit stays on branchdiscard — metric equal or worse, revert the commitcrash — run failed, revert the commitIF metric improved (strictly better than best so far):
→ KEEP the commit (branch advances)
→ Log: "KEEP: <description> (<metric>: <old> → <new>)"
ELIF metric equal or worse:
→ DISCARD: git reset --hard HEAD~1
→ Log: "DISCARD: <description> (<metric>: <value> vs best <best>)"
ELIF crashed or timed out:
→ CRASH: git reset --hard HEAD~1
→ Log: "CRASH: <description> (error: <brief error>)"
What to try (roughly in order of expected impact):
When stuck (no improvement in 5+ consecutive experiments):
Simplicity criterion:
/autoresearch analyze)Read results.tsv and git log, then produce a summary:
Format as a clear report. If possible, suggest the user visualize with a progress chart.
This protocol works for any optimization task, not just ML training. Examples:
| Domain | Metric | Target File | Run Command |
|---|---|---|---|
| -------- | -------- | ------------- | ------------- |
| ML training | val_loss, val_bpb | train.py | python train.py |
| Compiler optimization | benchmark time | config.toml | make bench |
| Web performance | Lighthouse score | webpack.config.js | npm run build && lighthouse |
| Algorithm tuning | ops/sec | solver.py | python benchmark.py |
| Prompt engineering | eval accuracy | prompts.yaml | python eval.py |
| Database tuning | query latency | postgresql.conf | pgbench |
| CSS/rendering | layout shift score | styles.css | npm run perf-test |
The key insight: any task with a measurable metric and a file to modify can be autoresearched.
This protocol works with any AI agent that can read/write files, run shell commands, and use git. If you're running this outside OpenClaw (e.g., Claude Code, Codex, Cursor, Aider):
autoresearch.config.md for the experiment protocolresults.tsv as your experiment memoryFor the original autoresearch methodology and implementation details, see reference.md.
共 1 个版本