概述

Autoresearch: Autonomous Experiment Protocol for AI Agents

You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.

Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature — you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.

Overview

Autoresearch enforces two things that make AI agents effective researchers:

Discipline: Change only one variable at a time. Form a hypothesis, run the experiment, confirm or refute. Without this, you'll tweak three things at once, get a result, and have no clue which made the difference.

Memory: Git history is your experiment notebook. You can see what you've already tried, what worked, what didn't. Without this, you'd endlessly repeat yourself. With it, you iteratively build on your own results.

Commands

/autoresearch setup — Interactive setup: define the experiment scope, metric, target files, and constraints
/autoresearch run — Start the autonomous experiment loop
/autoresearch analyze — Analyze results.tsv and summarize findings

If no argument is given, default to setup if no autoresearch.config.md exists in the project root, otherwise default to run.

Phase 1: Setup (`/autoresearch setup`)

Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to autoresearch.config.md in the project root.

Questions to resolve with the user:

1. GOAL: What are you trying to optimize? (e.g., "minimize validation loss", "maximize throughput", "reduce latency")

2. METRIC: What is the single number that determines success?
   - How is it measured? (command, script, test output)
   - What direction is better? (lower/higher)

3. TARGET FILES: Which file(s) can you modify?
   - List explicitly. Everything else is READ-ONLY.

4. RUN COMMAND: What command runs one experiment?
   - e.g., `python train.py`, `make benchmark`, `npm test`

5. EXTRACT COMMAND: How do you extract the metric from the run output?
   - e.g., `grep "^val_loss:" run.log`, parse JSON output, read a file

6. TIME BUDGET: How long should each experiment run?
   - Fixed time budget makes experiments directly comparable.
   - Also set a kill timeout (e.g., 2x the budget).

7. CONSTRAINTS:
   - Files that must NOT be modified (evaluation, data prep, etc.)
   - Packages that must NOT be added
   - Resources limits (memory, disk, etc.)
   - Any invariants that must hold

8. BRANCH TAG: Name for this experiment session.
   - Branch will be: autoresearch/<tag>
   - e.g., autoresearch/mar17-lr-sweep

9. BASELINE: Do we need to run a baseline first? (usually yes)

Write the config file

After resolving all questions, write autoresearch.config.md:

# Autoresearch Configuration

## Goal
<what we're optimizing>

## Metric
- **Name**: <metric name>
- **Direction**: <lower|higher> is better
- **Extract command**: <how to get the number from run output>

## Target Files
- <file1> (description of what can be changed)
- <file2> (description of what can be changed)

## Read-Only Files
- <file1> (why it's read-only)

## Run Command


## Time Budget
- **Per experiment**: <duration>
- **Kill timeout**: <duration>

## Constraints
- <constraint 1>
- <constraint 2>

## Branch
autoresearch/<tag>

## Notes
<any additional context from the user>

Initialize the experiment

Create branch: git checkout -b autoresearch/ from the current branch
Read all target files and read-only files to build full context
Initialize results.tsv with header: commit\t\tstatus\tdescription
Run baseline experiment (no changes) and record it
Confirm setup is complete, then proceed to the experiment loop

Phase 2: Experiment Loop (`/autoresearch run`)

Read autoresearch.config.md to load the experiment protocol. Then enter the loop.

Before each experiment

Review history: Read results.tsv and recent git log to understand what's been tried
Form hypothesis: Based on what you've learned, what single change do you think will improve the metric? Write it down clearly before touching any code.
Justify: Why do you expect this to help? Reference prior results, known techniques, or reasoning.

Run the experiment

# 1. Make ONE focused change to target file(s)
#    - Change only one variable at a time
#    - Keep the change small and reviewable

# 2. Commit the change
git add <target files>
git commit -m "<concise description of the change>"

# 3. Run the experiment
<run_command> > run.log 2>&1

# 4. Extract the metric
<extract_command>

# 5. Handle crashes
#    If the run crashed or timed out:
#    - Read the error from run.log
#    - Record as crash in results.tsv
#    - Revert: git reset --hard HEAD~1
#    - Diagnose and try a different approach

After each experiment

Record the result in results.tsv (tab-separated, do NOT commit this file):

<commit_hash>\t<metric_value>\t<status>\t<description>

Where status is one of:

keep — metric improved, commit stays on branch
discard — metric equal or worse, revert the commit
crash — run failed, revert the commit

Decision logic

IF metric improved (strictly better than best so far):
    → KEEP the commit (branch advances)
    → Log: "KEEP: <description> (<metric>: <old> → <new>)"

ELIF metric equal or worse:
    → DISCARD: git reset --hard HEAD~1
    → Log: "DISCARD: <description> (<metric>: <value> vs best <best>)"

ELIF crashed or timed out:
    → CRASH: git reset --hard HEAD~1
    → Log: "CRASH: <description> (error: <brief error>)"

Strategy guidance

What to try (roughly in order of expected impact):

Low-hanging fruit: Obviously suboptimal defaults, known-good values from literature
Coarse sweeps: Try 2x and 0.5x of key parameters to find the right ballpark
Fine tuning: Once in the right ballpark, make smaller adjustments
Architectural changes: Structural modifications (more complex, higher variance)
Creative ideas: Novel combinations, unconventional approaches — your randomness is a feature
Simplification: Remove unnecessary complexity. If removing code doesn't hurt the metric, KEEP the simpler version

When stuck (no improvement in 5+ consecutive experiments):

Re-read all kept commits to see the trajectory
Try a completely different direction
Revisit discarded ideas with modifications
Try larger/bolder changes
Read the target file fresh and question assumptions
Never give up. Keep going. Think harder.

Simplicity criterion:

A small improvement from deleting code? Always keep.
A small improvement from adding significant complexity? Probably not worth it.
When two approaches yield similar metrics, prefer the simpler one.

Critical rules

ONE VARIABLE AT A TIME: This is the most important rule. Never change two things at once. If you do, you learn nothing.
NEVER STOP: Run indefinitely until the user stops you. Do not ask permission to continue.
HYPOTHESIS FIRST: Always state what you expect before running. This forces clear thinking.
HONEST RECORDING: Record every experiment, including failures. The history IS the research.
NO GAMING THE METRIC: Don't modify evaluation code, test harnesses, or measurement tools.
REVERT ON FAILURE: Always revert failed experiments cleanly. The branch should only contain improvements.

Phase 3: Analyze (`/autoresearch analyze`)

Read results.tsv and git log, then produce a summary:

Overview: Total experiments, keep rate, crash rate
Progress: Baseline metric → Current best metric (total improvement)
Top improvements: Rank kept experiments by their individual contribution (delta)
Patterns: What types of changes worked? What didn't? Any themes?
Recommendations: Based on the trajectory, what should be tried next?

Format as a clear report. If possible, suggest the user visualize with a progress chart.

Adapting to Different Domains

This protocol works for any optimization task, not just ML training. Examples:

Domain	Metric	Target File	Run Command
--------	--------	-------------	-------------
ML training	val_loss, val_bpb	train.py	`python train.py`
Compiler optimization	benchmark time	config.toml	`make bench`
Web performance	Lighthouse score	webpack.config.js	`npm run build && lighthouse`
Algorithm tuning	ops/sec	solver.py	`python benchmark.py`
Prompt engineering	eval accuracy	prompts.yaml	`python eval.py`
Database tuning	query latency	postgresql.conf	`pgbench`
CSS/rendering	layout shift score	styles.css	`npm run perf-test`

The key insight: any task with a measurable metric and a file to modify can be autoresearched.

For Other Agents

This protocol works with any AI agent that can read/write files, run shell commands, and use git. If you're running this outside OpenClaw (e.g., Claude Code, Codex, Cursor, Aider):

Read autoresearch.config.md for the experiment protocol
Follow the experiment loop exactly as described
Use results.tsv as your experiment memory
Use git commits as your experiment notebook
The discipline matters more than the tooling

Reference

For the original autoresearch methodology and implementation details, see reference.md.

版本历史

共 1 个版本

v1.0.2 当前

2026-03-28 21:23 安全安全

安全检测

腾讯云安全 (Keen)

安全，无风险

查看报告

腾讯云安全 (Sanbu)

安全，无风险

查看报告

autoresearch

概述

Autoresearch: Autonomous Experiment Protocol for AI Agents

Overview

Commands

Phase 1: Setup (`/autoresearch setup`)

Questions to resolve with the user:

Write the config file

Initialize the experiment

Phase 2: Experiment Loop (`/autoresearch run`)

Before each experiment

Run the experiment

After each experiment

Decision logic

Strategy guidance

Critical rules

Phase 3: Analyze (`/autoresearch analyze`)

Adapting to Different Domains

For Other Agents

Reference

版本历史

安全检测

腾讯云安全 (Keen)

腾讯云安全 (Sanbu)

🔗 相关推荐

MoltGuard - Security & Antivirus & Guardrails

Self-Improving + Proactive Agent

self-improving agent

autoresearch

概述

Autoresearch: Autonomous Experiment Protocol for AI Agents

Overview

Commands

Phase 1: Setup (/autoresearch setup)

Questions to resolve with the user:

Write the config file

Initialize the experiment

Phase 2: Experiment Loop (/autoresearch run)

Before each experiment

Run the experiment

After each experiment

Decision logic

Strategy guidance

Critical rules

Phase 3: Analyze (/autoresearch analyze)

Adapting to Different Domains

For Other Agents

Reference

版本历史

安全检测

腾讯云安全 (Keen)

腾讯云安全 (Sanbu)

🔗 相关推荐

MoltGuard - Security & Antivirus & Guardrails

Self-Improving + Proactive Agent

self-improving agent

Phase 1: Setup (`/autoresearch setup`)

Phase 2: Experiment Loop (`/autoresearch run`)

Phase 3: Analyze (`/autoresearch analyze`)