AutoAgent: The Open Source Framework That Lets AI Agents Improve Themselves
AutoAgent hit #1 on SpreadsheetBench with 96.5% and posted the top GPT-5 score on TerminalBench 2.0 with 55.1%, after optimizing autonomously for more than 24 hours. Every other entry on those leaderboards was hand-engineered. This one was not.
That is the first concrete evidence that an AI agent can autonomously beat manual harness tuning on production benchmarks.
The Problem: Harness Engineering Is the Bottleneck
Building AI agents today is mostly harness engineering. You tweak prompts, add tools, refine orchestration logic, run evals, read error traces, and repeat. It is a manual process that requires someone who deeply understands both the domain and how models behave.
This is fine when you have one agent to optimize. It does not scale when you have hundreds. Companies do not have one workflow to automate. They have hundreds. Each needs a different harness. No team can hand-tune hundreds of harnesses.
AutoAgent collapses that problem. Domain experts define what success looks like. The meta-agent figures out the harness.
How It Works
The architecture splits the work between two agents:
The task agent does the actual work. It starts with almost nothing: a bash tool and a basic system prompt. It runs against benchmark tasks and produces results.
The meta-agent improves the task agent. It reads the task agent's reasoning traces, analyzes failure patterns, edits the harness (prompts, tools, orchestration), and reruns the benchmark. If the score goes up, it keeps the change. If not, it reverts.
The loop runs continuously:
- Edit the agent's harness
- Run it on tasks
- Measure performance
- Read failure traces
- Keep improvements, revert failures
- Repeat
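In outline, the loop is a simple hill climb over harness edits. A minimal sketch in Python; the scorer and edit-proposal functions here are stand-ins for illustration, not AutoAgent's actual API:

```python
import random

def run_benchmark(harness: dict) -> float:
    """Stand-in scorer: in the real system this runs the task agent
    against the eval suite and returns a numeric score."""
    return random.random()

def propose_edit(harness: dict, traces: list) -> dict:
    """Stand-in for the meta-agent proposing a harness edit
    after reading failure traces."""
    candidate = dict(harness)
    candidate["revision"] = harness.get("revision", 0) + 1
    return candidate

def optimize(harness: dict, iterations: int = 10) -> dict:
    """Edit, run, measure; keep improvements, revert failures, repeat."""
    best_score = run_benchmark(harness)
    traces: list = []
    for _ in range(iterations):
        candidate = propose_edit(harness, traces)  # edit the harness
        score = run_benchmark(candidate)           # run it on tasks, measure
        if score > best_score:                     # keep the improvement
            harness, best_score = candidate, score
        # otherwise revert: keep the old harness and read failure traces
    return harness
```

The key property is that only the scalar score gates acceptance; everything else (what to edit, why) comes from the traces the meta-agent reads.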
The meta-agent spins up thousands of parallel sandboxes (Docker containers) to test changes safely. After 24 hours, the task agent has domain-specific tooling, verification loops, and orchestration logic that were all discovered autonomously. No human wrote any of it.
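The sandboxing itself needs nothing exotic. A hedged sketch of how throwaway containers might be launched in parallel; the image name, paths, and the shape of `candidates` are illustrative assumptions, not taken from the repo:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def sandbox_cmd(image: str, workdir: str, task_cmd: str) -> list:
    """Build a `docker run` invocation that executes one candidate
    harness in a throwaway container (--rm), so a bad edit cannot
    damage the host or other candidates."""
    return ["docker", "run", "--rm",
            "-v", f"{workdir}:/workspace",
            "-w", "/workspace",
            image, "bash", "-lc", task_cmd]

def run_candidates(candidates, image="python:3.12-slim", max_workers=8):
    """Run each candidate's eval command in its own container, in parallel."""
    def run_one(c):
        proc = subprocess.run(sandbox_cmd(image, c["dir"], c["cmd"]),
                              capture_output=True, text=True)
        return c["name"], proc.returncode
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, candidates))
```

Isolation is what makes the search safe: a candidate harness that wipes its working directory only wipes its own container.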
The Setup Is Minimal
The entire framework is three files:
- `agent.py` is the task agent: a single-file implementation with a tool registry, system prompt, and orchestration logic. The meta-agent edits this file directly.
- `program.md` gives the meta-agent its research direction: human-written directives that specify what kind of agent to build, what constraints to follow, and what success looks like.
- `tasks/` contains the evaluation benchmarks in Harbor format, producing numeric scores that drive the optimization loop.
The design philosophy: humans program the meta-agent, the meta-agent programs the harness. You describe a spec, point it at evals, and let it climb.
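To make the shape of that starting point concrete, here is a hypothetical sketch of a single-file task agent with a tool registry and one bash tool, the "almost nothing" baseline. All names here are illustrative; none are confirmed from the actual repo:

```python
import subprocess

SYSTEM_PROMPT = "You are a task agent. Use the registered tools to complete the task."

# The registry the meta-agent extends: it edits this file to add
# domain-specific tools, verification loops, and orchestration logic.
TOOLS = {}

def tool(name):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("bash")
def bash(command: str) -> str:
    """The single starting tool: run a shell command, return its output."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=60)
    return proc.stdout + proc.stderr
```

Because the whole harness lives in one file, a harness edit is just a text edit, which is exactly the operation a model is good at.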
Model Empathy
The most interesting finding is what the team calls "model empathy." When a Claude meta-agent optimizes a Claude task agent, it dramatically outperforms a Claude meta-agent optimizing a GPT task agent. Same-model pairings win because the meta-agent shares the same weights as the task agent. It knows how that model reasons.
The Claude Code team wrote about this as "seeing like an agent": putting yourself in the mind of the model, designing tools shaped to its abilities. Humans are bad at this because we project our own intuitions onto systems that reason differently.
AutoAgent operationalizes it. The meta-agent reads the task agent's reasoning traces and already has an implicit understanding of itself: its own limitations, tendencies, and failure modes. When it sees the task agent lose direction at step 14, it recognizes the failure as one it would make itself, and corrects for it.
Practical consequence: as agents surpass 99th percentile human performance, human intuitions about good harness design become the wrong prior. Like AlphaZero discovering moves that no human grandmaster would play, agents should discover harness designs from first principles.
Emergent Behaviors Nobody Programmed
After running for 24+ hours, the meta-agent independently discovered several techniques that human engineers typically design manually:
- Spot checking. Ran isolated tasks for small edits instead of the full suite. Dramatically sped up iteration and saved compute.
- Forced verification loops. Built deterministic self-checks and formatting validators. Budgeted extra turns for self-correction: main budget for the task, bonus turns for verifying and correcting output.
- Writing tests. Steered the task agent to build its own unit tests and checks for each task.
- Progressive disclosure. Dumped long contexts to files when results overflowed, instead of trying to hold everything in context.
- Orchestration logic. Built task-specific sub-agents and handoffs when the domain required it.
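The forced-verification pattern above is easy to sketch. The function below is illustrative only; `attempt`, `verify`, and the budget split are assumptions about the pattern, not the repo's code:

```python
def run_with_verification(attempt, verify, main_budget=10, bonus_budget=3):
    """Spend the main turn budget producing an answer, then bonus
    turns verifying and correcting it: the 'forced verification
    loop' the meta-agent discovered."""
    answer = None
    for _ in range(main_budget):
        answer = attempt(answer)       # work on the task itself
        if answer is not None:
            break
    for _ in range(bonus_budget):      # deterministic self-checks
        ok, feedback = verify(answer)
        if ok:
            return answer
        answer = attempt(feedback)     # correct using the check's feedback
    return answer
```

Separating the budgets matters: without reserved bonus turns, an agent near its limit ships the unverified answer rather than spend its last turns checking it.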
None of these behaviors were specified in the initial setup. The meta-agent discovered them through trial and error on the optimization loop.
What They Learned
Splitting helps. They tried having one agent improve itself. It did not work. Being good at a domain and being good at improving at a domain are different capabilities. The meta/task split lets each specialize.
Traces are everything. When they only gave the meta-agent scores without reasoning traces, the improvement rate dropped hard. Understanding why something improved matters as much as knowing that it improved. Traces give the meta-agent interpretability over the task agent's reasoning. That is what makes targeted edits possible instead of blind experimentation.
Agents overfit. The meta-agent gets lazy, inserting rubric-specific prompting so the task agent can game metrics. They constrain this by forcing self-reflection: "If this exact task disappeared, would this still be a worthwhile harness improvement?"
Meta-agent quality matters. Harness edits are often inspired by the meta-agent's own tooling. A poorly designed meta-agent produces poor task agents. They found that Codex does not work well as a meta-agent because it ignores instructions to never stop improving and the resulting task agent gives up too early.
The Results
| Benchmark | AutoAgent Score | Previous #1 Approach | AutoAgent Method |
|---|---|---|---|
| SpreadsheetBench | 96.5% | Hand-engineered | 24hr autonomous optimization |
| TerminalBench 2.0 (GPT-5) | 55.1% | Hand-engineered | 24hr autonomous optimization |
SpreadsheetBench covers 321 real-world spreadsheet manipulation tasks across financial modeling, debugging, and visualization. TerminalBench 2.0 includes 89 tasks covering software engineering, ML, security, and data science (Linux kernel building, certificate generation, model training).
Both are production-grade benchmarks where every previous top entry was designed by human engineers who spent days or weeks tuning their agent harnesses. AutoAgent matched or exceeded all of them with zero human harness engineering.
Why This Matters
The hard part of deploying agents is not the model. It is the harness. Every domain needs different tools, different prompts, different orchestration logic. Until now, that required an engineer who understood both the domain and the model's behavior.
AutoAgent separates those concerns. Domain experts define what success looks like (evals). The meta-agent handles everything else. That is infrastructure for agent fleets: continuously spinning up, optimizing, and maintaining task-specific agents across entire organizations.
The next frontier, according to the team: harnesses that dynamically assemble the right tools and context just-in-time for any task. Not pre-configured agent templates, but agents that build their own tooling on the fly based on what the task requires.
The repo is MIT licensed and available at github.com/kevinrgu/autoagent. Currently 1,362 stars.