by evo-hq
turns your codebase into an autoresearch loop — discovers what to measure, instruments the benchmark, then runs tree search with parallel subagents.
# Add to your Claude Code skills
git clone https://github.com/evo-hq/evo
Get started with autoresearch on any codebase — with two simple commands.
Try it · Install · How it works · Dashboard · Upgrading
You give it a codebase. It discovers metrics to optimize, sets up the evaluation, and starts running experiments in a loop -- trying things, keeping what improves the score, throwing away what doesn't.
Inspired by Karpathy's autoresearch -- where an LLM runs training experiments autonomously to beat its own best score. Autoresearch is a pure hill climb: try something, keep or revert, repeat on a single branch. Evo adds structure on top of that idea:
The discover skill explores the repo, figures out what to measure, and instruments the evaluation.
Runs on Claude Code, Codex, Cursor, OpenClaw, Hermes, Opencode, or Pi. Experiments run locally or on remote sandboxes: Modal, E2B, Daytona, AWS, Azure, SSH.
Two commands:
/evo:discover # one-time code discovery: figures out benchmarks and creates gates against unintended changes
/evo:optimize # run the loop
discover asks what to optimize, the benchmark command, and the metric direction. Skip the questions by seeding the answer:
/evo:discover make the JSON parser at src/parser.py faster
Then run the loop:
/evo:optimize
evo sizes each round to your benchmark's resource profile — one experiment at a time when a run needs the whole GPU or another exclusive resource, wider when runs are independent — and keeps going until the score stops improving. By default it runs unattended and pushes edits through parallel subagents; say so in plain language if you'd rather it pause after each round or hold to one experiment at a time.
Invocation syntax is host-specific: /evo: on Claude Code, $evo on Codex, the / skill menu on Cursor, and natural language on Hermes, Opencode, OpenClaw, and Pi.
# 1. evo CLI
uv tool install evo-hq-cli
# 2. Host CLI (if you don't already have it)
npm install -g @anthropic-ai/claude-code # or @openai/codex, openclaw, @earendil-works/pi-coding-agent
# Cursor: install from cursor.com (IDE), or `curl https://cursor.com/install -fsS | bash` for the cursor-agent CLI
# 3. Plugin + host hooks
evo install <host> # claude-code | codex | cursor | hermes | opencode | openclaw | pi
evo install <host> installs the plugin into the host's marketplace and stages the hooks evo needs to talk to in-flight subagents. Verify with evo doctor <host>.
For remote backends, install with the matching provider extra: uv tool install 'evo-hq-cli[modal]' (or [e2b], [daytona], [aws], [azure], [all]).
Codex requires manual approval for plugin hooks. After install, run /hooks inside codex to trust evo's hooks — or pass --trust-hooks to evo install codex to skip the prompt.
The orchestrator dispatches subagents in parallel. Each runs in its own isolated workspace, picks up shared state (failure traces, annotations, discarded hypotheses), forms a hypothesis, edits, and runs the benchmark. A subagent with iteration budget remaining continues on its branch within the same round when its prior edit warrants a follow-up.
After each round, the orchestrator selects which committed branch to extend next. Available strategies:
Configure in the dashboard's Frontier tab, which lists each strategy's parameters.
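To make the round structure concrete, here is a minimal sketch of one round followed by branch selection. It is not evo's implementation: the `Node` class, `run_round`, and the greedy `select_greedy` strategy are hypothetical stand-ins, the benchmark is passed in as a plain function, and a higher score is assumed to be better.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One committed experiment in the tree (hypothetical structure)."""
    name: str
    score: float
    parent: Optional["Node"] = None
    children: list["Node"] = field(default_factory=list)


def run_round(parent: Node, proposals: list[str],
              benchmark: Callable[[str], float]) -> list[Node]:
    """Benchmark every proposal in parallel and commit each result under `parent`."""
    with ThreadPoolExecutor(max_workers=max(1, len(proposals))) as pool:
        scores = list(pool.map(benchmark, proposals))  # order matches `proposals`
    committed = [Node(name, score, parent) for name, score in zip(proposals, scores)]
    parent.children.extend(committed)
    return committed


def all_nodes(root: Node) -> list[Node]:
    """Depth-first listing of the whole tree, root included."""
    out = [root]
    for child in root.children:
        out.extend(all_nodes(child))
    return out


def select_greedy(root: Node) -> Node:
    """Stand-in strategy (an assumption): extend the highest-scoring node."""
    return max(all_nodes(root), key=lambda n: n.score)
```

Unlike a single-branch hill climb, the selection step looks at every committed node, so a later round can return to an older branch when a newer one stalls.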
Between rounds, RLM-inspired scan subagents read trace batches in parallel and surface compound failure patterns: gate-failure intersections, shared root causes across traces. Findings land in shared state, which the next round's subagents read at startup.
evo introduces gates: pass/fail checks that run on every experiment. An experiment that fails a gate is discarded even if its score beats the current best. Without gates, the search will find ways to return a constant, skip work, or trade correctness for speed.
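The keep-or-discard rule can be sketched in a few lines. The function name, the `gate_results` mapping, and the `higher_is_better` flag are illustrative assumptions, not evo's API.

```python
def should_keep(score: float, best: float, gate_results: dict[str, bool],
                higher_is_better: bool = True) -> bool:
    """Keep an experiment only if every gate passed and the score improved."""
    if not all(gate_results.values()):
        return False  # a failed gate discards the experiment regardless of score
    return score > best if higher_is_better else score < best
```

For example, `should_keep(9.0, 1.0, {"tests": False})` is `False` even though 9.0 beats the current best of 1.0.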
Any command that exits zero on pass and non-zero on fail qualifies as a gate: a test suite, an invariant script, a score floor on a held-out slice of the benchmark. Gates inherit down the experiment tree: a gate registered at the root runs on every descendant. Narrower gates can be attached to specific branches.
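The exit-code contract and the inheritance rule can be illustrated together. This is a sketch under assumptions: the `Experiment` class and its fields are hypothetical, and gates are modeled as argument lists run with `subprocess`.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Optional

# Stand-in gate command; a real gate might be a test suite or an invariant script.
ALWAYS_PASS = [sys.executable, "-c", "raise SystemExit(0)"]


@dataclass
class Experiment:
    """Hypothetical tree node; `gates` holds commands attached at this node."""
    name: str
    parent: Optional["Experiment"] = None
    gates: list[list[str]] = field(default_factory=list)

    def effective_gates(self) -> list[list[str]]:
        """Gates inherited from every ancestor (root first), then this node's own."""
        inherited = self.parent.effective_gates() if self.parent else []
        return inherited + self.gates


def gate_passes(command: list[str]) -> bool:
    """A gate passes when its command exits with status zero."""
    return subprocess.run(command, capture_output=True).returncode == 0


def passes_all_gates(exp: Experiment) -> bool:
    """Run every inherited and local gate; any non-zero exit fails the experiment."""
    return all(gate_passes(cmd) for cmd in exp.effective_gates())
```

A gate attached to `root` shows up in `effective_gates()` for every descendant, while a gate attached to one child applies only to that child's subtree.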
When discover builds a benchmark from scratch, it attaches a held-out-slice score-floor gate automatically. When the benchmark already exists in the repo, gates are opt-in.
| Backend | Where | Install |
|---|---|---|
| worktree (default) | local git worktree per experiment | included |
| pool | reuse a fixed set of local workspaces | included |
| ssh | your own SSH host | included |
| modal | Modal serverless cloud | uv tool install 'evo-hq-cli[modal]' |
| e2b | E2B cloud sandboxes | uv tool install 'evo-hq-cli[e2b]' |
| daytona | Daytona cloud workspaces | uv tool install 'evo-hq-cli[daytona]' |
| aws | AWS EC2 sandboxes | uv tool install 'evo-hq-cli[aws]' |
| azure | Azure VMs | uv tool install 'evo-hq-cli[azure]' |
Pick and configure in the dashboard's Backend tab.
The dashboard starts automatically with /evo:discover (or evo init) and prints the URL in chat:
Dashboard live: http://127.0.0.1:8080 (pid 12345)
If 8080 is in use, evo increments to the next free port (8081, 8082, …) and prints it. Subsequent runs reuse the chosen port. Start it manually with:
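The port fallback can be approximated as below. This is a sketch of the general technique, not evo's code: the function name and the `limit` parameter are assumptions.

```python
import socket


def next_free_port(start: int = 8080, host: str = "127.0.0.1", limit: int = 100) -> int:
    """Return the first port >= start that can currently be bound on host."""
    stop = min(start + limit, 65536)  # never probe past the valid port range
    for port in range(start, stop):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken, try the next one
            return port
    raise RuntimeError(f"no free port in {start}..{stop - 1}")
```

The probe socket is closed before the port is returned, so another process could claim the port in between. A server that needs a guarantee should bind once and keep that socket.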
uv run --project /path/to/evo/plugins/evo evo dashboard --port 8080
evo update # update CLI + every installed host
evo update <host> # update one host (also bumps CLI to match)
evo update <host> --version 0.4.1 # pin to a release
Every evo install / evo update keeps the CLI on PATH in lockstep with the host plugin version it just installed (uv tool install --force evo-hq-cli under the hood). Without a --version pin, the version resolves to the latest stable release, so running an unpinned evo install or evo update against a pre-release pulls the CLI back to stable. Pin both sides for an alpha (see Testing a pre-release). The CLI binary, the skill files, and the hook protocol share wire formats, and letting them drift caused silent failures in earlier versions. Editable installs (uv tool install --editable, pip install -e) are detected and left untouched.
See evo update --help for --force, --scope, and additional flags.
uv tool install --force evo-hq-cli && evo update --force
--force wipes the host plugin cache and reinstalls, working around anthropics/claude-code#14061: /plugin update returns success but does not replace cached plugin files.
uv and pip skip pre-releases by default. To install an alpha, pin both the CLI version and the host plugin tag:
uv tool install --force 'evo-hq-cli==0.4.1a2' && \
evo update --version 0.4.1-alpha.2 --force
Substitute the target alpha version. The CLI uses PEP 440 form (0.4.1a2); the marketplace tag uses the dash form (v0.4.1-alpha.2).
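If you script the pinning, the two spellings can be derived from one another. The helper below is a hypothetical sketch, not part of evo. It covers only the alpha form shown above, and it assumes that stable releases are tagged as `v` plus the version. Note that the `--version` argument in the command above is written without the leading `v`.

```python
import re

# Matches 'X.Y.Z' with an optional PEP 440 alpha suffix such as 'a2'.
_PEP440_ALPHA = re.compile(r"^(\d+\.\d+\.\d+)(?:a(\d+))?$")


def pep440_to_tag(version: str) -> str:
    """Map '0.4.1a2' to 'v0.4.1-alpha.2' and '0.4.1' to 'v0.4.1'."""
    match = _PEP440_ALPHA.match(version)
    if match is None:
        # Beta, rc, and other forms are not covered by this sketch.
        raise ValueError(f"unsupported version: {version!r}")
    release, alpha = match.groups()
    return f"v{release}" if alpha is None else f"v{release}-alpha.{alpha}"
```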
For development on evo:
git clone https://github.com/evo-hq/evo
cd evo
uv tool install --editable plugins/evo
Apache-2.0. See LICENSE.
If you use evo in your work, please cite it (see CITATION.cff):
@software{bishoyi_evo,
author = {Bishoyi, Alok Kumar},
title = {{evo: an autoresearch orchestrator for codebases}},
url = {https://github.com/evo-hq/evo},
doi = {10.5281/zenodo.20447923},
year = {2026}
}