# autocontext

by greyhaven-ai

A recursive, self-improving harness designed to help your agents (and future iterations of those agents) succeed on any task.
# Add to your Claude Code skills
```bash
git clone https://github.com/greyhaven-ai/autocontext
```

Autocontext is a harness. You point it at a goal in plain language. It iterates against real evaluation, keeps what worked, throws out what didn't, and produces a structured trace of the work plus the artifacts, playbooks, datasets, and (optionally) a distilled local model that the next agent inherits. Repeated runs get better, not just different.
The fastest path uses our Pi runtime, a local coding agent that handles its own auth. No API key plumbing, no provider config: install Pi, install autocontext, point one at the other.
```bash
uv tool install autocontext==0.5.0

AUTOCONTEXT_AGENT_PROVIDER=pi \
AUTOCONTEXT_PI_COMMAND=pi \
uv run autoctx solve \
  "improve customer-support replies for billing disputes" \
  --iterations 3
```
Pi runs locally as a subprocess and emits live traces back into the harness. For a hosted Pi, set AUTOCONTEXT_AGENT_PROVIDER=pi-rpc and AUTOCONTEXT_PI_RPC_ENDPOINT instead.
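If you script runs rather than typing them, the same switch works from Python. A minimal sketch, assuming only the two variables named above; the endpoint URL is a placeholder:

```python
import os
import subprocess

# Point the harness at a hosted Pi runtime instead of a local subprocess.
# AUTOCONTEXT_AGENT_PROVIDER and AUTOCONTEXT_PI_RPC_ENDPOINT are the documented
# variables; the endpoint value here is illustrative only.
env = {
    **os.environ,
    "AUTOCONTEXT_AGENT_PROVIDER": "pi-rpc",
    "AUTOCONTEXT_PI_RPC_ENDPOINT": "https://pi.internal.example:8080",
}
subprocess.run(
    ["uv", "run", "autoctx", "solve",
     "improve customer-support replies for billing disputes",
     "--iterations", "3"],
    env=env,
    check=True,
)
```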
Prefer TypeScript? Same surface, same command:
```bash
bun add -g autoctx@0.5.0

AUTOCONTEXT_AGENT_PROVIDER=pi bunx autoctx solve \
  "improve customer-support replies for billing disputes" \
  --iterations 5 --json
```
Already on Anthropic, OpenAI, Gemini, Mistral, Groq, OpenRouter, Azure, Claude CLI, Codex CLI, or MLX? Set AUTOCONTEXT_AGENT_PROVIDER and the matching credential env var:
```bash
AUTOCONTEXT_AGENT_PROVIDER=anthropic \
ANTHROPIC_API_KEY=sk-ant-... \
uv run autoctx solve "..." --iterations 3
```
See `.env.example` for every provider's variables. Prefer to clone and run a starter? `examples/README.md` has copy-paste recipes for Python CLI, Claude Code MCP, Python SDK, and TypeScript library usage.
If you already work inside a coding agent, you can wire autocontext in once and give the agent a natural-language entry point. Hermes and other terminal-capable agents should start with the CLI-backed skill; MCP remains available for clients that want a tool-catalog protocol.
Pi ships an autocontext skill out of the box. Install the published Pi package, and Pi loads natural-language wrappers over live tools such as `autocontext_solve_scenario`, `autocontext_evaluate_output`, `autocontext_run_improvement_loop`, `autocontext_run_status`, and `autocontext_list_scenarios`.

```bash
pi install npm:pi-autocontext
```
Then you just ask:
"Solve: improve customer-support replies for billing disputes."
"Judge this output against this rubric and improve it until it scores 0.85."
Claude Code (and any other MCP client) gets the same surface by adding one entry to `.claude/settings.json`:

```json
{
  "mcpServers": {
    "autocontext": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/autocontext", "autoctx", "mcp-serve"],
      "env": { "AUTOCONTEXT_AGENT_PROVIDER": "pi", "AUTOCONTEXT_PI_COMMAND": "pi" }
    }
  }
}
```
After that, the Python MCP server exposes prefixed tools such as `autocontext_solve_scenario`, `autocontext_evaluate_output`, `autocontext_run_improvement_loop`, `autocontext_run_status`, `autocontext_list_scenarios`, `autocontext_export_skill`, and `autocontext_search_strategies`. It also exposes runtime-session readers (`autocontext_list_runtime_sessions`, `autocontext_get_runtime_session`, and `autocontext_get_runtime_session_timeline`) with unprefixed aliases for parity with TypeScript MCP; Python runtime-backed run and solve role calls populate those logs automatically. The TypeScript package exposes the same capabilities, with its documented tool names, via `bunx autoctx mcp-serve`.
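As a sanity check, any MCP client can enumerate that tool surface. A minimal sketch using the official MCP Python SDK (the `mcp` package, an assumption; the server settings mirror the JSON above):

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import get_default_environment, stdio_client

# Launch the autocontext MCP server over stdio, mirroring the settings above.
params = StdioServerParameters(
    command="uv",
    args=["run", "--directory", "/path/to/autocontext", "autoctx", "mcp-serve"],
    env={
        **get_default_environment(),
        "AUTOCONTEXT_AGENT_PROVIDER": "pi",
        "AUTOCONTEXT_PI_COMMAND": "pi",
    },
)

async def main() -> None:
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Expect the autocontext_* names documented above.
            print(sorted(tool.name for tool in tools.tools))

asyncio.run(main())
```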
Hermes Agent can load a CLI-first skill and inspect Hermes Curator state without MCP:
```bash
cd autocontext
uv run autoctx hermes export-skill --output ~/.hermes/skills/autocontext/SKILL.md --json
uv run autoctx hermes inspect --json
```

Full integration guide: `autocontext/docs/agent-integration.md`.
Every run leaves a structured record on disk. Replay it, diff it, export it, feed it back into training.
```text
runs/<run_id>/
├── trace.jsonl            # every prompt, tool call, and outcome, in order
├── generations/
│   ├── gen_1/
│   │   ├── strategy.json  # what the competitor proposed
│   │   ├── analysis.md    # what the analyst observed
│   │   └── score.json     # how it was evaluated
│   └── gen_2/ ...
├── report.md              # human-readable summary of the whole run
└── artifacts/             # files, configs, packages the run produced

knowledge/<scenario>/
├── playbook.md            # accumulated lessons that carried forward
├── hints.md               # competitor hints that survived the curator
└── tools/                 # any helper tools the architect generated
```
A `playbook.md` is plain markdown that the next run reads as context:

```markdown
<!-- PLAYBOOK_START -->
## Billing dispute replies
- Always restate the disputed charge in the first sentence; refunds requested without
  explicit confirmation cause loops.
- "Pending" charges are not yet billable. Don't promise a refund until status flips
  to `posted`. Verified gen_4, regressed in gen_7 when omitted.
- Empathy + specific next step beats empathy alone. Escalation rate dropped from
  0.31 to 0.12 once the second sentence named the next-step owner.
<!-- PLAYBOOK_END -->
```
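Because the markers are stable, other tooling can slice the playbook body into any context window. A minimal sketch, assuming a scenario directory named `support_triage` (the scenario name used later in this README):

```python
from pathlib import Path

# Slice the playbook body out of the marker-delimited file shown above.
text = Path("knowledge/support_triage/playbook.md").read_text()
start = text.index("<!-- PLAYBOOK_START -->") + len("<!-- PLAYBOOK_START -->")
end = text.index("<!-- PLAYBOOK_END -->")
playbook = text[start:end].strip()
print(playbook)
```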
A `trace.jsonl` line is one event:

```json
{
  "ts": "2026-04-28T17:42:11Z",
  "gen": 4,
  "role": "competitor",
  "event": "strategy_proposed",
  "score": 0.78,
  "tokens_in": 1840,
  "tokens_out": 612,
  "strategy_id": "s_4f2a"
}
```
Inspect, replay, or compare any of it:
```bash
uv run autoctx list
uv run autoctx status <run_id>
uv run autoctx replay <run_id> --generation 2
```
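The JSONL format also makes ad-hoc analysis easy. A minimal sketch that tracks the best competitor score per generation, using only the fields shown above (the run id is a placeholder):

```python
import json
from collections import defaultdict
from pathlib import Path

# Best competitor score per generation, straight from the trace.
best: dict[int, float] = defaultdict(float)
with Path("runs/run_123/trace.jsonl").open() as f:  # placeholder run id
    for line in f:
        event = json.loads(line)
        if event.get("role") == "competitor" and event.get("score") is not None:
            best[event["gen"]] = max(best[event["gen"]], event["score"])

for gen, score in sorted(best.items()):
    print(f"gen {gen}: best score {score:.2f}")
```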
Inside each run, five roles cooperate. Strategies are evaluated through scenario execution, staged validation, and gating. Weak changes are rolled back. Successful changes accumulate as reusable knowledge that future runs (and future agents) inherit automatically.
The full vocabulary (Scenario, Task, Mission, Campaign, Run, Verifier, Knowledge, Artifact, Budget, Policy) lives in `docs/concept-model.md`.
Autocontext can sit alongside your live application and record what your agents do, then turn that into training data. Wrap your existing Anthropic or OpenAI client once:
```python
from anthropic import Anthropic
from autocontext.production_traces import instrument_client

client = instrument_client(Anthropic(), app="billing-bot", env="prod")
# Use `client` exactly like before; calls are captured to JSONL with content
# blocks, cache-aware usage, and Anthropic-native outcome taxonomy.
```
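The wrapped client is a drop-in replacement. A minimal usage sketch (the model id and prompt are placeholders, not part of autocontext's docs):

```python
from anthropic import Anthropic
from autocontext.production_traces import instrument_client

client = instrument_client(Anthropic(), app="billing-bot", env="prod")

# Calls go through exactly as before; capture happens transparently.
reply = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=512,
    messages=[{"role": "user", "content": "Customer disputes a $42 pending charge."}],
)
print(reply.content[0].text)
```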
The same wrapper in TypeScript:

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { instrumentClient } from "autoctx/production-traces";

const client = instrumentClient(new Anthropic(), { app: "billing-bot", env: "prod" });
```
Then build scoped datasets from the captured traces:
```bash
uv run autoctx build-dataset \
  --app billing-bot --provider anthropic \
  --env prod --outcome success \
  --output training/billing.jsonl
```
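Before training, it's worth eyeballing what the build step produced. A minimal sketch; the record schema isn't documented here, so this only counts records and peeks at the keys:

```python
import json
from pathlib import Path

# Inspect the dataset emitted by `autoctx build-dataset` above.
records = [json.loads(line) for line in Path("training/billing.jsonl").open()]
print(f"{len(records)} records")
if records:
    print("first record keys:", sorted(records[0].keys()))
```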
And distill them into a smaller local model with MLX (Apple Silicon) or CUDA (Linux GPUs):
```bash
uv run autoctx train --scenario support_triage --data training/billing.jsonl --time-budget 300
```
| If you want to... | Start here |
| --- | --- |
| Run the full multi-generation control plane (Python) | autocontext/README.md |
| Run from Node, or operate missions, simula… | |