Codex Autoresearch Skill -- a self-directed iterative system for Codex that continuously cycles through modify, verify, keep or discard, and repeat. Inspired by Karpathy's autoresearch concept.
# Add to your Codex skills
```shell
git clone https://github.com/leo-lilinxiao/codex-autoresearch
```

Autonomous goal-directed iteration. Modify -> Verify -> Keep/Discard -> Repeat.
Route each request to one of the modes below -- loop, plan, debug, fix, security, ship, or exec -- and parse any inline config from the prompt.

Reference loading rules:

- Always load references/core-principles.md and references/structured-output-spec.md.
- For active execution modes (loop, debug, fix, security, ship, exec), also load references/runtime-hard-invariants.md.
- Load references/session-resume-protocol.md when resuming or controlling an existing run.
- Load references/environment-awareness.md before choosing hardware-sensitive work.
- Load references/interaction-wizard.md for every new interactive launch (loop, debug, fix, security, ship) before execution begins.
- Load references/results-logging.md only when debugging TSV/state semantics or helper behavior directly.
- Load the remaining protocol references on demand (lessons, pivot, health-check, parallel, web-search, hypothesis-perspectives).

Always call helper scripts via the skill-bundle path (`<skill-root>/scripts/...`), not the target repo root. In the common repo-local install this means commands such as `python3 .agents/skills/codex-autoresearch/scripts/autoresearch_init_run.py ...`. For repo-managed control-plane helpers (`autoresearch_resume_check.py`, `autoresearch_launch_gate.py`, `autoresearch_resume_prompt.py`, `autoresearch_supervisor_status.py`, `autoresearch_runtime_ctl.py status/stop`), prefer `--repo <repo>` and let the helper derive default artifact paths.

| Mode | Purpose | Primary Reference |
|------|---------|-------------------|
| loop | Run the autonomous improvement loop | references/loop-workflow.md |
| plan | Convert a vague goal into a launch-ready config | references/plan-workflow.md |
| debug | Hunt bugs with evidence and hypotheses | references/debug-workflow.md |
| fix | Iteratively reduce errors to zero | references/fix-workflow.md |
| security | Run a structured security audit | references/security-workflow.md |
| ship | Gate and execute a ship workflow | references/ship-workflow.md |
| exec | Non-interactive CI/CD mode with JSON output | references/exec-workflow.md |
Use `Mode: <name>` in the prompt to force a specific subworkflow.
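For instance, a prompt that forces the security subworkflow might look like this (the goal sentence here is purely illustrative):

```
$codex-autoresearch
Mode: security
Audit the auth middleware before the next release
```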
For the generic loop, the following fields are needed internally. Codex infers them from the user's natural language input and repo context, then fills gaps through guided conversation:
Required:

- Goal
- Scope
- Metric
- Direction
- Verify

Optional but recommended:

- Guard
- Iterations
- Run tag
- Stop condition

For every new interactive run, use the wizard contract in references/interaction-wizard.md.
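As a sketch, a confirmed configuration for the TypeScript example used later in this document might read as follows (the Iterations, Run tag, and Stop condition values are illustrative, not defaults):

```
Goal:           eliminate `any` types in src/**/*.ts
Scope:          src/**/*.ts
Metric:         `any` count (current: 47)
Direction:      lower
Verify:         grep count + tsc --noEmit
Guard:          npm test
Iterations:     40
Run tag:        any-cleanup
Stop condition: metric reaches 0
```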
`$codex-autoresearch` is the only primary human-facing entrypoint.

Foreground runs use only the data-plane helpers (`autoresearch_init_run.py`, `autoresearch_record_iteration.py`, `autoresearch_select_parallel_batch.py`, `autoresearch_supervisor_status.py`) and do not create launch/runtime control artifacts.

Background runs use `autoresearch_runtime_ctl.py launch` to persist the confirmed launch manifest and start the detached runtime controller in one step. The runtime itself should execute non-interactive `codex exec` sessions with the generated runtime prompt supplied on stdin. This skill now defaults those detached sessions to `danger_full_access` (`--dangerously-bypass-approvals-and-sandbox`) unless the user explicitly asks for the sandboxed `workspace_write` path. If the mini-wizard outcome is "fresh start", call `autoresearch_runtime_ctl.py launch --fresh-start` so prior persistent run-control artifacts are archived as part of the same handoff.

When resuming, sync `autoresearch-state.json` internally before continuing. Background start already performs that sync automatically before it relaunches nested Codex sessions; `autoresearch_set_session_mode.py` remains an internal/scripted recovery helper, not a normal user-facing step.

Never let two runs write the same `research-results.tsv` / `autoresearch-state.json` artifacts at the same time.

For `status`, `stop`, or `resume` requests, stay on the same skill entry. `status` and `stop` apply to background runs only; foreground runs stay in the current session.

`exec` remains the advanced / CI path. It is fully specified upfront and does not use the interactive handoff.

For `loop`, `debug`, `fix`, `security`, and `ship`, ALWAYS scan the repo and ask at least one round of clarifying questions before the run starts. Load and follow references/interaction-wizard.md for every new interactive launch. The launch wizard must include an explicit run-mode choice: foreground or background. `exec` mode is the exception: it is fully configured upfront and must not stop for a launch question.
Background calls `autoresearch_runtime_ctl.py launch`, creating the confirmed launch manifest and detached runtime as a single script-level action. Detached sessions use the confirmed launch manifest's `execution_policy`; this skill defaults to `danger_full_access` unless the user explicitly asks for sandboxed `workspace_write`. If the chosen background path is a fresh start after recovery analysis, use `autoresearch_runtime_ctl.py launch --fresh-start` so stale persistent run-control artifacts are archived automatically. `exec` mode has no launch question; once safety checks pass, it begins immediately.

Once the user says "go" in either foreground or background mode, do not pause mid-run to ask anything -- not for clarification, not for confirmation, not for permission. If you encounter ambiguity during the loop, apply best practices and keep going. The user may be asleep.

Where the approved rollback strategy permits it, `git reset --hard HEAD~1` is allowed; otherwise use `git revert --no-edit HEAD`.

A bounded run also stops at `Iterations: N` (see references/autonomous-loop-protocol.md Stop Conditions for the full definition).

Use references/runtime-hard-invariants.md as the primary runtime checklist. Foreground's core persistent artifacts are `research-results.tsv` and `autoresearch-state.json`; lessons are helper-derived secondary output.

When repeated attempts fail, follow references/pivot-protocol.md instead of brute-force retrying.

Let the helper scripts own all writes to `research-results.tsv`, `autoresearch-state.json`, and runtime-control files. Always call them via the skill-bundle path (`<skill-root>/scripts/...`); never call bare `scripts/autoresearch_*.py` from the target repo root unless the skill bundle itself is actually installed there.

In `exec` mode, never leave repo-root `autoresearch-state.json` behind. If helper scripts need state, use the exec scratch path and explicitly clean it up before exit. When you use `autoresearch_init_run.py --mode exec ...` with the default repo-root artifact names, do not manually rename old `research-results.tsv` or `autoresearch-state.json`; the helper already archives them to the canonical `research-results.prev.tsv` and `autoresearch-state.prev.json` paths before it starts fresh.

After context compaction, re-read references/runtime-hard-invariants.md, references/core-principles.md, and the selected mode workflow from disk before the next iteration. Do not rely on memory of those documents after compaction.

At each health check, verify against references/runtime-hard-invariants.md. Use Phase 8.7 of references/autonomous-loop-protocol.md only for the detailed re-anchoring procedure. If any item fails, re-read all loaded runtime docs from disk before continuing.

Every mode should follow references/structured-output-spec.md.
Minimum requirement: in `exec`, emit only the machine-readable JSON payloads defined in references/exec-workflow.md.

Example invocations:

```
$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code
```

```
$codex-autoresearch
I want to make our API faster but I don't know where to start
```

```
$codex-autoresearch
pytest is failing, 12 tests broken after the refactor
```
Codex scans the repo, asks targeted questions to clarify your intent, asks you to choose foreground or background for interactive runs, then starts the loop. You never need to write key-value config.
Reference documents:

- references/core-principles.md
- references/runtime-hard-invariants.md
- references/loop-workflow.md
- references/autonomous-loop-protocol.md
- references/interaction-wizard.md
- references/structured-output-spec.md
- references/modes.md
- references/plan-workflow.md
- references/debug-workflow.md
- references/fix-workflow.md
- references/security-workflow.md
- references/ship-workflow.md
- references/exec-workflow.md
- references/results-logging.md
- references/lessons-protocol.md
- references/pivot-protocol.md
- references/web-search-protocol.md
- references/environment-awareness.md
- references/parallel-experiments-protocol.md
- references/session-resume-protocol.md
- references/health-check-protocol.md
- references/hypothesis-perspectives.md

1. Install:
```shell
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
```

Or use the skill installer in Codex:

```
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch
```
2. Open Codex in your project and say what you want:
```
$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code
```
3. Codex scans, confirms, then iterates autonomously:
Codex: I found 47 `any` occurrences across src/**/*.ts.
Confirmed:
- Target: eliminate `any` types in src/**/*.ts
- Metric: `any` count (current: 47), direction: lower
- Verify: grep + tsc --noEmit as guard
Need to confirm:
- Run mode: foreground or background?
- Run until all gone, or cap at N iterations?
Runtime checklist:
- baseline first, then initialize results/state
- record every completed experiment before the next one starts
Choose a run mode, then reply "go" to start, or tell me what to change.
For truly unattended runs, launch Codex with approvals / sandbox settings
that will not interrupt git commit or revert commands.
You: Background, go. Run overnight.
Codex: Starting background run -- baseline: 47. Detached runtime is now iterating.
Each improvement stacks. Each failure reverts. Everything is logged.
See INSTALL.md for more install options. See GUIDE.md for full operator's manual.
A Codex skill that runs a modify-verify-decide loop on your codebase. Each iteration makes one atomic change, verifies it against a mechanical metric, and keeps or discards the result. Progress accumulates in git; failures auto-revert. Best for unattended runs where you want Codex to keep pushing toward a measurable result for minutes, hours, or overnight.
Inspired by Karpathy's autoresearch principles, generalized beyond ML.
Karpathy's autoresearch proved that a simple loop -- modify, verify, keep or discard, repeat -- can push ML training from baseline to new highs overnight. codex-autoresearch generalizes that loop to everything in software engineering that has a number. Test coverage, type errors, performance latency, lint warnings -- if there is a metric, it can iterate autonomously.
+----------------------+
| Environment Probe | <-- detect CPU/GPU/RAM/toolchains
+----------+-----------+
|
+----------v-----------+
| Session Resume? | <-- inspect prior results/state
+----------+-----------+
|
+----------v-----------+
| Read Context | <-- scope + lessons + repo state
+----------+-----------+
|
+----------v-----------+
| Wizard Confirm | <-- goal/metric/verify/guard
| + choose run mode | + foreground or background
+----------+-----------+
|
+---------+---------+
| |
+---------v--------+ +-------v---------+
| Foreground run | | Background run |
| current session | | launch manifest |
| no runtime files | | + detached ctl |
+---------+--------+ +-------+---------+
| |
+---------+---------+
|
+----------v-----------+
| Shared Loop Core |
| baseline -> change |
| -> verify/guard -> |
| keep/discard/log |
+----------+-----------+
|
+----------v-----------+
| Supervisor Outcome | <-- continue / stop / needs_human
+----------------------+
Foreground and background share the same experiment protocol. The difference is only where the loop executes: the current Codex session for foreground, or the detached runtime controller for background. Unbounded runs continue until you interrupt them or another terminal condition is reached (goal/stop condition satisfied, soft-blocker handoff, or hard blocker). Bounded runs follow the same terminal conditions, but also stop at Iterations: N.
The runtime checklist stays intentionally small in both modes:

- baseline first, then initialize results/state
- record every completed experiment before the next one starts
In pseudocode:

```
PHASE 0: Probe environment, check for session resume
PHASE 1: Read context + lessons file
PHASE 2: Confirm config + choose foreground or background

IF foreground:
    run the loop in the current Codex session
ELSE background:
    write autoresearch-launch.json and start the detached runtime

SHARED LOOP (forever or N times):
  1.  Review current state + git history + results log + lessons
  2.  Pick ONE hypothesis (apply perspectives, filter by environment)
      -- or N hypotheses if parallel mode is active
  3.  Make ONE atomic change
  4.  git commit (before verification)
  5.  Run mechanical verification + guard
  6.  Improved -> keep (extract lesson). Worse -> approved rollback strategy. Crashed -> fix or skip.
  7.  Log the result
  8.  Health check (disk, git, verify health)
  9.  If 3+ discards -> REFINE; 5+ -> PIVOT; 2 PIVOTs -> web search
  10. Repeat until the stop condition, manual stop, needs_human, or the configured iteration cap.
```
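The shared loop core can be sketched in plain Python. This is an illustrative toy, not the skill's actual implementation: `measure` and `propose_change` stand in for the real verify command and atomic change, and git history plus the results log are reduced to an in-memory list.

```python
import random

def measure(state):
    """Mechanical metric: lower is better (toy stand-in for a verify command)."""
    return sum(state)

def propose_change(state):
    """One atomic change: nudge a single element (toy stand-in for a code edit)."""
    candidate = list(state)
    i = random.randrange(len(candidate))
    candidate[i] += random.choice([-1, 1])
    return candidate

def autoresearch_loop(state, iterations=100):
    best = measure(state)                    # baseline first
    log = []                                 # results log: one row per experiment
    for step in range(iterations):
        candidate = propose_change(state)    # modify
        score = measure(candidate)           # verify
        if score < best:                     # keep only strict improvements
            state, best = candidate, score
            decision = "keep"
        else:                                # discard = stay on the prior state
            decision = "discard"
        log.append((step, score, decision))  # record before the next experiment
    return state, best, log
```

In the real loop, `measure` shells out to the configured verify command, "keep" maps to a retained git commit, and "discard" maps to the approved rollback strategy.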
You say what you want in one sentence. Codex does the rest.
It scans your repo, proposes a plan, confirms with you, then iterates autonomously:
| You say | What happens |
|---------|-------------|
| "Improve my test coverage" | Scans repo, proposes metric, iterates until target or interrupted |
| "Fix the 12 failing tests" | Detects failures, repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses and evidence |
| "Is this code secure?" | Runs STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | Verifies readiness, generates checklist, gates release |
| "I want to optimize but don't know what to measure" | Analyzes repo, suggests metrics, generates launch-ready config |
Behind the scenes, Codex maps your sentence to one of 7 specialized modes (loop, plan, debug, fix, security, ship, exec). You never need to pick a mode -- just describe your goal.
Codex infers everything from your sentence and your repo. You never write config.
| What it needs | How it gets it | Example |
|--------------|----------------|---------|
| Goal | Your sentence | "get rid of all `any` types" |
| Scope | Scans repo structure | auto-discovers src/**/*.ts |
| Metric | Proposes based on goal + tooling | `any` count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify command | Matches to repo tooling | grep count + tsc --noEmit |
| Guard (optional) | Suggests if regression risk exists | npm test |
Before starting, Codex always shows you what it found and asks you to confirm. One round of confirmation minimum, up to five if needed. Then you choose foreground or background and say "go". Foreground keeps iterating in the current session; background hands off to detached runtime so you can walk away. For truly unattended runs, start Codex CLI with approvals / sandbox settings that will not interrupt git commit or revert commands. In a disposable or otherwise trusted repo, giving Codex fuller permissions is the simplest option. After launch, the most important execution rule is simple: every completed experiment must be recorded before the next one begins.
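The record-before-next rule can be pictured with a tab-separated results log like the one the skill maintains; the column names below are illustrative, not the skill's actual TSV schema.

```python
import csv
import io

def record_iteration(writer, step, metric, decision):
    # Append one row per completed experiment, before the next one starts.
    writer.writerow([step, metric, decision])

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["iteration", "metric", "decision"])  # header row
record_iteration(writer, 0, 47, "baseline")
record_iteration(writer, 1, 45, "keep")
record_iteration(writer, 2, 46, "discard")
print(buf.getvalue(), end="")
```

The point of the discipline is that an interrupted run can always be resumed from the last recorded row.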
If your goal has a structural requirement in addition to a metric threshold, Codex can also gate both retention and stopping on structured labels. For example: "only retain results that use the production-path, and stop only when latency <= 120 ms and the retained keep is labeled production-path and real-backend." This avoids both falsely retaining and falsely stopping on a numerically better result that does not satisfy the structural requirement.
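The dual gate can be sketched as follows; the function name is hypothetical, and the threshold and label set are taken from the latency example above, not from the skill's API.

```python
def should_stop(latency_ms, labels,
                threshold_ms=120.0,
                required_labels=frozenset({"production-path", "real-backend"})):
    """Stop only when the metric threshold AND the structural labels both hold."""
    return latency_ms <= threshold_ms and required_labels <= set(labels)

should_stop(110.0, ["production-path", "real-backend"])  # True: both gates pass
should_stop(110.0, ["mock-backend"])                     # False: fast, but wrong path
should_stop(150.0, ["production-path", "real-backend"])  # False: right path, too slow
```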