Codex Autoresearch Skill -- a self-directed iterative system for Codex that continuously cycles through modify, verify, keep or discard, and repeat. Inspired by Karpathy's autoresearch concept.
# Add to your Codex skills
```shell
git clone https://github.com/leo-lilinxiao/codex-autoresearch
```

Autonomous goal-directed iteration. Modify -> Verify -> Keep/Discard -> Repeat.
Route each request to one of the modes below -- loop, plan, debug, fix, security, ship, or exec -- and parse any inline config from the prompt.

Reference loading rules:

- Always load references/core-principles.md and references/structured-output-spec.md.
- For active execution modes (loop, debug, fix, security, ship, exec), also load references/runtime-hard-invariants.md.
- Load references/session-resume-protocol.md when resuming or controlling an existing run.
- Load references/environment-awareness.md before choosing hardware-sensitive work.
- Load references/interaction-wizard.md for every new interactive launch (loop, debug, fix, security, ship) before execution begins.
- Load references/results-logging.md only when debugging TSV/state semantics or helper behavior directly.
- Load the remaining protocol references on demand (lessons, pivot, health-check, parallel, web-search, hypothesis-perspectives).

Always call helper scripts via the skill-bundle path (`<skill-root>/scripts/...`), not the target repo root. In the common repo-local install this means commands such as `python3 .agents/skills/codex-autoresearch/scripts/autoresearch_init_run.py ...`. For repo-managed control-plane helpers (`autoresearch_resume_check.py`, `autoresearch_launch_gate.py`, `autoresearch_resume_prompt.py`, `autoresearch_supervisor_status.py`, `autoresearch_runtime_ctl.py status/stop`), prefer `--repo <repo>` and let the helper derive default artifact paths.

| Mode | Purpose | Primary Reference |
|------|---------|-------------------|
| loop | Run the autonomous improvement loop | references/loop-workflow.md |
| plan | Convert a vague goal into a launch-ready config | references/plan-workflow.md |
| debug | Hunt bugs with evidence and hypotheses | references/debug-workflow.md |
| fix | Iteratively reduce errors to zero | references/fix-workflow.md |
| security | Run a structured security audit | references/security-workflow.md |
| ship | Gate and execute a ship workflow | references/ship-workflow.md |
| exec | Non-interactive CI/CD mode with JSON output | references/exec-workflow.md |
Use `Mode: <name>` in the prompt to force a specific subworkflow.
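For instance, a prompt that forces the security subworkflow might look like this (the goal sentence here is purely illustrative):

```
$codex-autoresearch
Mode: security
Audit the auth middleware before the next release
```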
For the generic loop, the following fields are needed internally. Codex infers them from the user's natural language input and repo context, then fills gaps through guided conversation:
Required:

- Goal
- Scope
- Metric
- Direction
- Verify

Optional but recommended:

- Guard
- Iterations
- Run tag
- Stop condition

For every new interactive run, use the wizard contract in references/interaction-wizard.md.
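As a sketch, a confirmed configuration for the TypeScript example used later in this document might read as follows (the Iterations, Run tag, and Stop condition values are illustrative, not defaults):

```
Goal:           eliminate `any` types in src/**/*.ts
Scope:          src/**/*.ts
Metric:         `any` count (current: 47)
Direction:      lower
Verify:         grep count + tsc --noEmit
Guard:          npm test
Iterations:     40
Run tag:        any-cleanup
Stop condition: metric reaches 0
```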
`$codex-autoresearch` is the only primary human-facing entrypoint.

Foreground runs use only the data-plane helpers (`autoresearch_init_run.py`, `autoresearch_record_iteration.py`, `autoresearch_select_parallel_batch.py`, `autoresearch_supervisor_status.py`) and do not create launch/runtime control artifacts.

Background runs use `autoresearch_runtime_ctl.py launch` to persist the confirmed launch manifest and start the detached runtime controller in one step. The runtime itself should execute non-interactive `codex exec` sessions with the generated runtime prompt supplied on stdin. This skill now defaults those detached sessions to `danger_full_access` (`--dangerously-bypass-approvals-and-sandbox`) unless the user explicitly asks for the sandboxed `workspace_write` path. If the mini-wizard outcome is "fresh start", call `autoresearch_runtime_ctl.py launch --fresh-start` so prior persistent run-control artifacts are archived as part of the same handoff.

When resuming, sync `autoresearch-state.json` internally before continuing. Background start already performs that sync automatically before it relaunches nested Codex sessions; `autoresearch_set_session_mode.py` remains an internal/scripted recovery helper, not a normal user-facing step.

Never let two runs write the same `research-results.tsv` / `autoresearch-state.json` artifacts at the same time.

For `status`, `stop`, or `resume` requests, stay on the same skill entry. `status` and `stop` apply to background runs only; foreground runs stay in the current session.

`exec` remains the advanced / CI path. It is fully specified upfront and does not use the interactive handoff.

For `loop`, `debug`, `fix`, `security`, and `ship`, ALWAYS scan the repo and ask at least one round of clarifying questions before the run starts. Load and follow references/interaction-wizard.md for every new interactive launch. The launch wizard must include an explicit run-mode choice: foreground or background. `exec` mode is the exception: it is fully configured upfront and must not stop for a launch question.
Background calls `autoresearch_runtime_ctl.py launch`, creating the confirmed launch manifest and detached runtime as a single script-level action. Detached sessions use the confirmed launch manifest's `execution_policy`; this skill defaults to `danger_full_access` unless the user explicitly asks for sandboxed `workspace_write`. If the chosen background path is a fresh start after recovery analysis, use `autoresearch_runtime_ctl.py launch --fresh-start` so stale persistent run-control artifacts are archived automatically. `exec` mode has no launch question; once safety checks pass, it begins immediately.

Once the user says "go" in either foreground or background mode, do not pause mid-run to ask anything -- not for clarification, not for confirmation, not for permission. If you encounter ambiguity during the loop, apply best practices and keep going. The user may be asleep.

Where the approved rollback strategy permits it, `git reset --hard HEAD~1` is allowed; otherwise use `git revert --no-edit HEAD`.

A bounded run also stops at `Iterations: N` (see references/autonomous-loop-protocol.md Stop Conditions for the full definition).

Use references/runtime-hard-invariants.md as the primary runtime checklist. Foreground's core persistent artifacts are `research-results.tsv` and `autoresearch-state.json`; lessons are helper-derived secondary output.

When repeated attempts fail, follow references/pivot-protocol.md instead of brute-force retrying.

Let the helper scripts own all writes to `research-results.tsv`, `autoresearch-state.json`, and runtime-control files. Always call them via the skill-bundle path (`<skill-root>/scripts/...`); never call bare `scripts/autoresearch_*.py` from the target repo root unless the skill bundle itself is actually installed there.

In `exec` mode, never leave repo-root `autoresearch-state.json` behind. If helper scripts need state, use the exec scratch path and explicitly clean it up before exit. When you use `autoresearch_init_run.py --mode exec ...` with the default repo-root artifact names, do not manually rename old `research-results.tsv` or `autoresearch-state.json`; the helper already archives them to the canonical `research-results.prev.tsv` and `autoresearch-state.prev.json` paths before it starts fresh.

After context compaction, re-read references/runtime-hard-invariants.md, references/core-principles.md, and the selected mode workflow from disk before the next iteration. Do not rely on memory of those documents after compaction.

At each health check, verify against references/runtime-hard-invariants.md. Use Phase 8.7 of references/autonomous-loop-protocol.md only for the detailed re-anchoring procedure. If any item fails, re-read all loaded runtime docs from disk before continuing.

Every mode should follow references/structured-output-spec.md.
Minimum requirement: in `exec`, emit only the machine-readable JSON payloads defined in references/exec-workflow.md.

Example invocations:

```
$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code
```

```
$codex-autoresearch
I want to make our API faster but I don't know where to start
```

```
$codex-autoresearch
pytest is failing, 12 tests broken after the refactor
```
Codex scans the repo, asks targeted questions to clarify your intent, asks you to choose foreground or background for interactive runs, then starts the loop. You never need to write key-value config.
Reference documents:

- references/core-principles.md
- references/runtime-hard-invariants.md
- references/loop-workflow.md
- references/autonomous-loop-protocol.md
- references/interaction-wizard.md
- references/structured-output-spec.md
- references/modes.md
- references/plan-workflow.md
- references/debug-workflow.md
- references/fix-workflow.md
- references/security-workflow.md
- references/ship-workflow.md
- references/exec-workflow.md
- references/results-logging.md
- references/lessons-protocol.md
- references/pivot-protocol.md
- references/web-search-protocol.md
- references/environment-awareness.md
- references/parallel-experiments-protocol.md
- references/session-resume-protocol.md
- references/health-check-protocol.md
- references/hypothesis-perspectives.md

1. Install:
```shell
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
```

Or use the skill installer in Codex:

```
$skill-installer install https://github.com/leo-lilinxiao/codex-autoresearch
```
2. Open Codex in your project and say what you want:
```
$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code
```
3. Codex scans, confirms, then iterates autonomously:
Codex: I found 47 `any` occurrences across src/**/*.ts.
Confirmed:
- Target: eliminate `any` types in src/**/*.ts
- Metric: `any` count (current: 47), direction: lower
- Verify: grep + tsc --noEmit as guard
Need to confirm:
- Run mode: foreground or background?
- Run until all gone, or cap at N iterations?
Runtime checklist:
- baseline first, then initialize results/state
- record every completed experiment before the next one starts
Choose a run mode, then reply "go" to start, or tell me what to change.
For truly unattended runs, launch Codex with approvals / sandbox settings
that will not interrupt git commit or revert commands.
You: Background, go. Run overnight.
Codex: Starting background run -- baseline: 47. Detached runtime is now iterating.
Each improvement stacks. Each failure reverts. Everything is logged.
See INSTALL.md for more install options. See GUIDE.md for full operator's manual.
A Codex skill that runs a modify-verify-decide loop on your codebase. Each iteration makes one atomic change, verifies it against a mechanical metric, and keeps or discards the result. Progress accumulates in git; failures auto-revert. Best for unattended runs where you want Codex to keep pushing toward a measurable result for minutes, hours, or overnight.
Inspired by Karpathy's autoresearch principles, generalized beyond ML.
Karpathy's autoresearch proved that a simple loop -- modify, verify, keep or discard, repeat -- can push ML training from baseline to new highs overnight. codex-autoresearch generalizes that loop to everything in software engineering that has a number. Test coverage, type errors, performance latency, lint warnings -- if there is a metric, it can iterate autonomously.
+----------------------+
| Environment Probe | <-- detect CPU/GPU/RAM/toolchains
+----------+-----------+
|
+----------v-----------+
| Session Resume? | <-- inspect prior results/state
+----------+-----------+
|
+----------v-----------+
| Read Context | <-- scope + lessons + repo state
+----------+-----------+
|
+----------v-----------+
| Wizard Confirm | <-- goal/metric/verify/guard
| + choose run mode | + foreground or background
+----------+-----------+
|
+---------+---------+
| |
+---------v--------+ +-------v---------+
| Foreground run | | Background run |
| current session | | launch manifest |
| no runtime files | | + detached ctl |
+---------+--------+ +-------+---------+
| |
+---------+---------+
|
+----------v-----------+
| Shared Loop Core |
| baseline -> change |
| -> verify/guard -> |
| keep/discard/log |
+----------+-----------+
|
+----------v-----------+
| Supervisor Outcome | <-- continue / stop / needs_human
+----------------------+
Foreground and background share the same experiment protocol. The difference is only where the loop executes: the current Codex session for foreground, or the detached runtime controller for background. Unbounded runs continue until you interrupt them or another terminal condition is reached (goal/stop condition satisfied, soft-blocker handoff, or hard blocker). Bounded runs follow the same terminal conditions, but also stop at Iterations: N.
The runtime checklist stays intentionally small in both modes:

- baseline first, then initialize results/state
- record every completed experiment before the next one starts
In pseudocode:

```
PHASE 0: Probe environment, check for session resume
PHASE 1: Read context + lessons file
PHASE 2: Confirm config + choose foreground or background

IF foreground:
    run the loop in the current Codex session
ELSE background:
    write autoresearch-launch.json and start the detached runtime

SHARED LOOP (forever or N times):
  1.  Review current state + git history + results log + lessons
  2.  Pick ONE hypothesis (apply perspectives, filter by environment)
      -- or N hypotheses if parallel mode is active
  3.  Make ONE atomic change
  4.  git commit (before verification)
  5.  Run mechanical verification + guard
  6.  Improved -> keep (extract lesson). Worse -> approved rollback strategy. Crashed -> fix or skip.
  7.  Log the result
  8.  Health check (disk, git, verify health)
  9.  If 3+ discards -> REFINE; 5+ -> PIVOT; 2 PIVOTs -> web search
  10. Repeat until the stop condition, manual stop, needs_human, or the configured iteration cap.
```
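The shared loop core can be sketched in plain Python. This is an illustrative toy, not the skill's actual implementation: `measure` and `propose_change` stand in for the real verify command and atomic change, and git history plus the results log are reduced to an in-memory list.

```python
import random

def measure(state):
    """Mechanical metric: lower is better (toy stand-in for a verify command)."""
    return sum(state)

def propose_change(state):
    """One atomic change: nudge a single element (toy stand-in for a code edit)."""
    candidate = list(state)
    i = random.randrange(len(candidate))
    candidate[i] += random.choice([-1, 1])
    return candidate

def autoresearch_loop(state, iterations=100):
    best = measure(state)                    # baseline first
    log = []                                 # results log: one row per experiment
    for step in range(iterations):
        candidate = propose_change(state)    # modify
        score = measure(candidate)           # verify
        if score < best:                     # keep only strict improvements
            state, best = candidate, score
            decision = "keep"
        else:                                # discard = stay on the prior state
            decision = "discard"
        log.append((step, score, decision))  # record before the next experiment
    return state, best, log
```

In the real loop, `measure` shells out to the configured verify command, "keep" maps to a retained git commit, and "discard" maps to the approved rollback strategy.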
You say what you want in one sentence. Codex does the rest.
It scans your repo, proposes a plan, confirms with you, then iterates autonomously:
| You say | What happens |
|---------|-------------|
| "Improve my test coverage" | Scans repo, proposes metric, iterates until target or interrupted |
| "Fix the 12 failing tests" | Detects failures, repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses and evidence |
| "Is this code secure?" | Runs STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | Verifies readiness, generates checklist, gates release |
| "I want to optimize but don't know what to measure" | Analyzes repo, suggests metrics, generates launch-ready config |
Behind the scenes, Codex maps your sentence to one of 7 specialized modes (loop, plan, debug, fix, security, ship, exec). You never need to pick a mode -- just describe your goal.
Codex infers everything from your sentence and your repo. You never write config.
| What it needs | How it gets it | Example |
|--------------|----------------|---------|
| Goal | Your sentence | "get rid of all `any` types" |
| Scope | Scans repo structure | auto-discovers src/**/*.ts |
| Metric | Proposes based on goal + tooling | `any` count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify command | Matches to repo tooling | grep count + tsc --noEmit |
| Guard (optional) | Suggests if regression risk exists | npm test |
Before starting, Codex always shows you what it found and asks you to confirm. One round of confirmation minimum, up to five if needed. Then you choose foreground or background and say "go". Foreground keeps iterating in the current session; background hands off to detached runtime so you can walk away. For truly unattended runs, start Codex CLI with approvals / sandbox settings that will not interrupt git commit or revert commands. In a disposable or otherwise trusted repo, giving Codex fuller permissions is the simplest option. After launch, the most important execution rule is simple: every completed experiment must be recorded before the next one begins.
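The record-before-next rule can be pictured with a tab-separated results log like the one the skill maintains; the column names below are illustrative, not the skill's actual TSV schema.

```python
import csv
import io

def record_iteration(writer, step, metric, decision):
    # Append one row per completed experiment, before the next one starts.
    writer.writerow([step, metric, decision])

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["iteration", "metric", "decision"])  # header row
record_iteration(writer, 0, 47, "baseline")
record_iteration(writer, 1, 45, "keep")
record_iteration(writer, 2, 46, "discard")
print(buf.getvalue(), end="")
```

The point of the discipline is that an interrupted run can always be resumed from the last recorded row.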
If your goal has a structural requirement in addition to a metric threshold, Codex can also gate both retention and stopping on structured labels. For example: "only retain results that use the production-path, and stop only when latency <= 120 ms and the retained keep is labeled production-path and real-backend." This avoids both falsely retaining and falsely stopping on a numerically better result that does not satisfy the structural requirement.
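The dual gate can be sketched as follows; the function name is hypothetical, and the threshold and label set are taken from the latency example above, not from the skill's API.

```python
def should_stop(latency_ms, labels,
                threshold_ms=120.0,
                required_labels=frozenset({"production-path", "real-backend"})):
    """Stop only when the metric threshold AND the structural labels both hold."""
    return latency_ms <= threshold_ms and required_labels <= set(labels)

should_stop(110.0, ["production-path", "real-backend"])  # True: both gates pass
should_stop(110.0, ["mock-backend"])                     # False: fast, but wrong path
should_stop(150.0, ["production-path", "real-backend"])  # False: right path, too slow
```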