by gaasher
Loop until it's better — drop-in agentic loops (autoresearch, scientific writing, data analysis, code/SQL/prompt optimization, red-teaming) as open-standard Agent Skills. Verification-gated; native on Claude Code, portable across Codex, Cursor & other Skills hosts.
# Add to your Claude Code skills
git clone https://github.com/gaasher/Agent-Loop-Skills
Agent-Loop-Skills is an open-source AI agent skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by gaasher. It has 50 GitHub stars.
Agent-Loop-Skills is primarily written in Python. It is open-source under gaasher on GitHub, so you can review or fork the full source.
Autoresearch · scientific writing · data analysis · code/SQL/prompt optimization · red-teaming — each a generic, reusable loop you bind to your own task at invocation time, that iterates against a real signal until the work is actually better.
A real run. The tournament-autoresearch loop on a CIFAR-10 model under a fixed 5-epoch budget — competing agents propose a change each step, a self-calibrating judge keeps the winners (green) and discards the regressions (gray): 0.734 → 0.798 val_acc, hands-off, 7 of 11 kept. Full ledger: showcase/tournament-autoresearch. Far from SOTA by design — a deliberately tiny CNN at 5 epochs on a laptop GPU (Apple MPS). The demo is the loop's decision-making, not the absolute accuracy.
Two ideas collided in late 2025, and this repo lives in the overlap:
SKILL.md now runs across ~30 hosts (Claude Code, Codex, Cursor, …).

This repo makes the loop itself the skill. Instead of task-specific skills, each entry is a generic loop (program · artifact · feedback signal · run ledger · termination) that you bind to your task at invocation time. Paste your goal; the loop proposes a change, runs it in your environment, scores it on a real signal (tests, latency, a metric, a calibrated judge), keeps it only if it's better, logs it, and repeats.
The honest part: unsupervised agent loops are famous for spinning forever and confidently shipping garbage — at 90% per-step accuracy, a 5-step chain fails ~40% of the time. Every loop here is verification-gated: an objective feedback signal decides each step and an explicit termination condition ends it. That discipline — not autonomy for its own sake — is the point. (See Limitations.)
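The ~40% figure follows from compounding per-step accuracy. A minimal check, assuming each step succeeds independently (the function name is illustrative, not part of the repo):

```python
def chain_failure_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that at least one step fails, assuming independent steps."""
    return 1.0 - per_step_accuracy ** steps


# 0.9 ** 5 = 0.59049, so roughly 41% of 5-step chains contain a failure.
print(f"{chain_failure_rate(0.90, 5):.1%}")  # prints 41.0%
```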
flowchart LR
T["bind your task<br/>(artifact + signal + budget)"] --> P["propose<br/>one change"]
P --> R["run it in<br/>your env"]
R --> S{"score<br/>tests · metric · judge"}
S -->|better| K["keep + log"]
S -->|worse| X["revert"]
K --> G{stop?}
X --> G
G -->|"plateau · budget · threshold"| B(["best artifact"])
G -->|no| P
Every loop decomposes into the same five ingredients — program (SKILL.md), artifact slot
(what's improved), feedback signal (what drives the next step), run ledger (append-only log), and
termination (when to stop). Skills ship zero heavy dependencies: your code (a torch trainer, a SQL
database, a dataset) runs in your environment via a bound run command; the skill shells out and reads the
result. Multi-role loops use spawn-or-degrade — real isolated subagents on Claude Code, the same roles
inline elsewhere.
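The flowchart and the five ingredients can be sketched as a few lines of Python. This is an illustrative sketch only, not the repo's implementation: `run_loop`, its parameters, and the JSONL ledger format are assumptions made for the example, and a real loop would shell out to your bound run command inside `score`.

```python
import json
from typing import Callable, Optional, Tuple, TypeVar

A = TypeVar("A")  # the artifact type (code, a prompt, a config, ...)


def run_loop(
    artifact: A,
    propose: Callable[[A], A],
    score: Callable[[A], float],
    ledger_path: str,
    budget: int = 20,
    patience: int = 3,
    threshold: Optional[float] = None,
) -> Tuple[A, float]:
    """Keep-if-better loop with an append-only JSONL ledger (hypothetical sketch)."""
    best, best_score = artifact, score(artifact)
    stale = 0  # consecutive steps without improvement
    with open(ledger_path, "a", encoding="utf-8") as ledger:
        for step in range(1, budget + 1):       # termination: budget
            candidate = propose(best)           # propose one change
            candidate_score = score(candidate)  # run it and score it on the signal
            kept = candidate_score > best_score
            if kept:
                best, best_score, stale = candidate, candidate_score, 0
            else:
                stale += 1                      # revert: best stays unchanged
            ledger.write(json.dumps(
                {"step": step, "score": candidate_score, "kept": kept}) + "\n")
            if threshold is not None and best_score >= threshold:
                break                           # termination: threshold
            if stale >= patience:
                break                           # termination: plateau
    return best, best_score
```

With a toy `propose = lambda x: x + 0.5` and `score = lambda x: -(x - 3.0) ** 2` starting from `0.0`, this sketch keeps six steps, rejects the next three, and stops on the plateau rule with `best == 3.0`.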
Any one of these installs all the loops:
Claude Code — plugin marketplace (add once, then install):
/plugin marketplace add gaasher/agent-loop-skills
/plugin install agent-loops@agent-loop-skills
Loops install namespaced as agent-loops:<name> (e.g. agent-loops:karpathy).
Any Agent-Skills host — the standard installers:
npx skills add gaasher/agent-loop-skills # auto-detects host, installs to the right dir
gh skill install gaasher/agent-loop-skills --agent <host> # claude-code | codex | cursor | … (--pin, gh skill update)
Manual — clone, then copy the loops into your host's skills dir (pick the line for your host):
git clone https://github.com/gaasher/agent-loop-skills
cp -r agent-loop-skills/loops/* ~/.agents/skills/ # cross-tool: Codex, Cursor, Pi, OpenClaw, …
cp -r agent-loop-skills/loops/* ~/.claude/skills/ # Claude Code
# Hermes: hermes skills tap add gaasher/agent-loop-skills
Then just describe your task — the host loads the matching loop. Research loops also call the shared
literature-search skill; installing everything puts it alongside them, and any
loop degrades gracefully (to WebSearch) if it's absent.
Most skill repos tell you what a skill is. Here's what these loops actually do — real Sonnet runs,
full ledgers in showcase/.
tournament-autoresearch: competing ideas, a self-calibrating judge

<n> agents pitch competing changes each step; a judge critiques them, picks one, runs it, and recalibrates
by comparing its predicted vs realized gain. On a CIFAR-10 SmallCNN under a fixed 5-epoch budget it
climbed 0.734 → 0.798 val_acc, keeping 7 of 11 changes and reverting all 4 that regressed —
escaping the plateaus a single-thread loop gets stuck on. (That's the run charted up top.)
→ showcase/tournament-autoresearch
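One way to picture "recalibrates by comparing its predicted vs realized gain" is a judge that keeps a running correction factor for its own optimism. The sketch below is an assumption about how such a judge could work, not the loop's actual logic: the class, the `learning_rate` update rule, and the pitch names and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class CalibratingJudge:
    """Hypothetical judge that tracks how optimistic its gain predictions are."""
    calibration: float = 1.0    # multiplier applied to raw predicted gains
    learning_rate: float = 0.5  # how quickly calibration follows new evidence

    def pick(self, pitches: Dict[str, float]) -> Tuple[str, float]:
        """Return the pitch with the largest predicted gain and its calibrated gain."""
        name = max(pitches, key=lambda k: pitches[k])
        return name, pitches[name] * self.calibration

    def recalibrate(self, raw_predicted: float, realized: float) -> None:
        """Move calibration toward the realized/predicted ratio of the last run."""
        if raw_predicted <= 0:
            return  # nothing to learn from a non-positive prediction
        ratio = max(realized, 0.0) / raw_predicted
        self.calibration += self.learning_rate * (ratio - self.calibration)


judge = CalibratingJudge()
name, expected = judge.pick({"batchnorm": 0.04, "augment": 0.02})  # made-up pitches
judge.recalibrate(raw_predicted=0.04, realized=0.02)  # the run delivered half the gain
```

After that single over-optimistic prediction, `judge.calibration` drops from 1.0 to 0.75, so the next predicted gain of 0.04 is discounted to 0.03.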
ml-autoresearch: analysis-first, every change traced to a cause

This loop reads inside each run (gradient flow, dead neurons, the loss curve) and grounds the next
change in that evidence rather than guessing: "FC grad 57% vs first conv 3.3% — severe imbalance; 54% dead
neurons" → add BatchNorm; "cosine schedule fixed the epoch-3 dip entirely (monotonic!), +0.033". It also
reverts what hurts (augmentation, over-aggressive LR). The point isn't a leaderboard number — it's that
every accepted change has a measured reason behind it. → showcase/ml-autoresearch
data-analysis: findings with a number behind every one

Hypothesis → verify, stdlib-only. On a planted dataset it surfaced 3 real findings and correctly refuted 2,
with effect sizes matching ground truth and no hallucinations: enterprise vs consumer order value
184.90 vs 109.16 (Cohen's d = 2.13), mobile return rate 32.8% vs 8.2% (RR 4.0) — and it reversed a
plausible-but-wrong claim once it spotted a mobile confound. → showcase/data-analysis
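The two effect sizes quoted above can be computed with the standard library alone. The sketch below uses the textbook pooled-standard-deviation form of Cohen's d and a plain risk ratio; the function names and the example counts are hypothetical and are not taken from the showcase ledger.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def risk_ratio(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Ratio of the event rate in group A to the event rate in group B."""
    return (events_a / n_a) / (events_b / n_b)


# Hypothetical counts chosen only to reproduce the quoted rates (32.8% vs 8.2%).
print(round(risk_ratio(328, 1000, 82, 1000), 1))  # prints 4.0
```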
| Loop | What the run did |
|---|---|
| optimize-loop | Correctness-gated speedup: a SQLite query 1,131.75 ms → 1.055 ms (~1,073×), result-set hash matching baseline on every kept iteration; in code mode cut cyclomatic complexity 23 → 15 (nesting 7 → 3) with 13/13 tests green. |
| research-proposal | ScholarEval graded a proposal against the literature; Judge + Reviser iterated grade 45 → 84 (soundness 2→4, contribution 1→4) over 5 rounds. |
| scientific-figure | Same ImageNet top-1-accuracy bar-chart brief, with vs without the loop: a single call truncated the y-axis at 50% and used non-paper numbers; the loop verified every value against the arXiv papers, flagged GoogLeNet's borrowed top-1, and iterated 80 → 96 (PASS). |
| red-team | Against a naive content filter, surfaced all 5 planted weaknesses (case bypass, leetspeak, spacing, synonyms, over-block): 39 bypasses + 6 over-blocks, with a one-line root-cause fix each. |
| power-analysis | Solved n = 100/group for 80% power via Monte-Carlo, fixed all 6 validity flaws, and emitted a full pre-registration. |
| research-question | Sharpened 5 vague drafts → 3 strong questions (≥75), with real web novelty checks pivoting already-answered questions toward the open sub-problem. |
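
The "correctness-gated" idea in the optimize-loop row can be sketched with `sqlite3` and `hashlib`: a candidate query is kept only if its result set hashes to the same value as the baseline and it runs faster. This is a rough illustration under stated assumptions, not the skill's code. The helper names are hypothetical, and the hash here ignores row order, which would be wrong for queries whose contract includes an `ORDER BY`.

```python
import hashlib
import sqlite3
import time


def result_hash(conn: sqlite3.Connection, sql: str) -> str:
    """Order-insensitive SHA-256 of a query's result set."""
    rows = sorted(repr(row) for row in conn.execute(sql))
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


def timed_ms(conn: sqlite3.Connection, sql: str, repeats: int = 5) -> float:
    """Best-of-N wall-clock time for a query, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


def accept(conn: sqlite3.Connection, baseline_sql: str, candidate_sql: str) -> bool:
    """Keep the candidate only if results match the baseline AND it is faster."""
    if result_hash(conn, candidate_sql) != result_hash(conn, baseline_sql):
        return False  # correctness gate: never trade results for speed
    return timed_ms(conn, candidate_sql) < timed_ms(conn, baseline_sql)
```

Checking the hash before timing means a fast but wrong rewrite is rejected without ever being compared on speed.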
† = multi-role (real subagents on Claude Code, inline elsewhere). Browse any folder for its SKILL.md.
| Loop | Why you'd reach for it |
|---|---|
| karpathy | The minimal baseline: propose, train, keep-if-better, loop. A faithful nod to Karpathy's autoresearch. |
| ml-autoresearch † | Analysis-first: diagnoses each run and grounds the next change in evidence. A literature dial adds paper-grounded changes. |
| exploratory-autoresearch | Forces broad exploration via a temperatur |