Turns AI agents from chaotic code generators into disciplined engineers. 12-stage workflow from research to production.
```bash
# Add to your Claude Code skills
git clone https://github.com/artemiimillier/bulletproof.git
```

Author: Artemiy Miller (@artemiimillier) · Telegram · who.ismillerr@gmail.com · TG Channel
Version: 5.0 · March 2026 · License: MIT
Compatible: Claude Code, Codex, Gemini CLI, Cursor, Windsurf, OpenCode
Code to solve problems, not code for code's sake.
Before EVERY change ask: "Does this actually solve our problem? Is this the most efficient solution?" If the answer isn't clear — stop, research alternatives, pick the best one.
Not every task needs the full pipeline.
| Size | Examples | Mode | Stages |
|------|----------|------|--------|
| S | Bug fix, small edit, 1-2 files | Lightweight | 1 → 4 → 5 → 6 → 7 → Gates (skip spec/plan) |
| M | New feature, module refactor, 3-10 files | Standard | Stages 1-10 |
| L | Architecture change, new service, 10+ files | Full | Stages 1-12 (all) |
How stages relate: Stages 5-6-7 (Self-Audit, Verification, Impact) run inside each implementation phase as an inner loop. Stages 8-12 run once after all phases complete as an outer loop.
Code quality degrades when context fills beyond 40% (the "Dumb Zone"). Rules:

- `/compact` at 50% — don't wait for auto-compact
- `/clear` → fresh start

Every major stage = a clean context window.
A complete development methodology for AI agents. From idea to production.
AI agents without a system are chaotic code generators. They start coding before they understand the task, grab the first solution instead of the best one, "find bugs" that aren't bugs, and say "done" when half the work isn't finished. Bulletproof turns that chaos into discipline.
You describe a feature. The AI writes code. Looks great. Then the problems surface.
Before `/clear`, always create `progress/<task>-handoff.md`.
See templates/handoff.md for format.
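As a rough illustration, a handoff carries enough state for a fresh session to continue the task. A hedged sketch only — the real format lives in templates/handoff.md, and the section names here (Done, In Progress, Next Steps, Key Decisions) are assumptions, not the skill's actual template:

```python
# Hypothetical sketch — the real template is templates/handoff.md.
# Section names below are assumptions, not the skill's actual format.
def render_handoff(task: str, done: list, in_progress: list,
                   next_steps: list, decisions: list) -> str:
    """Build the markdown body for progress/<task>-handoff.md."""
    lines = [f"# Handoff: {task}", ""]
    sections = [("Done", done), ("In Progress", in_progress),
                ("Next Steps", next_steps), ("Key Decisions", decisions)]
    for title, items in sections:
        lines.append(f"## {title}")
        # Empty sections still get a placeholder so nothing reads as omitted.
        lines += [f"- {item}" for item in items] or ["- (none)"]
        lines.append("")
    return "\n".join(lines)
```

Write the result to `progress/<task>-handoff.md` before running `/clear`, so the next session can pick up from the document instead of from a cold start.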
Don't dump the entire codebase into context: "For details, see path/to/docs.md" (not @file).

Mode: Read-Only. No code. No changes.
thoughts/research/YYYY-MM-DD-<task>.md
(see templates/research.md for format)

→ /clear
Mode: Read + Write only in specs/. No code.
Spec = WHAT and WHY. Not how. Spec = contract.
specs/YYYY-MM-DD-<name>.md
(see templates/spec.md for format)

Skip for size S tasks.
→ /clear
Mode: Read + Write only in plans/. No code yet.
Inputs: Spec (specs/) and Research (thoughts/research/).

Before finalizing the plan, answer 3 questions:
1. DOES THIS SOLVE THE PROBLEM?
Compare every plan item against acceptance criteria from spec.
If any criterion is uncovered — the plan is incomplete.
2. IS THIS THE MOST EFFICIENT SOLUTION?
Search: who has already solved this problem? What approach did they use?
Name 2-3 alternative approaches (including ones found via research).
For each: pros, cons, effort.
Justify why the chosen approach is better than all alternatives.
3. IS THERE "CODE FOR CODE'S SAKE"?
Every change must directly serve acceptance criteria.
If a change isn't tied to solving the problem — remove it.
Drive-by refactoring = separate task, not part of this one.
- Ctrl+G — plan opens in editor
- `> NOTE:` annotations
- "Address all notes, don't implement yet"
- Create plans/YYYY-MM-DD-<name>.md
(see templates/plan.md for full template with Challenge Log, phases, prompts)
→ /clear
Each phase = separate session, fresh context, feature branch.
Phases can be run in parallel via separate Claude Code sessions/terminals when they don't depend on each other. Check the plan for dependencies before parallelizing.
Guard phrase to start coding: Only begin implementation after the plan is finalized and all annotation notes are addressed. The trigger: "Implement Phase N according to plan."
Order within each phase:
- Work on the feature/<task> branch
- Mark the phase in_progress; when done, mark it completed and write to Changelog
- /clear

Mandatory BEFORE marking completed:
Check the phase implementation:
1. SPEC COMPLIANCE
Open spec. Walk through every acceptance criterion.
For each: implemented? Where exactly in code?
If any not covered — finish it.
2. CHALLENGE THE SOLUTION
Look at the written code with fresh eyes.
Does this actually solve the problem from spec?
Is there a simpler/more efficient way?
Any "code for code's sake" — changes unrelated to the task?
Not just linting. Thoughtful review with false-positive filtering.
Check ALL code from this phase for:
- Logic errors (wrong conditions, off-by-one, race conditions)
- Data handling (null/undefined, type mismatches)
- Security (injection, auth bypass, exposed secrets)
- Performance (N+1 queries, memory leaks, unnecessary re-renders)
For EACH found bug:
1. Is this a REAL bug or a false positive?
2. Can you prove this bug is reproducible?
3. If you can't prove it — it's NOT a bug. Don't touch it.
RULE: Don't fix code "for beauty" or "just in case".
Fix ONLY proven bugs that actually affect functionality.
Every "fix" without proof = risk of introducing a new bug.
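To make the proof rule concrete, a hypothetical illustration (the function and the "bug report" are invented for this example): a suspected bug only counts once a minimal repro actually fails.

```python
# Hypothetical example of the "prove it before you fix it" rule.
# Suspected bug report: "last_n drops the final element when n == len(items)".
def last_n(items, n):
    return items[-n:] if n else []

# Step 1: write a minimal repro for the exact claim.
repro = last_n([1, 2, 3], 3)

# Step 2: the repro passes — the claim is a false positive. Don't "fix" it.
assert repro == [1, 2, 3]

# A REAL bug must come with a failing repro. n == 0 returning [] might look
# suspicious, but unless a spec criterion says otherwise, it's not proven.
```

The same discipline applies to every finding in the list above: no failing repro, no fix.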
Final code cleanliness check:
- Logic: is the data flow correct from input to output?
- Efficiency: any redundant operations?
- Readability: is the code understandable without comments?
BUT: don't refactor "for beauty". Only if it affects correctness.
The most underestimated stage. 75% of AI agents break previously working code.
MANDATORY CHECK BEFORE MERGE:
1. REGRESSION
What other modules/functions depend on changed files?
Run ALL project tests (not just current phase).
If anything broke — this is priority #1.
2. SIDE EFFECTS
Did any contracts/interfaces change (API, props, types)?
If yes — who uses them? Are all consumers updated?
3. THINK AHEAD
What problems could these changes cause in a week/month?
Edge cases we haven't tested?
What happens with: zero data? Huge data? Concurrent requests?
What if the user does something unexpected?
4. COMPATIBILITY
Backward compatibility preserved?
Data migrations needed?
Feature flags needed for gradual rollout?
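The zero/huge/concurrent probes from the checklist can be written as throwaway checks. A sketch, assuming a hypothetical `summarize` function as a stand-in for whatever the phase actually touched:

```python
import threading

# Hypothetical "think ahead" probes. `summarize` stands in for any function
# changed in this phase — swap in your real code under test.
def summarize(values):
    return {"count": len(values), "total": sum(values)}

# Zero data: does the empty case behave, or blow up?
assert summarize([]) == {"count": 0, "total": 0}

# Huge data: does it survive a million records?
assert summarize(list(range(1_000_000)))["count"] == 1_000_000

# Concurrent requests: eight threads hitting the same code path.
results = []
def worker():
    results.append(summarize([1, 2, 3])["total"])
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == [6] * 8
```

A pure function like this passes trivially; the point is that code holding shared state or doing I/O often doesn't, and these five minutes of probing are cheaper than the week-later incident.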
All phases completed → run gates across the entire project.

New session. No implementation bias.
@code-reviewer agent (see agents/code-reviewer.md)

```bash
semgrep --config=auto .
# or
/security-review  # built into Claude Code
```
If review/scan found issues:
```bash
mv plans/<file> plans/archive/
```

A phase CANNOT be completed without passing ALL required gates.
```bash
# Frontend
cd frontend && npx tsc --noEmit  # 0 type errors
cd frontend && npm run lint      # 0 lint errors
cd frontend && npm test          # all tests green

# Backend
cd backend && python -m py_compile app/main.py
cd backend && pytest --tb=short -q
cd backend && ruff check .

npx madge --circular src/        # circular dependencies
npm audit --audit-level=high     # dependency vulnerabilities
pip-audit

semgrep --config=auto .
# or /security-review
```
If a gate fails — fix and re-run. Never skip.
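The gate loop can be scripted with a fail-fast runner. A sketch, not part of the skill — the gate list is illustrative, and the commands are whatever your project's real gates are:

```python
import subprocess

def run_gates(gates):
    """Run shell-command gates in order; stop at the first failure.

    Returns the index of the first failing gate, or -1 if all pass —
    mirroring the rule: if a gate fails, fix and re-run. Never skip.
    """
    for i, cmd in enumerate(gates):
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            print(f"GATE FAILED: {cmd}")
            return i
    return -1
```

Usage (commands are examples, substitute your own): `run_gates(["npx tsc --noEmit", "pytest --tb=short -q", "ruff check ."])`.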
Add to .claude/settings.json:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "bash -c \"CMD=$(echo $TOOL_INPUT | jq -r '.command // empty'); echo \\\"$CMD\\\" | grep -qE '(git push.*(main|master)|rm -rf /|DROP TABLE)' && echo 'BLOCKED: Use feature branch / safe alternative.' >&2 && exit 2 || exit 0\""
        }]
      }
    ],
    "Stop": [
      {
        "hooks": [{
          "type": "prompt",
          "prompt": "You are a JSON-only evaluator. Respond ONLY with raw JSON, no markdown.\n\nReview the assistant's final response. Reject if:\n- Rationalizing incomplete work ('pre-existing', 'out of scope', 'follow-up')\n- Listing problems without fixing them\n- Skipping test/lint failures with excuses\n- Making changes unrelated to the stated problem ('code for code's sake')\n- Claiming completion without running verification gates\n\nRespond: {\"ok\": false, \"reason\": \"[issue]. Go back and finish.\"}\nor: {\"ok\": true}"
        }]
      }
    ]
  }
}
```
Pushes to main are blocked by the hook above — work on a feature/<task> branch.

```json
{
  "matcher": "Write|Edit",
  "hooks": [{
    "type": "command",
    "command": "npx prettier --write \"$FILE_PATH\" 2>/dev/null || true"
  }]
}
```
Claude generates well-formatted code; the hook handles the last 10% to avoid CI failures.
```json
{
  "matcher": "Write|Edit",
  "hooks": [{
    "type": "command",
    "command": "bash -c \"CONTENT=$(echo $TOOL_INPUT | jq -r '.content // empty'); echo \\\"$CONTENT\\\" | grep -qiP '(api.?key|secret|password)\\s*=\\s*[\\x27\\\"][^\\x27\\\"]{10,}' && echo 'BLOCKED: Hardcoded secret. Use env vars.' >&2 && exit 2 || exit 0\""
  }]
}
```
Fragile (regex-based) but catches obvious mistakes. For production, use semgrep or /security-review instead.
| Stage | Model | Why |
|-------|-------|-----|
| Research, Planning | Opus | Cross-file reasoning |
| Implementation | Sonnet | Speed, cost-efficiency |
| Code Review, Security | Opus | Deep analysis |
| Anti-rationalization hook | Haiku | Fast, cheap gate |
```
project/
├── .claude/
│   ├── settings.json            # hooks config
│   ├── skills/
│   │   └── bulletproof/
│   │       ├── SKILL.md         # ← this file
│   │       ├── templates/
│   │       │   ├── research.md
│   │       │   ├── spec.md
│   │       │   ├── plan.md
│   │       │   └── handoff.md
│   │       └── agents/
│   │           └── code-reviewer.md
│   └── agents/                  # project-level agents
├── CLAUDE.md                    # project brain
├── specs/                       # WHAT and WHY
├── plans/                       # HOW
│   └── archive/                 # completed plans
├── thoughts/research/           # research artifacts
└── progress/                    # handoff files
```
75% of AI agents introduce regressions into working code (SWE-CI benchmark, Alibaba 2025). This isn't an AI problem. It's a process problem.
Bulletproof is a 12-stage workflow. Every stage exists because without it, something specific breaks. Not every task goes through all 12 - a bug fix runs through 6, a feature through 10, an architecture change through all 12.
Here's what happens at each one:
The pain: AI jumps straight into coding. Doesn't study the codebase, doesn't look for existing solutions, doesn't understand context.
What Bulletproof does: AI launches parallel research agents. Each one digs into a different area - project structure, patterns, dependencies, tests. At the same time, it searches the web: who's already solved this? What libraries exist? What's the proven best practice?
The key thing: The output isn't a list of options. It's a concrete recommendation: "the best approach is X, because Y." The AI has to make a decision and defend it, not dump the choice on you.
The pain: AI starts writing code without defining what exactly needs to be done. No criteria for "done." It ends up building the wrong thing, or building too much.
What Bulletproof does: Creates a specification: WHAT we're building and WHY. Not how - just what. With clear acceptance criteria - an objective measure of "done" that the AI can't argue with later.
The key thing: The spec is a contract. When the AI checks its own work at Stage 5, it checks against this contract, not against its gut feeling of "seems about right."
The pain: AI grabs the first solution that pops into its head. Doesn't consider alternatives, doesn't think about consequences.
What Bulletproof does: AI creates a plan. But before it can start coding, it has to pass the Challenge Loop - answer 3 questions:
The key thing: AI can't start coding until it has proven that its plan is the best option available. Not "I think so" - "here are 3 options, here's the comparison, here's why this one wins."
The pain: AI writes code in one big chunk, context fills up, quality drops. No tests, no iterations.
What Bulletproof does: Implementation is split into phases. Each phase runs in a fresh context window (so the AI doesn't get dumber as it goes). Order: tests first (TDD), then code. Phases with no dependencies can run in parallel across separate terminals.
The key thing: The 40% rule. AI output quality degrades when context fills beyond 40%. Bulletproof runs /clear between stages and passes context through handoff documents. The AI always works in its "smart zone."
The pain: AI says "done" - but half the criteria aren't met. Or it did extra stuff nobody asked for.
What Bulletproof does: AI opens the spec and walks through every acceptance criterion: implemented? Where exactly in the code? Anything in there that wasn't part of the task?
The key thing: It doesn't check based on vibes. It checks against the contract. Every criterion - yes or no. If no - go back and finish.
The pain: AI "finds bugs" that aren't bugs. Fixes things that aren't broken. Makes "improvements" that create real problems.
What Bulletproof does: Three-step check. Step 1 - find errors (logic, security, performance). Step 2 - prove every bug is real. Can you reproduce it? No? Then it's not a bug, don't touch it. Step 3 - logic and efficiency review.
The key thing: The rule is "don't fix code for aesthetics or just in case." Every fix without proof is a risk of introducing a new bug. Early AI code reviewers flagged 9 false positives for every 1 real bug (Anthropic). This stage cuts out 90% of wasted work.
The pain: The code works. But it broke something somewhere else. You find out a week later.
What Bulletproof does: Mandatory check before merge: (1) What modules depend on the changed files? Run ALL project tests, not just the current phase. (2) Did any contracts change - APIs, types, interfaces? Are all consumers updated? (3) What could go wrong in a month? With zero data? With a million records? With concurrent requests? (4) Backward compatibility? Migrations needed?
The key thing: 75% of AI agents break working code - precisely because this stage doesn't exist. Dependency graph analysis cuts regressions by 70% (TDAD/arXiv).
All phases done - full test suite across the entire project. Audit: is everything from the spec actually implemented?
The pain: AI reviews its own code and thinks it's great.
What Bulletproof does: New session. Fresh context. A separate agent that has never seen the implementation. Checks edge cases, race conditions, security, performance.
The key thing: AI reviewing AI, but without the implementer's bias. For critical code - you still need a human, and Bulletproof says so explicitly.
Automated vulnerability scanning. AI-generated code has 2-3x more security issues than human-written code. This catches them.
Found issues? Fix only proven bugs (Stage 6 rule still applies). After fixes - run impact analysis again. Fixes break code more often than original development does.
Archive the plan. Squash merge. Deploy - only when you explicitly say so.
This one lives outside the stages. It's a Stop hook that fires every time the AI tries to wrap up. It checks:
If yes - blocks completion and sends the AI back to finish.
Not every task needs all 12 stages:
| Size | What | Stages |
|------|------|--------|
| S - bug fix, 1-2 files | Lightweight | Research → Build → Self-Audit → Verify → Impact → Gates |
| M - feature, 3-10 files | Standard | Stages 1-10 |
| L - architecture, 10+ files | Full pipeline | All 12 stages |
Not theory. Every mechanism is backed by research:
| Mechanism | Source |
|-----------|--------|
| 40% context rule | HumanLayer |
| Challenge Loop (justify decisions) | Addy Osmani, spec-first workflow |
| False-positive filter | Anthropic Code Review |
| Impact Analysis (dependency graphs) | SWE-CI (Alibaba), TDAD/arXiv |
| Anti-rationalization | Trail of Bits |
| Phase separation | RIPER-5, Spotify Engineering |
```bash
# Into your project
mkdir -p .claude/skills && git clone https://github.com/artemiimillier/bulletproof.git .claude/skills/bulletproof

# Global (all projects)
mkdir -p ~/.claude/skills && git clone https://github.com/artemiimillier/bulletproof.git ~/.claude/skills/bulletproof

# For teams
git submodule add https://github.com/artemiimillier/bulletproof.git .claude/skills/bulletproof
```
Open Claude Code → type /bulletproof → done.
Or