Turns AI agents from chaotic code generators into disciplined engineers. 12-stage workflow from research to production.
```bash
# Add to your Claude Code skills
git clone https://github.com/artemiimillier/bulletproof.git
```

Author: Artemiy Miller (@artemiimillier) · Telegram · who.ismillerr@gmail.com · TG Channel
Version: 5.0 · March 2026 · License: MIT
Compatible: Claude Code, Codex, Gemini CLI, Cursor, Windsurf, OpenCode
Code to solve problems, not code for code's sake.
Before EVERY change ask: "Does this actually solve our problem? Is this the most efficient solution?" If the answer isn't clear — stop, research alternatives, pick the best one.
Not every task needs the full pipeline.
| Size | Examples | Mode | Stages |
|------|----------|------|--------|
| S | Bug fix, small edit, 1-2 files | Lightweight | 1 → 4 → 5 → 6 → 7 → Gates (skip spec/plan) |
| M | New feature, module refactor, 3-10 files | Standard | Stages 1-10 |
| L | Architecture change, new service, 10+ files | Full | Stages 1-12 (all) |
How stages relate: Stages 5-6-7 (Self-Audit, Verification, Impact) run inside each implementation phase as an inner loop. Stages 8-12 run once after all phases complete as an outer loop.
Code quality degrades when context fills beyond 40% (the "Dumb Zone"). Rules:

- `/compact` at 50% — don't wait for auto-compact
- `/clear` → fresh start

Every major stage = a clean context window.
A complete development methodology for AI agents. From idea to production.
AI agents without a system are chaotic code generators. They start coding before they understand the task, grab the first solution instead of the best one, "find bugs" that aren't bugs, and say "done" when half the work isn't finished. Bulletproof turns that chaos into discipline.
You describe a feature. The AI writes code. Looks great. Then the problems surface.
Before `/clear`, always create `progress/<task>-handoff.md`.
See templates/handoff.md for format.
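As a rough illustration, a handoff carries enough state for a fresh session to continue the task. A hedged sketch only — the real format lives in templates/handoff.md, and the section names here (Done, In Progress, Next Steps, Key Decisions) are assumptions, not the skill's actual template:

```python
# Hypothetical sketch — the real template is templates/handoff.md.
# Section names below are assumptions, not the skill's actual format.
def render_handoff(task: str, done: list, in_progress: list,
                   next_steps: list, decisions: list) -> str:
    """Build the markdown body for progress/<task>-handoff.md."""
    lines = [f"# Handoff: {task}", ""]
    sections = [("Done", done), ("In Progress", in_progress),
                ("Next Steps", next_steps), ("Key Decisions", decisions)]
    for title, items in sections:
        lines.append(f"## {title}")
        # Empty sections still get a placeholder so nothing reads as omitted.
        lines += [f"- {item}" for item in items] or ["- (none)"]
        lines.append("")
    return "\n".join(lines)
```

Write the result to `progress/<task>-handoff.md` before running `/clear`, so the next session can pick up from the document instead of from a cold start.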
Don't dump the entire codebase into context: "For details, see path/to/docs.md" (not @file).

Mode: Read-Only. No code. No changes.
thoughts/research/YYYY-MM-DD-<task>.md
(see templates/research.md for format)

→ /clear
Mode: Read + Write only in specs/. No code.
Spec = WHAT and WHY. Not how. Spec = contract.
specs/YYYY-MM-DD-<name>.md
(see templates/spec.md for format)

Skip for size S tasks.
→ /clear
Mode: Read + Write only in plans/. No code yet.
Inputs: Spec (specs/) and Research (thoughts/research/).

Before finalizing the plan, answer 3 questions:
1. DOES THIS SOLVE THE PROBLEM?
Compare every plan item against acceptance criteria from spec.
If any criterion is uncovered — the plan is incomplete.
2. IS THIS THE MOST EFFICIENT SOLUTION?
Search: who has already solved this problem? What approach did they use?
Name 2-3 alternative approaches (including ones found via research).
For each: pros, cons, effort.
Justify why the chosen approach is better than all alternatives.
3. IS THERE "CODE FOR CODE'S SAKE"?
Every change must directly serve acceptance criteria.
If a change isn't tied to solving the problem — remove it.
Drive-by refactoring = separate task, not part of this one.
- Ctrl+G — plan opens in editor
- `> NOTE:` annotations
- "Address all notes, don't implement yet"
- Create plans/YYYY-MM-DD-<name>.md
(see templates/plan.md for full template with Challenge Log, phases, prompts)
→ /clear
Each phase = separate session, fresh context, feature branch.
Phases can be run in parallel via separate Claude Code sessions/terminals when they don't depend on each other. Check the plan for dependencies before parallelizing.
Guard phrase to start coding: Only begin implementation after the plan is finalized and all annotation notes are addressed. The trigger: "Implement Phase N according to plan."
Order within each phase:
- Work on the feature/<task> branch
- Mark the phase in_progress; when done, mark it completed and write to Changelog
- /clear

Mandatory BEFORE marking completed:
Check the phase implementation:
1. SPEC COMPLIANCE
Open spec. Walk through every acceptance criterion.
For each: implemented? Where exactly in code?
If any not covered — finish it.
2. CHALLENGE THE SOLUTION
Look at the written code with fresh eyes.
Does this actually solve the problem from spec?
Is there a simpler/more efficient way?
Any "code for code's sake" — changes unrelated to the task?
Not just linting. Thoughtful review with false-positive filtering.
Check ALL code from this phase for:
- Logic errors (wrong conditions, off-by-one, race conditions)
- Data handling (null/undefined, type mismatches)
- Security (injection, auth bypass, exposed secrets)
- Performance (N+1 queries, memory leaks, unnecessary re-renders)
For EACH found bug:
1. Is this a REAL bug or a false positive?
2. Can you prove this bug is reproducible?
3. If you can't prove it — it's NOT a bug. Don't touch it.
RULE: Don't fix code "for beauty" or "just in case".
Fix ONLY proven bugs that actually affect functionality.
Every "fix" without proof = risk of introducing a new bug.
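To make the proof rule concrete, a hypothetical illustration (the function and the "bug report" are invented for this example): a suspected bug only counts once a minimal repro actually fails.

```python
# Hypothetical example of the "prove it before you fix it" rule.
# Suspected bug report: "last_n drops the final element when n == len(items)".
def last_n(items, n):
    return items[-n:] if n else []

# Step 1: write a minimal repro for the exact claim.
repro = last_n([1, 2, 3], 3)

# Step 2: the repro passes — the claim is a false positive. Don't "fix" it.
assert repro == [1, 2, 3]

# A REAL bug must come with a failing repro. n == 0 returning [] might look
# suspicious, but unless a spec criterion says otherwise, it's not proven.
```

The same discipline applies to every finding in the list above: no failing repro, no fix.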
Final code cleanliness check:
- Logic: is the data flow correct from input to output?
- Efficiency: any redundant operations?
- Readability: is the code understandable without comments?
BUT: don't refactor "for beauty". Only if it affects correctness.
The most underestimated stage. 75% of AI agents break previously working code.
MANDATORY CHECK BEFORE MERGE:
1. REGRESSION
What other modules/functions depend on changed files?
Run ALL project tests (not just current phase).
If anything broke — this is priority #1.
2. SIDE EFFECTS
Did any contracts/interfaces change (API, props, types)?
If yes — who uses them? Are all consumers updated?
3. THINK AHEAD
What problems could these changes cause in a week/month?
Edge cases we haven't tested?
What happens with: zero data? Huge data? Concurrent requests?
What if the user does something unexpected?
4. COMPATIBILITY
Backward compatibility preserved?
Data migrations needed?
Feature flags needed for gradual rollout?
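The zero/huge/concurrent probes from the checklist can be written as throwaway checks. A sketch, assuming a hypothetical `summarize` function as a stand-in for whatever the phase actually touched:

```python
import threading

# Hypothetical "think ahead" probes. `summarize` stands in for any function
# changed in this phase — swap in your real code under test.
def summarize(values):
    return {"count": len(values), "total": sum(values)}

# Zero data: does the empty case behave, or blow up?
assert summarize([]) == {"count": 0, "total": 0}

# Huge data: does it survive a million records?
assert summarize(list(range(1_000_000)))["count"] == 1_000_000

# Concurrent requests: eight threads hitting the same code path.
results = []
def worker():
    results.append(summarize([1, 2, 3])["total"])
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == [6] * 8
```

A pure function like this passes trivially; the point is that code holding shared state or doing I/O often doesn't, and these five minutes of probing are cheaper than the week-later incident.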
All phases completed → run gates across the entire project.

New session. No implementation bias.
@code-reviewer agent (see agents/code-reviewer.md)

```bash
semgrep --config=auto .
# or
/security-review  # built into Claude Code
```
If review/scan found issues:
```bash
mv plans/<file> plans/archive/
```

A phase CANNOT be completed without passing ALL required gates.
```bash
# Frontend
cd frontend && npx tsc --noEmit  # 0 type errors
cd frontend && npm run lint      # 0 lint errors
cd frontend && npm test          # all tests green

# Backend
cd backend && python -m py_compile app/main.py
cd backend && pytest --tb=short -q
cd backend && ruff check .

npx madge --circular src/        # circular dependencies
npm audit --audit-level=high     # dependency vulnerabilities
pip-audit

semgrep --config=auto .
# or /security-review
```
If a gate fails — fix and re-run. Never skip.
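The gate loop can be scripted with a fail-fast runner. A sketch, not part of the skill — the gate list is illustrative, and the commands are whatever your project's real gates are:

```python
import subprocess

def run_gates(gates):
    """Run shell-command gates in order; stop at the first failure.

    Returns the index of the first failing gate, or -1 if all pass —
    mirroring the rule: if a gate fails, fix and re-run. Never skip.
    """
    for i, cmd in enumerate(gates):
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            print(f"GATE FAILED: {cmd}")
            return i
    return -1
```

Usage (commands are examples, substitute your own): `run_gates(["npx tsc --noEmit", "pytest --tb=short -q", "ruff check ."])`.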
Add to .claude/settings.json:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [{
          "type": "command",
          "command": "bash -c \"CMD=$(echo $TOOL_INPUT | jq -r '.command // empty'); echo \\\"$CMD\\\" | grep -qE '(git push.*(main|master)|rm -rf /|DROP TABLE)' && echo 'BLOCKED: Use feature branch / safe alternative.' >&2 && exit 2 || exit 0\""
        }]
      }
    ],
    "Stop": [
      {
        "hooks": [{
          "type": "prompt",
          "prompt": "You are a JSON-only evaluator. Respond ONLY with raw JSON, no markdown.\n\nReview the assistant's final response. Reject if:\n- Rationalizing incomplete work ('pre-existing', 'out of scope', 'follow-up')\n- Listing problems without fixing them\n- Skipping test/lint failures with excuses\n- Making changes unrelated to the stated problem ('code for code's sake')\n- Claiming completion without running verification gates\n\nRespond: {\"ok\": false, \"reason\": \"[issue]. Go back and finish.\"}\nor: {\"ok\": true}"
        }]
      }
    ]
  }
}
```
Pushes to main are blocked by the hook above — work on a feature/<task> branch.

```json
{
  "matcher": "Write|Edit",
  "hooks": [{
    "type": "command",
    "command": "npx prettier --write \"$FILE_PATH\" 2>/dev/null || true"
  }]
}
```
Claude generates well-formatted code; the hook handles the last 10% to avoid CI failures.
```json
{
  "matcher": "Write|Edit",
  "hooks": [{
    "type": "command",
    "command": "bash -c \"CONTENT=$(echo $TOOL_INPUT | jq -r '.content // empty'); echo \\\"$CONTENT\\\" | grep -qiP '(api.?key|secret|password)\\s*=\\s*[\\x27\\\"][^\\x27\\\"]{10,}' && echo 'BLOCKED: Hardcoded secret. Use env vars.' >&2 && exit 2 || exit 0\""
  }]
}
```
Fragile (regex-based) but catches obvious mistakes. For production, use semgrep or /security-review instead.
| Stage | Model | Why |
|-------|-------|-----|
| Research, Planning | Opus | Cross-file reasoning |
| Implementation | Sonnet | Speed, cost-efficiency |
| Code Review, Security | Opus | Deep analysis |
| Anti-rationalization hook | Haiku | Fast, cheap gate |
```
project/
├── .claude/
│   ├── settings.json            # hooks config
│   ├── skills/
│   │   └── bulletproof/
│   │       ├── SKILL.md         # ← this file
│   │       ├── templates/
│   │       │   ├── research.md
│   │       │   ├── spec.md
│   │       │   ├── plan.md
│   │       │   └── handoff.md
│   │       └── agents/
│   │           └── code-reviewer.md
│   └── agents/                  # project-level agents
├── CLAUDE.md                    # project brain
├── specs/                       # WHAT and WHY
├── plans/                       # HOW
│   └── archive/                 # completed plans
├── thoughts/research/           # research artifacts
└── progress/                    # handoff files
```
75% of AI agents introduce regressions into working code (SWE-CI benchmark, Alibaba 2025). This isn't an AI problem. It's a process problem.
Bulletproof is a 12-stage workflow. Every stage exists because without it, something specific breaks. Not every task goes through all 12 - a bug fix runs through 6, a feature through 10, an architecture change through all 12.
Here's what happens at each one:
The pain: AI jumps straight into coding. Doesn't study the codebase, doesn't look for existing solutions, doesn't understand context.
What Bulletproof does: AI launches parallel research agents. Each one digs into a different area - project structure, patterns, dependencies, tests. At the same time, it searches the web: who's already solved this? What libraries exist? What's the proven best practice?
The key thing: The output isn't a list of options. It's a concrete recommendation: "the best approach is X, because Y." The AI has to make a decision and defend it, not dump the choice on you.
The pain: AI starts writing code without defining what exactly needs to be done. No criteria for "done." It ends up building the wrong thing, or building too much.
What Bulletproof does: Creates a specification: WHAT we're building and WHY. Not how - just what. With clear acceptance criteria - an objective measure of "done" that the AI can't argue with later.
The key thing: The spec is a contract. When the AI checks its own work at Stage 5, it checks against this contract, not against its gut feeling of "seems about right."
The pain: AI grabs the first solution that pops into its head. Doesn't consider alternatives, doesn't think about consequences.
What Bulletproof does: AI creates a plan. But before it can start coding, it has to pass the Challenge Loop - answer 3 questions:
The key thing: AI can't start coding until it has proven that its plan is the best option available. Not "I think so" - "here are 3 options, here's the comparison, here's why this one wins."
The pain: AI writes code in one big chunk, context fills up, quality drops. No tests, no iterations.
What Bulletproof does: Implementation is split into phases. Each phase runs in a fresh context window (so the AI doesn't get dumber as it goes). Order: tests first (TDD), then code. Phases with no dependencies can run in parallel across separate terminals.
The key thing: The 40% rule. AI output quality degrades when context fills beyond 40%. Bulletproof runs /clear between stages and passes context through handoff documents. The AI always works in its "smart zone."
The pain: AI says "done" - but half the criteria aren't met. Or it did extra stuff nobody asked for.
What Bulletproof does: AI opens the spec and walks through every acceptance criterion: implemented? Where exactly in the code? Anything in there that wasn't part of the task?
The key thing: It doesn't check based on vibes. It checks against the contract. Every criterion - yes or no. If no - go back and finish.
The pain: AI "finds bugs" that aren't bugs. Fixes things that aren't broken. Makes "improvements" that create real problems.
What Bulletproof does: Three-step check. Step 1 - find errors (logic, security, performance). Step 2 - prove every bug is real. Can you reproduce it? No? Then it's not a bug, don't touch it. Step 3 - logic and efficiency review.
The key thing: The rule is "don't fix code for aesthetics or just in case." Every fix without proof is a risk of introducing a new bug. Early AI code reviewers flagged 9 false positives for every 1 real bug (Anthropic). This stage cuts out 90% of wasted work.
The pain: The code works. But it broke something somewhere else. You find out a week later.
What Bulletproof does: Mandatory check before merge: (1) What modules depend on the changed files? Run ALL project tests, not just the current phase. (2) Did any contracts change - APIs, types, interfaces? Are all consumers updated? (3) What could go wrong in a month? With zero data? With a million records? With concurrent requests? (4) Backward compatibility? Migrations needed?
The key thing: 75% of AI agents break working code - precisely because this stage doesn't exist. Dependency graph analysis cuts regressions by 70% (TDAD/arXiv).
All phases done - full test suite across the entire project. Audit: is everything from the spec actually implemented?
The pain: AI reviews its own code and thinks it's great.
What Bulletproof does: New session. Fresh context. A separate agent that has never seen the implementation. Checks edge cases, race conditions, security, performance.
The key thing: AI reviewing AI, but without the implementer's bias. For critical code - you still need a human, and Bulletproof says so explicitly.
Automated vulnerability scanning. AI-generated code has 2-3x more security issues than human-written code. This catches them.
Found issues? Fix only proven bugs (Stage 6 rule still applies). After fixes - run impact analysis again. Fixes break code more often than original development does.
Archive the plan. Squash merge. Deploy - only when you explicitly say so.
This one lives outside the stages. It's a Stop hook that fires every time the AI tries to wrap up. It checks:
If yes - blocks completion and sends the AI back to finish.
Not every task needs all 12 stages:
| Size | What | Stages |
|------|------|--------|
| S - bug fix, 1-2 files | Lightweight | Research → Build → Self-Audit → Verify → Impact → Gates |
| M - feature, 3-10 files | Standard | Stages 1-10 |
| L - architecture, 10+ files | Full pipeline | All 12 stages |
Not theory. Every mechanism is backed by research:
| Mechanism | Source |
|-----------|--------|
| 40% context rule | HumanLayer |
| Challenge Loop (justify decisions) | Addy Osmani, spec-first workflow |
| False-positive filter | Anthropic Code Review |
| Impact Analysis (dependency graphs) | SWE-CI (Alibaba), TDAD/arXiv |
| Anti-rationalization | Trail of Bits |
| Phase separation | RIPER-5, Spotify Engineering |
```bash
# Into your project
mkdir -p .claude/skills && git clone https://github.com/artemiimillier/bulletproof.git .claude/skills/bulletproof

# Global (all projects)
mkdir -p ~/.claude/skills && git clone https://github.com/artemiimillier/bulletproof.git ~/.claude/skills/bulletproof

# For teams
git submodule add https://github.com/artemiimillier/bulletproof.git .claude/skills/bulletproof
```
Open Claude Code → type /bulletproof → done.
Or