A Claude Code plugin that turns natural language into blueprints, blueprints into parallel build plans, and build plans into working software with automated iteration, validation, and cross-model peer review.
# Add to your Claude Code skills
git clone https://github.com/JuliusBrussee/cavekit
You describe what you want. Cavekit writes the contract. Agents build from the contract. Every line of code traces to a requirement. Every requirement has acceptance criteria. Nothing gets lost, nothing gets guessed.
> Build me a task management API
(agent writes 2000 lines)
(no tests)
(forgot the auth middleware)
(wrong database schema)
(you spend 3 hours fixing it)
One shot. No validation. No traceability. The agent guessed what you wanted.
> /ck:sketch
4 kits, 22 requirements, 69 criteria
> /ck:map
34 tasks across 5 dependency tiers
> /ck:make
18 iterations — each validated against
the spec before committing
CAVEKIT COMPLETE
Every requirement traced. Every criterion checked.
Same feature. Zero guesswork. Full traceability.
AI coding agents are powerful, but they fail the same way every time:
| Failure | What Happens |
|---------|-------------|
| Context loss | Agent forgets what it said three steps ago |
| No validation | Code written, never verified against intent |
| No parallelism | One agent, one task, one branch — even when work is independent |
| No iteration | Single pass produces a rough draft, not production code |
Cavekit fixes all four.
Instead of "prompt and pray," Cavekit puts a specification layer between your intent and the code.
┌─── Task 1 ─── Agent A ───┐
│ │
You ── /ck:sketch ──► Kits ── /ck:map ──► Build Site ──┤─── Task 2 ─── Agent B ───┤──► done
│ │
└─── Task 3 ─── Agent C ───┘
Kits are the source of truth. Agents read them, build from them, validate against them. When something breaks, the system traces the failure back to the kit — not the code.
Spec is the product. Code is the derivative.
git clone https://github.com/JuliusBrussee/cavekit.git ~/.cavekit
cd ~/.cavekit && ./install.sh
Registers the plugin with Claude Code, syncs it into the Codex marketplace, and installs the cavekit CLI. Restart Claude Code after installing.
Requires: Claude Code, git, macOS/Linux.
Optional: Codex (npm install -g @openai/codex) — adds adversarial review. Cavekit works without it. Codex makes it significantly harder to ship flawed specs and broken code.
Four phases. Each one a slash command.
| Phase | Command | What happens | Produces |
|-------|---------|--------------|----------|
| RESEARCH (optional) | /ck:research | Multi-agent codebase + web research | Research brief |
| DRAFT | /ck:sketch | "What are we building?" Codex challenges the design | Kits with R-numbered requirements |
| ARCHITECT | /ck:map | Break into tasks, map dependencies, organize into a tiered build site + dependency graph | Task graph |
| BUILD | /ck:make | Auto-parallel: work grouped into adaptive subagent packets, tier by tier; Codex reviews every tier gate | |
| INSPECT | /ck:check | Gap analysis: built vs. intended; peer review; trace to specs | Findings report |
/ck:research "build a C++ compiler"
Dispatches 2–8 parallel subagents to explore the codebase and search the web for best practices, library landscape, reference implementations, and common pitfalls. A synthesizer agent cross-validates findings and produces a research brief in context/refs/.
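Fan-out-then-synthesize is a standard pattern; as a rough sketch of the shape only (the `explore` and `synthesize` callables stand in for the actual subagent calls, which the source doesn't specify):

```python
from concurrent.futures import ThreadPoolExecutor

def research(topics, explore, synthesize, max_agents=8):
    """Fan out one explorer per topic (capped at max_agents),
    then hand all findings to a single synthesizer step."""
    with ThreadPoolExecutor(max_workers=min(max_agents, len(topics))) as pool:
        # pool.map preserves topic order in the returned findings
        findings = list(pool.map(explore, topics))
    return synthesize(findings)

brief = research(
    ["library landscape", "common pitfalls"],
    explore=lambda t: f"notes on {t}",
    synthesize=lambda fs: " | ".join(fs),
)
print(brief)  # notes on library landscape | notes on common pitfalls
```

The synthesizer sees every explorer's output at once, which is what makes cross-validation of findings possible.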
/ck:design
Creates or imports a DESIGN.md design system — a cross-cutting constraint layer enforced across the entire pipeline. Every kit references its design tokens, every task carries a Design Ref, every build result is audited for violations.
| Sub-command | What it does |
|------------|-------------|
| /ck:design create | Generate new DESIGN.md via guided Q&A |
| /ck:design import | Extract DESIGN.md from existing codebase |
| /ck:design audit | Check implementation against DESIGN.md |
| /ck:design update | Revise DESIGN.md, log to changelog |
/ck:sketch
Describe what you're building in natural language. Cavekit decomposes it into domain kits — structured documents with numbered requirements (R1, R2, ...) and testable acceptance criteria. Stack-independent. Human-readable.
After internal review, kits go to Codex for a design challenge — adversarial review that catches decomposition flaws, missing requirements, and ambiguous criteria before any code is written.
For existing codebases: /ck:sketch --from-code reverse-engineers kits from your code and identifies gaps.
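The source doesn't show a kit on disk, but to make "numbered requirements with testable acceptance criteria" concrete, a purely hypothetical excerpt (invented names and numbering, not Cavekit's actual file format) might look like:

```markdown
# Kit: tasks (hypothetical excerpt)

R1. Users can create tasks with a title, priority, and optional due date.
    - AC1.1: POST /tasks with a valid body returns 201 and the created task.
    - AC1.2: A missing title returns 422 with a validation error.

R2. Every task belongs to exactly one project.
    - AC2.1: Creating a task with an unknown project_id returns 404.
```

Note the shape: requirements are stack-independent statements of intent, while each acceptance criterion is something a validator can actually check.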
/ck:map
Reads all kits. Breaks requirements into tasks. Maps dependencies. Organizes into a tiered build site — a dependency graph where Tier 0 has no deps, Tier 1 depends only on Tier 0, and so on. Includes a Coverage Matrix mapping every acceptance criterion to its task(s). Nothing specified gets lost in translation.
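The tiering rule (Tier 0 has no deps, Tier N depends only on lower tiers) is a longest-path layering of the dependency graph. A minimal sketch of the idea, with invented task names, not Cavekit's implementation:

```python
from collections import defaultdict

def build_tiers(deps):
    """Layer tasks so that tier N depends only on tiers below N.
    deps maps each task to the set of tasks it depends on."""
    tiers = {}

    def tier_of(task, seen=()):
        if task in tiers:
            return tiers[task]
        if task in seen:
            raise ValueError(f"dependency cycle at {task!r}")
        prereqs = deps.get(task, set())
        # a task sits one tier above its deepest prerequisite
        tiers[task] = (
            0 if not prereqs
            else 1 + max(tier_of(p, seen + (task,)) for p in prereqs)
        )
        return tiers[task]

    for task in deps:
        tier_of(task)
    grouped = defaultdict(list)
    for task, tier in tiers.items():
        grouped[tier].append(task)
    return {t: sorted(grouped[t]) for t in sorted(grouped)}

print(build_tiers({
    "schema": set(),
    "models": {"schema"},
    "auth": {"models"},
    "tasks-api": {"models", "auth"},
}))
# {0: ['schema'], 1: ['models'], 2: ['auth'], 3: ['tasks-api']}
```

Everything inside one tier is mutually independent, which is exactly what makes the parallel fan-out in /ck:make safe.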
/ck:make
Pre-flight coverage check validates all acceptance criteria are covered. Then the loop runs:
┌──────────────────────────────────────────────────────┐
│ │
│ Read build site → Find next unblocked task │
│ │ │
│ ▼ │
│ Load relevant kit + acceptance criteria │
│ │ │
│ ▼ │
│ Implement the task │
│ │ │
│ ▼ │
│ Validate (build + tests + acceptance criteria) │
│ │ │
│ ├── PASS → commit → mark done → next ──┐ │
│ │ │ │
│ └── FAIL → diagnose → fix → revalidate │ │
│ │ │
│ ◄────────────────────────────────────────────┘ │
│ │
│ Loop until: all tasks done OR limit reached │
└──────────────────────────────────────────────────────┘
At every tier boundary, Codex adversarial review gates advancement. P0/P1 findings must be fixed before the next tier starts. With speculative review (default), this adds near-zero latency.
Post-flight verification cross-references what was built against original kits. Gaps get remediation tasks.
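Conceptually, that cross-reference is a set difference over acceptance-criterion IDs; a toy sketch (IDs like "R1.2" are invented for illustration, and the real check is richer than ID matching):

```python
def postflight_gaps(specified, validated):
    """Return a remediation task for every acceptance criterion
    that appears in the kits but was never validated in the build."""
    return [f"remediate:{c}" for c in sorted(set(specified) - set(validated))]

print(postflight_gaps(
    specified=["R1.1", "R1.2", "R2.1"],
    validated=["R1.1", "R2.1"],
))
# ['remediate:R1.2']
```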
/ck:check
Gap analysis: built vs. specified. Peer review: bugs, security, missed requirements. Everything traced back to kit requirements.
Greenfield:
> /ck:sketch
What are you building?
> A REST API for task management. Users, projects, tasks
with priorities and due dates. PostgreSQL.
Created 4 kits (22 requirements, 69 acceptance criteria)
Next: /ck:map
> /ck:map
Generated build site: 34 tasks, 5 tiers
Next: /ck:make
> /ck:make
Loop activated — 34 tasks, 20 max iterations.
...
All tasks done. Build passes. Tests pass.
CAVEKIT COMPLETE — 34 tasks in 18 iterations.
Existing codebase:
> /ck:sketch --from-code
Exploring codebase... Next.js 14, Prisma, NextAuth.
Created 6 kits — 4 requirements are gaps (not yet implemented).
> /ck:map --filter collaboration
Generated build site: 8 tasks, 3 tiers
> /ck:make
CAVEKIT COMPLETE — 8 tasks in 8 iterations.
See example.md for a full walkthrough.