by AnamKwon
Claude Code skill that applies Naur's Programming as Theory Building to coding-agent workflows.
# Add to your Claude Code skills
git clone https://github.com/AnamKwon/programming-as-theory-building-skill
A Claude Code plugin and reusable coding-agent skill that turns code generation from prompt completion into theory-preserving engineering work.
Most coding-agent failures are not syntax failures. They are theory failures: the agent writes code that looks right, but does not understand the invariant the code protects, why the current boundary exists, where the change belongs, or what behavior proves the change is correct.
The skill is grounded in Peter Naur's paper "Programming as Theory Building" (1985). Naur's central claim is that the durable asset in programming is not only the program text, but the programmer's theory of how the program maps real-world affairs into behavior. This skill converts that idea into operational checks for coding agents: map the domain rule, explain the current shape, place the change beside the closest existing facility, and verify the behavior that matters.
General coding agents often produce plausible files that satisfy the prompt surface while missing the program's governing invariant. For code generation, that shows up as:
programming-as-theory-building narrows the agent's behavior around the question Naur's paper makes unavoidable: what theory of the program is being preserved or extended?
The plugin packages one Claude Code skill and one project-level CLAUDE.md guideline file. The skill asks the agent to answer these checks before non-trivial code work:
| Principle | Addresses |
|---|---|
| Rebuild the theory | Context-free patches and wrong assumptions |
| Place by similarity | Misplaced helpers, duplicated domain concepts |
| Keep changes surgical | Drive-by rewrites and unrelated cleanup |
| Avoid speculative flexibility | Bloated abstractions and unused options |
| Verify the theory | Tests that pass without proving the domain rule |
That makes the agent inspect code paths, names, tests, docs, and runtime behavior before editing. It also discourages one-off abstractions and asks for verification tied to the domain behavior, not just syntax.
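As a rough illustration of "verify the theory", the sketch below ties a test to a domain rule rather than to syntax. Everything in it is hypothetical and not taken from the skill or the benchmark: the `inventory` table, the `reserve` and `available` helpers, and the rule itself (a reservation may never drive available stock below zero, and a rejected reservation leaves stock unchanged).

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical SKU holding 3 units."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "sku TEXT PRIMARY KEY, "
        "available INTEGER NOT NULL CHECK (available >= 0))"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 3)")
    conn.commit()
    return conn


def reserve(conn: sqlite3.Connection, sku: str, qty: int) -> bool:
    """Reserve qty units; return False if that would oversell."""
    if qty <= 0:
        raise ValueError("qty must be positive")
    # Conditional update: decrement only when enough stock remains,
    # so the check and the write happen in a single statement.
    cur = conn.execute(
        "UPDATE inventory SET available = available - ? "
        "WHERE sku = ? AND available >= ?",
        (qty, sku, qty),
    )
    conn.commit()
    return cur.rowcount == 1


def available(conn: sqlite3.Connection, sku: str) -> int:
    row = conn.execute(
        "SELECT available FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0]


def test_reservation_preserves_stock_invariant() -> None:
    conn = make_db()
    assert reserve(conn, "widget", 2) is True
    assert available(conn, "widget") == 1
    # The domain rule: an over-reservation is rejected and changes nothing.
    assert reserve(conn, "widget", 2) is False
    assert available(conn, "widget") == 1


test_reservation_preserves_stock_invariant()
```

A test that only checked that `reserve` returns a boolean would pass against an implementation that oversells; the second pair of assertions is what actually exercises the invariant.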
The benchmark compares commerce-backend code generation across three isolated arms:
- skills_off: managed Claude Code skills disabled.
- karpathy_only: only the Karpathy guidelines skill enabled.
- theory_only: only this Programming as Theory Building skill enabled.
Code generation used Claude Haiku through the Claude Code MODEL=haiku setting for every arm. Each generation ran in a fresh temporary workspace, and generated projects were reviewed by a separate Claude Opus review pass using benchmark-codegen-review-v1.
The copied benchmark now contains three prompt families:
- basic-commerce: the original, looser FastAPI + SQLite inventory reservation/order orchestration prompt.
- strict-production: a later, more explicit prompt that specifies endpoints, status codes, error bodies, expiration behavior, stock restoration, 401 auth behavior, and pagination semantics. This maps to benchmark/prompts/strict-commerce.md.
- strict-commerce-no-mcp: the same strict prompt run after MCP usage was disabled in the harness, also using benchmark/prompts/strict-commerce.md. It is reported separately because the execution environment changed.
Because the prompt changed, the headline result is reported by prompt family rather than as one flattened average.
| Prompt family | Arm | n | Avg weighted | Functional | Executability | Test quality | Verdict summary |
|---|---|---|---|---|---|---|---|
| basic-commerce | skills_off | 40 | 71.0 | 61.4 | 68.9 | 65.8 | 12 good, 27 mixed, 1 poor |
| basic-commerce | karpathy_only | 40 | 73.9 | 63.8 | 71.0 | 70.5 | 19 good, 21 mixed |
| basic-commerce | theory_only | 40 | 77.9 | 68.6 | 78.5 | 76.1 | 27 good, 13 mixed |
| strict-production | skills_off | 19 | 80.9 | 76.6 | 74.2 | 80.3 | 4 excellent, 7 good, 8 mixed |
| strict-production | karpathy_only | 19 | 82.5 | 77.5 | 80.5 | 83.2 | 5 excellent, 5 good, 9 mixed |
| strict-production | theory_only | 20 | 83.4 | 81.8 | 77.8 | 83.8 | 4 excellent, 12 good, 4 mixed |
| strict-commerce-no-mcp | skills_off | 10 | 78.5 | 64.3 | 73.9 | 88.0 | 2 excellent, 2 good, 6 mixed |
| strict-commerce-no-mcp | karpathy_only | 9 | 84.6 | 82.8 | 83.7 | 82.9 | 3 excellent, 4 good, 2 mixed |
| strict-commerce-no-mcp | theory_only | 10 | 88.5 | 89.5 | 91.2 | 88.9 | 4 excellent, 6 good |
The basic-commerce prompt is the cleaner test of skill behavior because the prompt leaves more program theory to be inferred. In that family, theory_only won all four run-level comparisons. Its advantage was largest in executability and test quality, where it led skills_off by 9.6 and 10.3 points respectively.
The strict-production prompt raised every arm. It explicitly supplied many rules that the theory-building skill otherwise had to recover: status codes, stock restoration, expiration behavior, idempotency expectations, and pagination semantics. In that stricter family, the gap narrowed; karpathy_only won one run and theory_only won the other.
The MCP-disabled strict run is separated from the earlier strict runs. In that run, theory_only led with 88.5, followed by karpathy_only at 84.6 and skills_off at 78.5.
The overall pattern is that karpathy_only improves readability and compactness, while theory_only more consistently improves domain correctness, executability, and behavioral tests. Across all parseable isolated reviews, theory_only has the best weighted average: 81.0 vs 77.7 for karpathy_only and 74.8 for skills_off. Neither skill eliminates recurring failures by itself: inventory/reservation invariants, idempotency, expiration/state transitions, SQLite isolation, runtime entrypoints, dead code, and README overclaims still appear in reviews.
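The pooled averages quoted above can be approximately reproduced from the per-family rows of the results table by weighting each family's average by its n. The sketch below does that; the `RESULTS` dictionary and the helper names are illustrative only (they are not part of the benchmark harness), and because the table values are already rounded, agreement is only expected to one decimal place.

```python
# (prompt family, arm) -> (n, average weighted score), copied from the results table.
RESULTS = {
    ("basic-commerce", "skills_off"): (40, 71.0),
    ("basic-commerce", "karpathy_only"): (40, 73.9),
    ("basic-commerce", "theory_only"): (40, 77.9),
    ("strict-production", "skills_off"): (19, 80.9),
    ("strict-production", "karpathy_only"): (19, 82.5),
    ("strict-production", "theory_only"): (20, 83.4),
    ("strict-commerce-no-mcp", "skills_off"): (10, 78.5),
    ("strict-commerce-no-mcp", "karpathy_only"): (9, 84.6),
    ("strict-commerce-no-mcp", "theory_only"): (10, 88.5),
}


def pooled_average(arm: str) -> float:
    """Average over all reviews for one arm, weighting each family by its n."""
    rows = [(n, avg) for (_, a), (n, avg) in RESULTS.items() if a == arm]
    total_n = sum(n for n, _ in rows)
    return sum(n * avg for n, avg in rows) / total_n


def family_lead(family: str, arm: str, baseline: str = "skills_off") -> float:
    """Gap in average weighted score between an arm and the baseline in one family."""
    return RESULTS[(family, arm)][1] - RESULTS[(family, baseline)][1]


for arm in ("skills_off", "karpathy_only", "theory_only"):
    print(arm, round(pooled_average(arm), 1))
# skills_off 74.8
# karpathy_only 77.7
# theory_only 81.0

for family in ("basic-commerce", "strict-production", "strict-commerce-no-mcp"):
    print(family, round(family_lead(family, "theory_only"), 1))
# basic-commerce 6.9
# strict-production 2.5
# strict-commerce-no-mcp 10.0
```

The per-family gaps differ considerably (6.9, 2.5, and 10.0 points), so the pooled figure is best read alongside the per-family rows rather than in place of them.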
Run-by-run results, excluded review-output notes, copied raw result folders, manifest join notes, and recurring failure categories are documented in benchmark/README.md.
- benchmark/prompts/
- benchmark/results-20260609.json
- benchmark/raw-results/.skill-codegen-runs/
- benchmark/raw-results/.skill-review-runs/
Option A: Claude Code plugin
/plugin marketplace add AnamKwon/programming-as-theory-building-skill
/plugin install programming-as-theory-building-skill@programming-as-theory-building-skill
For a fork, replace AnamKwon with the account or organization that publishes the repository. The install command is <plugin-name>@<marketplace-id>; this repository uses programming-as-theory-building-skill for both.
Option B: manual Claude Code skill install
mkdir -p ~/.claude/skills/programming-as-theory-building
cp skills/programming-as-theory-building/SKILL.md ~/.claude/skills/programming-as-theory-building/SKILL.md
Option C: per-project CLAUDE.md
cp CLAUDE.md /path/to/project/CLAUDE.md
For Codex CLI, copy the operating rules into AGENTS.md; Codex does not import Claude Code SKILL.md automatically. For Gemini CLI, put the rules in GEMINI.md, or import the skill content with the CLI's memory mechanism.
These guidelines are working if you see:
Theory, Changed, Verified, and Risk.
From the parent experiment workspace, run 10-repeat sets and aggregate results by prompt family:
MODEL=haiku REPEATS=10 ARMS="skills_off karpathy_only theory_only" ./run_skill_codegen_experiment.sh
MODEL=opus ./run_opus_code_review_experiment.sh .skill-codegen-runs/<run_id>
The published benchmark combines multiple 10-repeat batches. Keep prompt revisions and environment changes separate when aggregating; the basic-commerce, strict-production, and strict-commerce-no-mcp groups are not directly interchangeable samples.
The benchmark harness intentionally keeps the both arm out of the default comparison set. ARMS=both remains available as an explicit opt-in, but the default comparison isolates single-skill effects.
Naur, Peter. "Programming as Theory Building." Microprocessing and Microprogramming, vol. 15, no. 5, 1985, pp. 253-261.
.
|-- README.md
|-- PROMOTION.md
|-- LICENSE
|-- CITATION.cff
|-- CLAUDE.md
|-- .claude-plugin/
|   |-- marketplace.json
|   `-- plugin.json
|-- benchmark/
|   |-- README.md
|   |-- prompts/
|   |   |-- README.md
|   |   |-- basic-commerce.md
|   |   `-- strict-commerce.md
|   |-- raw-results/
|   |   |-- .skill-codegen-runs/
|   |   `-- .skill-review-runs/
|   `-- results-20260609.json
|-- skills/
|   `-- programming-as-theory-building/
|       `-- SKILL.md
`-- .gitignore