Make every model cheaper or better. Zero-config behavioral harness for Claude Code, with a reproducible 408-run benchmark
# Add to your Claude Code skills
git clone https://github.com/vitaliikapliuk/modelharness

modelharness is an open-source AI agents skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by vitaliikapliuk. It has 61 GitHub stars.
modelharness is primarily written in Python. It is open-source under vitaliikapliuk on GitHub, so you can review or fork the full source.
Make every model cheaper or better. Measured on four Claude models — none got worse.
What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:
| Practice | What it changes |
|---|---|
| ✅ Grounded progress | Only claims backed by a tool result; "tests fail" said plainly |
| ⚡ Act, don't overplan | Enough information means act, not narrate options |
| 🎯 Autonomy calibration | Decides minor things itself, asks only on scope or destructive actions |
| 🔍 Self-verification loops | A checkable definition of done, real checks on a cadence, a fresh-context verifier before "done" |
| 🔀 Delegation triggers | Explicit rules for when to fan work out to subagents |
| 📝 Cross-session memory | Writes lessons and plans to files, so the next session can pick up the work |
| Category | Tasks | Runs | What it tests |
|---|---|---|---|
| 🐛 Bug hunts | 4 | 96 | Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover |
| ✨ Features from spec | 4 | 96 | Build to a written spec: retry backoff, config merging, cursor pagination, slugify |
| ♻️ Refactors | 2 | 48 | Restructure code with zero behavior change, verified structurally |
| 🧠 Long-horizon builds | 2 | 48 | Multi-stage pipelines where later steps depend on earlier decisions |
| 🧩 Spec-dense traps | 3 | 72 | 18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading |
| 🔁 Session handoffs | 2 | 48 | A fresh session must finish another session's work; memory is the only bridge |
17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.
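The run count can be cross-checked against the per-category figures with a few lines of Python. This is only an illustrative sketch: it assumes the 8 configurations are the four benchmarked models, each run plain and with modelharness, which the text implies but does not spell out. The variable names are made up for this example and are not taken from the repository.

```python
from itertools import product

# Task counts per category, as listed in the benchmark overview.
TASKS_PER_CATEGORY = {
    "bug hunts": 4,
    "features from spec": 4,
    "refactors": 2,
    "long-horizon builds": 2,
    "spec-dense traps": 3,
    "session handoffs": 2,
}

MODELS = ["Opus 4.8", "Fable 5", "Sonnet 4.6", "Haiku 4.5"]
# Assumption: each model is run in two variants, giving 4 x 2 = 8 configurations.
VARIANTS = ["plain", "+modelharness"]
ATTEMPTS = 3

configurations = list(product(MODELS, VARIANTS))
runs_per_task = len(configurations) * ATTEMPTS

# Every task is run once per (configuration, attempt) pair.
runs_per_category = {
    name: count * runs_per_task for name, count in TASKS_PER_CATEGORY.items()
}
total_tasks = sum(TASKS_PER_CATEGORY.values())
total_runs = sum(runs_per_category.values())

print(total_tasks, runs_per_task, total_runs)
```

Under that assumption each task accounts for 24 runs, which matches the per-category run counts quoted for the task groups.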
Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).
The bottom line. modelharness packages the same working practices Fable 5 was trained on. The practices land hardest on Opus 4.8 — the flagship model available on every subscription — at −14% cost / −16% time, and that win is statistically significant (see below). Even Fable 5, competing against itself, runs significantly faster. On smaller models the average hides a trade: Haiku saves up to 19% on routine bugfixes but spends more on spec-dense tasks — extra verification work that is exactly what lifted its pass rate from 98% to 100%. Cheaper where it can be, more careful where it must be — and never significantly worse on any model.
Averages can hide noise, so we ran the honest test: pair each model's plain vs +modelharness runs on the same task (3 reps averaged), take the per-task percentage delta, and put a 95% confidence interval around the mean across all 17 tasks. A CI that clears zero is a real effect; one that straddles zero is within run-to-run noise. Regenerate with `python3 bench/stats.py`.
| Model | Cost Δ (95% CI) | Time Δ (95% CI) | Tasks cheaper |
|---|---|---|---|
| Opus 4.8 | −12.0% [−17.3, −6.7] · significant | −16.5% [−25.3, −7.7] · significant | 15 / 17 |
| Fable 5 | −3.2% [−10.6, +4.2] · within noise | −11.4% [−20.1, −2.8] · significant | 8 / 17 |
| Sonnet 4.6 | −4.0% [−11.3, +3.3] · within noise | −7.8% [−15.7, +0.0] · within noise | 10 / 17 |
| Haiku 4.5 | +0.3% [−8.7, +9.3] · within noise | −4.5% [−17.6, +8.6] · within noise | 9 / 17 |
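The paired comparison behind these intervals can be sketched roughly as follows. This is not the contents of `bench/stats.py`; it is a minimal illustration that assumes a Student's t interval on the per-task deltas. The real script may use a different interval (for example a bootstrap), and the function name and input layout here are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t for 16 degrees of freedom
# (17 tasks minus 1). Assumption: the actual script may compute its CI differently.
T_CRIT_95_DF16 = 2.120


def paired_delta_ci(plain, harness, t_crit=T_CRIT_95_DF16):
    """Mean per-task percentage delta with a confidence interval.

    plain / harness: dict mapping task name -> list of per-run costs (or times).
    Returns (mean_delta_pct, (low, high), significant).
    """
    deltas = []
    for task in sorted(plain):
        base = mean(plain[task])         # average the repetitions without the harness
        treated = mean(harness[task])    # average the repetitions with the harness
        deltas.append(100.0 * (treated - base) / base)

    centre = mean(deltas)
    # Standard error of the mean delta across tasks (sample standard deviation).
    half_width = t_crit * stdev(deltas) / sqrt(len(deltas))
    low, high = centre - half_width, centre + half_width

    # "Significant" here means the interval lies entirely on one side of zero.
    significant = high < 0 or low > 0
    return centre, (low, high), significant
```

Pairing by task removes the large differences in difficulty between tasks from the comparison, so the interval reflects only how consistently the harness shifts cost or time. If the number of tasks changes, `t_crit` has to be replaced with the critical value for the new degrees of freedom.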
What this means, stated plainly: the harness delivers a statistically significant cost-and-time reduction on Opus 4.8 — the model most people run on a subscription — and a significant speed-up on Fable 5. For Sonnet 4.6 and Haiku 4.5 the cost and time changes are within noise: not a reliable saving, but never a reliable loss either. Quality is not a sampled average — it is an exact binary count: 407 of 408 runs passed, and the one failure (bare Haiku 4.5 on a session-handoff task) is fixed 3/3 by the harness. So the defensible claim is narrow and true: **Opus gets meaningfully cheaper and faster, every model gets a memory-dri