modelharness

Name: modelharness
Author: vitaliikapliuk


Make every model cheaper or better. Zero-config behavioral harness for Claude Code, with a reproducible 408-run benchmark

61 stars

1 fork

Python

Installation

# Add to your Claude Code skills
git clone https://github.com/vitaliikapliuk/modelharness


Frequently Asked Questions

What is modelharness?

modelharness is an open-source AI agent skill for AI coding assistants such as Claude Code, Codex CLI, and ChatGPT, built by vitaliikapliuk. Its description reads: "Make every model cheaper or better. Zero-config behavioral harness for Claude Code, with a reproducible 408-run benchmark." It has 61 GitHub stars.

Is modelharness safe to use?

modelharness's catalog security scan is still queued, so no scan result is available yet.

How do I install modelharness?

Clone the repository with "git clone https://github.com/vitaliikapliuk/modelharness" and add it to your Claude Code skills directory (see the Installation section above).

What programming language is modelharness written in?

modelharness is primarily written in Python. It is open-source under vitaliikapliuk on GitHub, so you can review or fork the full source.

Are there alternatives to modelharness?

Yes. SkillsLLM lists many other AI agent skills that you can browse and compare side by side with modelharness.



modelharness

Make every model cheaper or better. Measured on four Claude models — none got worse.

What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:


✅ Grounded progress: only claims backed by a tool result; "tests fail" said plainly
⚡ Act, don't overplan: enough information means act, not narrate options
🎯 Autonomy calibration: decides minor things itself, asks only on scope or destructive actions
🔍 Self-verification loops: a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done"
🔀 Delegation triggers: explicit rules for when to fan work out to subagents
📝 Cross-session memory: writes lessons and plans to files, so the next session can pick up the work

What the 17 tasks test


🐛 Bug hunts · 4 tasks · 96 runs. Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover
✨ Features from spec · 4 tasks · 96 runs. Build to a written spec: retry backoff, config merging, cursor pagination, slugify
♻️ Refactors · 2 tasks · 48 runs. Restructure code with zero behavior change, verified structurally
🧠 Long-horizon builds · 2 tasks · 48 runs. Multi-stage pipelines where later steps depend on earlier decisions
🧩 Spec-dense traps · 3 tasks · 72 runs. 18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading
🔁 Session handoffs · 2 tasks · 48 runs. A fresh session must finish another session's work; memory is the only bridge

17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.
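
The run count above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repository: it assumes the 8 configurations are the four benchmarked models, each run plain and with modelharness, and every name in it is hypothetical.

```python
from itertools import product

# Task counts per category, as listed in the task overview.
TASKS_PER_CATEGORY = {
    "Bug hunts": 4,
    "Features from spec": 4,
    "Refactors": 2,
    "Long-horizon builds": 2,
    "Spec-dense traps": 3,
    "Session handoffs": 2,
}

# Assumption: 8 configurations = 4 models x 2 variants (plain, +modelharness).
MODELS = ["Opus 4.8", "Fable 5", "Sonnet 4.6", "Haiku 4.5"]
VARIANTS = ["plain", "+modelharness"]
ATTEMPTS = 3

configurations = list(product(MODELS, VARIANTS))

# Every task is attempted ATTEMPTS times under every configuration.
runs_per_task = ATTEMPTS * len(configurations)
runs_per_category = {
    name: count * runs_per_task for name, count in TASKS_PER_CATEGORY.items()
}

total_tasks = sum(TASKS_PER_CATEGORY.values())
total_runs = sum(runs_per_category.values())

print(total_tasks, len(configurations), total_runs)
```

Under that assumption each task accounts for 24 runs, which matches the per-category run counts in the list above (for example, 3 spec-dense tasks give 72 runs).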

The results

Figure: Cost per task with and without modelharness

The numbers, exactly

Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).

The bottom line. modelharness packages the same working practices Fable 5 was trained on. The practices land hardest on Opus 4.8 — the flagship model available on every subscription — at −14% cost / −16% time, and that win is statistically significant (see below). Even Fable 5, competing against itself, runs significantly faster. On smaller models the average hides a trade: Haiku saves up to 19% on routine bugfixes but spends more on spec-dense tasks — extra verification work that is exactly what lifted its pass rate from 98% to 100%. Cheaper where it can be, more careful where it must be — and never significantly worse on any model.

How confident are we?

Averages can hide noise, so we ran the honest test: pair each model's plain vs +modelharness runs on the same task (3 reps averaged), take the per-task percentage delta, and put a 95% confidence interval around the mean across all 17 tasks. A CI that clears zero is a real effect; one that straddles zero is within run-to-run noise. Regenerate with python3 bench/stats.py.

Model	Cost Δ (95% CI)	Time Δ (95% CI)	Tasks cheaper
Opus 4.8	−12.0% [−17.3, −6.7] · significant	−16.5% [−25.3, −7.7] · significant	15 / 17
Fable 5	−3.2% [−10.6, +4.2] · within noise	−11.4% [−20.1, −2.8] · significant	8 / 17
Sonnet 4.6	−4.0% [−11.3, +3.3] · within noise	−7.8% [−15.7, +0.0] · within noise	10 / 17
Haiku 4.5	+0.3% [−8.7, +9.3] · within noise	−4.5% [−17.6, +8.6] · within noise	9 / 17
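
The paired comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the contents of bench/stats.py: it assumes a Student's t interval over the 17 per-task deltas, and the function names are hypothetical. Each input list holds one value per task, already averaged over the 3 repetitions.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 16 degrees of freedom
# (17 tasks minus 1). Using a t interval is an assumption of this sketch.
T_CRIT_95_DF16 = 2.120


def percent_deltas(plain, harness):
    """Per-task percentage change of the harness run relative to the plain run."""
    return [100.0 * (h - p) / p for p, h in zip(plain, harness)]


def mean_with_ci(deltas, t_crit):
    """Mean of the per-task deltas with a symmetric confidence interval."""
    mean = statistics.fmean(deltas)
    half_width = t_crit * statistics.stdev(deltas) / math.sqrt(len(deltas))
    return mean, mean - half_width, mean + half_width


def verdict(low, high):
    """An interval that excludes zero counts as a real effect."""
    return "significant" if low > 0 or high < 0 else "within noise"
```

Applied to the intervals in the table, `verdict(-17.3, -6.7)` returns "significant" (Opus 4.8 cost) and `verdict(-11.3, 3.3)` returns "within noise" (Sonnet 4.6 cost). With 17 tasks, `mean_with_ci(deltas, T_CRIT_95_DF16)` would produce the mean and bounds for one model and one metric.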

What this means, stated plainly: the harness delivers a statistically significant cost-and-time reduction on Opus 4.8 — the model most people run on a subscription — and a significant speed-up on Fable 5. For Sonnet 4.6 and Haiku 4.5 the cost and time changes are within noise: not a reliable saving, but never a reliable loss either. Quality is not a sampled average — it is an exact binary count: 407 of 408 runs passed, and the one failure (bare Haiku 4.5 on a session-handoff task) is fixed 3/3 by the harness. So the defensible claim is narrow and true: **Opus gets meaningfully cheaper and faster, every model gets a memory-dri