by michaelabrt
Studying the gap between what agents know and when they act on it.
# Add to your Claude Code skills

```shell
git clone https://github.com/michaelabrt/clarte
```

> [!IMPORTANT]
> This is an experimental research project, not a polished product. The findings are based on 700+ controlled sessions and 30+ experiments, but the real-world evaluation covers a small number of tasks. We’re sharing it early because the results are interesting enough to warrant wider testing. Contributions, replications and skepticism are welcome.
We ran 30+ experiments across 700+ agent sessions to find what measurably changes agent behavior.
First, we measured how agents spend their time across 170 sessions and 7,595 turns.
We assumed the fix was better information. So we built 15 context enrichments: instability metrics, facade maps, API surfaces, type-aware ordering, task-relevant weighting. Each benchmarked in isolation and combination.
Zero wins. Not one survived our combinatorial benchmark at realistic temperature. Three optimizations that individually showed -26%, -16% and -32% improvements combined to +63% overhead.
Then we found the placebo. A minimal context file - just the project language and test framework, two lines, zero analysis - performed identically to our full 2,000-token enrichment. The content was irrelevant. The file’s existence alone suppressed the agent’s exploration phase.
The real signal turned out to be first-edit timing. Strong correlation with session length across most tasks tested. Each delayed turn adds ~1.3 total turns. With context, agents start editing around turn 5. Without, turn 8. They find the right files on their own given enough time. They just lack the confidence to stop reading and start editing.
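A back-of-the-envelope reading of that correlation, as a sketch (the linear model is our assumption about how to apply the ~1.3 figure, not something measured directly):

```typescript
// Rough linear model: each turn of delayed first edit adds ~1.3 total turns.
// The baseline of turn 5 matches the with-context figure quoted above.
const TURNS_PER_DELAYED_TURN = 1.3;

function extraTurns(firstEditTurn: number, baselineTurn = 5): number {
  // No penalty if the agent starts editing at or before the baseline turn.
  return Math.max(0, firstEditTurn - baselineTurn) * TURNS_PER_DELAYED_TURN;
}

// Without context, first edit lands around turn 8: three delayed turns,
// roughly four extra turns of session length.
console.log(extraTurns(8));
```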
So we stopped injecting information. We started injecting confidence: instead of telling the agent what’s important, we tell it which files to edit.
For the full research story, see docs/research.md. All 30+ experiment writeups are in docs/experiments/.
Clarté is the experimental application of these findings. It parses your source code with tree-sitter, builds a weighted dependency graph from imports, call sites and git history, and on every prompt predicts which files need editing. The predictions go to a pre-flight agent that reads each target once and returns exact edit locations.
The full query pipeline runs in under 100ms. The Architecture section has the math.
```shell
npx @michaelabrt/clarte
```
Zero config. Works with Claude Code, Cursor, Copilot, Windsurf, Cline and OpenCode. TypeScript, Python, Go, Rust, Java.
```shell
npm install -g @michaelabrt/clarte --omit=optional
```
These are promising but based on limited evaluation. Treat them as directional, not definitive.
Real-world tests - 5 bug fixes in open-source repos (opaque prompts, Claude Sonnet, small n per task):
| Task | Repo | Without Clarté | With Clarté | n |
|------|------|----------------|-------------|---|
| JSX async context loss | Hono | wrong file, did not finish | correct file, 2 min to first edit | 2+2 |
| Form validator prototype pollution | Hono | did not finish | completed (18 turns) | 1+1 |
| SQLite simple-enum array | TypeORM | 47.7 turns | 16.3 turns (-66%) | 3+3 |
| WebSocket adapter shutdown | NestJS | 53 turns | 38 turns (-28%) | 7+7 |
| URL fragment stripping | Hono | completed, high variance | completed, 3x more consistent | 8+8 |
Baseline completed 3/5 within budget. With Clarté, 5/5. These are the controlled, reproducible runs from a larger iterative development process (hundreds of sessions across more tasks and repos). The 32 experiment writeups and 7 studies document the full research arc.
Fixture benchmarks (v0, context file only - no hooks or pre-flight):
| Metric | Without Context | With Context | Delta | Significance |
|--------|----------------|--------------|-------|--------------|
| Wall-clock time (median) | 130s | 98s | -25% | p<0.001, small effect |
| Turns (median) | 16 | 11.5 | -28% | p<0.001, medium effect |
| Input tokens (median) | 272K | 108K | -60% | p<0.001, large effect |
135 sessions (Claude Sonnet 4.6), 9 opaque tasks, statistical testing with Wilcoxon signed-rank, bootstrap CIs, Benjamini-Hochberg FDR correction and Cliff’s delta effect sizes. Methodology and full reports in the benchmark repo.
This project benefits from wider testing; if you’re interested, contributions, replications and skepticism are all welcome.
# Architecture

```mermaid
graph TD
    subgraph offline ["Build Phase · offline"]
        A[tree-sitter] --> B[Dependency Graph]
        C[git log] --> D[Change Coupling]
        B --> E["HITS · Betweenness · Communities"]
        D --> F[Bayesian EWMA Priors]
        E & D --> G[Logistic Fusion Training]
    end
    subgraph prompt ["Query Phase · per prompt · sub-100ms"]
        H[Task Prompt] --> I["① BM25F Seed Resolution"]
        I --> J["② LSA Seed Expansion"]
        J --> K["③ Katz Propagation"]
        K --> L["④ Score Fusion"]
        L --> M[Pre-flight Agent]
    end
    B -.-> I
    G -.-> L
    F -.-> K
    M --> N((Agent))
```
You submit a task: "fix the JWT session leak." Two problems need solving.
Lexical matching. The query tokens "JWT" and "session" should match files like auth/jwt.ts or session/manager.ts. Clarté runs true multi-field BM25F (Robertson et al. 2004) across three document fields: file path segments, exported symbol names and import statements, each with independent length normalization and field weights.
Path segments are weighted 2x higher than symbols. auth/middleware.ts tells you more about a session-handling bug than a function named validate. Import names get 0.5x because they signal consumption, not definition. The query is tokenized with camelCase splitting, compound-word preservation and domain-specific synonym expansion (auth → authentication, db → database). IDF is computed globally across the corpus.
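The tokenization step can be sketched as follows. This is illustrative only: the splitting rules are simplified, and the synonym table entries beyond the two examples above are assumptions, not Clarté's actual rules.

```typescript
// Illustrative query tokenizer: camelCase splitting plus a small
// domain-synonym table. Synonym entries are assumptions for this sketch.
const SYNONYMS: Record<string, string[]> = {
  auth: ["authentication"],
  db: ["database"],
};

function tokenize(query: string): string[] {
  const tokens = query
    // Split camelCase boundaries: "validateJWT" -> "validate JWT".
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
  // Expand synonyms while keeping each original token.
  return tokens.flatMap((t) => [t, ...(SYNONYMS[t] ?? [])]);
}

console.log(tokenize("fix authDb validateJWT leak"));
```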
Each query term's contribution is its IDF weighted by a saturated pseudo-term-frequency that blends all three fields before applying the k₁ = 1.2 saturation constant. Blending before saturating is what makes this true BM25F rather than per-field BM25+.
$$\text{score}(d, q) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{\widetilde{tf}(t, d)}{\widetilde{tf}(t, d) + k_1}$$
$$\widetilde{tf}(t, d) = \sum_{f \in \lbrace \text{path, sym, imp} \rbrace} w_f \cdot \frac{tf_{f}(t, d)}{1 - b_f + b_f \cdot |d_f| \, / \, \overline{dl}_f}$$
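A direct transcription of the two formulas, as a sketch rather than Clarté's implementation. The field weights and b values mirror the parameter table below; b_imp is not given in the excerpt, so 0.5 here is a placeholder assumption, and corpus IDF is taken as a precomputed input.

```typescript
type Field = "path" | "sym" | "imp";

const K1 = 1.2;
const W: Record<Field, number> = { path: 2.0, sym: 1.0, imp: 0.5 };
// b_imp = 0.5 is an assumed placeholder; b_path and b_sym come from the table.
const B: Record<Field, number> = { path: 0.3, sym: 0.4, imp: 0.5 };
const FIELDS: Field[] = ["path", "sym", "imp"];

interface Doc {
  tf: Record<Field, Record<string, number>>; // per-field term counts
  len: Record<Field, number>;                // per-field lengths
}

function bm25f(
  doc: Doc,
  query: string[],
  idf: Record<string, number>,          // precomputed global IDF
  avgLen: Record<Field, number>,        // average field lengths over corpus
): number {
  let score = 0;
  for (const t of query) {
    // Weighted pseudo-term-frequency: blend all fields BEFORE saturation.
    let tfTilde = 0;
    for (const f of FIELDS) {
      const tf = doc.tf[f][t] ?? 0;
      const norm = 1 - B[f] + B[f] * (doc.len[f] / avgLen[f]);
      tfTilde += W[f] * (tf / norm);
    }
    // Single saturation per term, shared across fields (true BM25F).
    score += (idf[t] ?? 0) * (tfTilde / (tfTilde + K1));
  }
  return score;
}
```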
Three post-processing steps refine the candidate set: spreading activation propagates scores along import edges for 3 hops with 0.5^(hop-1) decay; test proxy scoring transfers test file scores to their source files at 0.6x (test paths encode what they cover); and an import ceiling caps re-export barrels at 0.5x the minimum direct-match score.
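The spreading-activation step can be sketched like this, assuming a simple per-seed breadth-first walk where a file h hops from a seed receives the seed's score scaled by 0.5^(h-1); the graph representation is an assumption for the sketch.

```typescript
// Propagate seed scores along import edges for up to 3 hops,
// decaying by 0.5^(hop-1). Contributions from multiple seeds add up.
function spreadActivation(
  seeds: Map<string, number>,           // file -> initial match score
  imports: Map<string, string[]>,       // file -> files it imports
  hops = 3,
  decay = 0.5,
): Map<string, number> {
  const out = new Map(seeds);
  for (const [seed, score] of seeds) {
    let frontier = [seed];
    const seen = new Set([seed]);       // visit each file once per seed
    for (let h = 1; h <= hops; h++) {
      const contribution = score * Math.pow(decay, h - 1);
      const next: string[] = [];
      for (const file of frontier) {
        for (const dep of imports.get(file) ?? []) {
          if (seen.has(dep)) continue;
          seen.add(dep);
          out.set(dep, (out.get(dep) ?? 0) + contribution);
          next.push(dep);
        }
      }
      frontier = next;
    }
  }
  return out;
}
```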
Conceptual matching. BM25F will never connect a bug report about "session tokens" to a file named SessionGuard.ts that exports validateJWT. No surface tokens overlap.
Latent Semantic Analysis bridges this gap. We build a file-symbol incidence matrix and compute a rank-32 approximation via randomized truncated SVD (Halko-Martinsson-Tropp algorithm). Files project into a 32-dimensional latent space where cosine similarity captures shared structural role rather than shared tokens.
The top BM25F seeds are averaged into a centroid vector. Non-seed files within cosine distance 0.3 enter the candidate pool at 0.4x discount, expanding the set with up to 5 conceptually related files. Activates only on codebases with 50+ files; below that, BM25F alone has sufficient coverage.
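The expansion step above can be sketched as follows, assuming the rank-32 latent vectors are already computed; the candidate's score here is represented simply as the 0.4x discount factor applied to an admitted file.

```typescript
// Cosine similarity between two latent vectors of equal dimension.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Average the seed vectors into a centroid, then admit the closest
// non-seed files within cosine distance 0.3, capped at 5, at 0.4x.
function expandSeeds(
  vectors: Map<string, number[]>,       // file -> latent (e.g. rank-32) vector
  seeds: Set<string>,
  maxDist = 0.3,
  discount = 0.4,
  limit = 5,
): Map<string, number> {
  const dim = vectors.get([...seeds][0])!.length;
  const centroid = new Array<number>(dim).fill(0);
  for (const s of seeds) {
    vectors.get(s)!.forEach((v, i) => (centroid[i] += v / seeds.size));
  }
  const admitted = [...vectors]
    .filter(([file]) => !seeds.has(file))
    .map(([file, v]) => [file, 1 - cosine(centroid, v)] as [string, number])
    .filter(([, dist]) => dist <= maxDist)
    .sort((x, y) => x[1] - y[1])        // closest first
    .slice(0, limit);
  return new Map(admitted.map(([file]) => [file, discount] as [string, number]));
}
```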
Sub-millisecond for typical codebases (1,000 files, 20 imports/file).
| Parameter | Value | Role |
|-----------|-------|------|
| k₁ | 1.2 | Saturation constant |
| w_path | 2.0 | Path field weight |
| w_sym | 1.0 | Symbol field weight |
| w_imp | 0.5 | Import field weight |
| b_path | 0.3 | Path length normalization |
| b_sym | 0.4 | Symbol length normalization |