by yogthos
MCP server for token-efficient large document analysis via the use of REPL state
# Add to your Claude Code skills
git clone https://github.com/yogthos/Matryoshka
Process documents 100x larger than your LLM's context window—without vector databases or chunking heuristics.
LLMs have fixed context windows. Traditional solutions (RAG, chunking) lose information or miss connections across chunks. RLM takes a different approach: the model reasons about your query and outputs symbolic commands that a logic engine executes against the document.
Based on the Recursive Language Models paper.
Unlike traditional approaches where an LLM writes arbitrary code, RLM uses Nucleus—a constrained symbolic language based on S-expressions. The LLM outputs Nucleus commands, which are parsed, type-checked, and executed by Lattice, our logic engine.
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   User Query    │────▶│   LLM Reasons   │────▶│ Nucleus Command │
│ "total sales?"  │     │  about intent   │     │  (sum RESULTS)  │
└─────────────────┘     └─────────────────┘     └────────┬────────┘
                                                         │
┌─────────────────┐     ┌─────────────────┐     ┌────────▼────────┐
│  Final Answer   │◀────│ Lattice Engine  │◀────│     Parser      │
│   13,000,000    │     │    Executes     │     │    Validates    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
Why this works better than code generation:
The LLM outputs commands in the Nucleus DSL—an S-expression language designed for document analysis:
; Search for patterns
(grep "ERROR")
; Filter results
(filter RESULTS (lambda x (match x "timeout" 0)))
; Aggregate
(sum RESULTS) ; Auto-extracts numbers from lines
(count RESULTS) ; Count matching items
; Final answer
<<<FINAL>>>13000000<<<END>>>
Matryoshka has two execution paths and not every primitive works in both:
| Feature | runRLM (CLI / programmatic) | lattice-mcp (MCP server) |
|---|---|---|
| (grep …), (filter …), (map …), etc. | ✅ | ✅ |
| (llm_query …), (llm_batch …) | ✅ | ✅ via MCP sampling protocol |
| (rlm_query …), (rlm_batch …) | ✅ (concurrent rlm_batch) | ✅ The child Nucleus session spawns via the same MCP sampling bridge, with M suspensions per rlm_query call. rlm_batch runs sequentially (one child at a time) because the multi-turn suspension protocol carries only one pending request at a time, so concurrent children would lose suspensions. Round-trip count is the same; wall-clock time is N× slower for non-sampling clients. |
| (context N) selector | ✅ (multi-doc via runRLMFromContent(query, string[])) | partial — (context 0) works; multi-doc loading not exposed via lattice_load |
| (grep "X" haystack) | ✅ | ✅ |
| (show_vars) | ✅ | ✅ (internal _<name> bindings filtered out) |
| FINAL_VAR(name) resolution | ✅ | N/A — MCP returns query results directly |
| maxTimeoutMs / maxChars / maxErrors | ✅ | ❌ — MCP has its own session timeout |
| compactionThresholdChars | ✅ | ❌ — MCP doesn't have a multi-turn FSM history |
The resource-limit features remain runRLM-only. The recursive primitives (rlm_query/rlm_batch) work in both paths — the MCP path spawns a child runRLMFromContent whose llmClient is the same sampling bridge as the parent, so each child turn flows through the existing MCP suspension/sampling protocol.
rlm_query spawns a child Nucleus session with its own FSM loop. The child runs to FINAL and returns a string — useful when a sub-task needs multi-turn reasoning over a structured handle:
; Child sees the resolved handle as its working document, NOT a
; JSON-stringified prompt blob. Lets the child use grep/lines/
; chunk_by_lines over arrays without JSON-syntax noise.
(rlm_query "extract dates" (context RESULTS))
; No (context …) → child's document is the prompt itself.
(rlm_query "summarize each error type")
rlm_batch runs the same per-item recursion across a collection. Each item produces one entry in the returned array, in input order. Per-item failures surface as "Error: rlm_batch item N failed — …" strings without aborting the rest of the batch:
(rlm_batch (chunk_by_lines 100)
(lambda c (rlm_query "extract metrics" (context c))))
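The batch semantics described here (one result per item, input order preserved, per-item failures turned into strings) can be sketched with a small bounded worker pool. This is an illustration only, not Matryoshka's implementation: `runBatch`, `lane`, and the zero-based item index in the error string are assumptions. The `limit` parameter stands in for a concurrency cap such as maxConcurrentSubcalls.

```typescript
// Hypothetical sketch of rlm_batch-style fan-out; not Matryoshka's code.
async function runBatch<T>(
  items: T[],
  worker: (item: T, index: number) => Promise<string>,
  limit = 4, // stands in for a cap like maxConcurrentSubcalls
): Promise<string[]> {
  const results = new Array<string>(items.length);
  let next = 0; // shared cursor; safe because JS runs one callback at a time

  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = await worker(items[i] as T, i);
      } catch (err) {
        // A failed item becomes an error string; the rest of the batch continues.
        const msg = err instanceof Error ? err.message : String(err);
        results[i] = `Error: rlm_batch item ${i} failed — ${msg}`;
      }
    }
  }

  // Start at most `limit` lanes; each lane pulls the next unclaimed item.
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results; // results[i] corresponds to items[i]
}

runBatch(["chunk-a", "chunk-b", "chunk-c"], async (c) => `metrics for ${c}`, 2)
  .then((out) => console.log(out));
```

Calling the same function with `limit` set to 1 gives one-at-a-time execution, which is the shape of the sequential path: the same number of child calls, only slower in wall-clock terms.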
- runRLM: children fan out concurrently via a worker pool capped at maxConcurrentSubcalls (default 4).
- lattice-mcp: children run sequentially because the multi-turn suspension protocol can carry only one pending request at a time. Round-trip count is identical to the concurrent path (N children × M turns each); only wall-clock time differs.

Pass string[] to runRLMFromContent to load multiple documents. Address them via (context N); index 0 is the default for primitives that don't specify a haystack:
(grep "DEPLOY" (context 0)) ; deploy.log
(grep "OUTAGE" (context 2)) ; comms.log
; (context N) is just a term — pipe it anywhere a string is expected
(rlm_query "scan" (context (context 1))) ; child sees doc 1
Per-doc line numbers come back, so the LLM can cite "doc 0 line 4, doc 2 line 2" with confidence rather than inventing absolute offsets across a concatenation.
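To make the per-document addressing concrete, here is a minimal sketch of a grep that reports hits as (document index, line number) pairs instead of offsets into a concatenation. The `Hit` shape, the `grepDocs` name, and the 1-based line numbering are assumptions for illustration; the README does not show the engine's actual result type.

```typescript
// Hypothetical result shape; not the engine's real type.
interface Hit {
  doc: number;  // index into the documents array, as in (context N)
  line: number; // line number within that document (1-based here, an assumption)
  text: string;
}

function grepDocs(docs: string[], pattern: string, docIndex = 0): Hit[] {
  const doc = docs[docIndex];
  if (doc === undefined) throw new RangeError(`no document at index ${docIndex}`);
  const re = new RegExp(pattern); // no "g" flag, so test() keeps no state
  return doc
    .split("\n")
    .map((text, i) => ({ doc: docIndex, line: i + 1, text }))
    .filter((hit) => re.test(hit.text));
}

const sampleDocs = ["boot\nDEPLOY v2 started", "noise", "all clear\nOUTAGE reported"];
console.log(grepDocs(sampleDocs, "DEPLOY"));    // searches document 0 by default
console.log(grepDocs(sampleDocs, "OUTAGE", 2)); // searches document 2
```

Because each hit carries its own document index, a citation like "doc 2 line 2" can be read straight off the result.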
(show_vars) ; Returns a string summary of every binding currently
; in scope. Useful before a (filter RESULTS …) or a
; FINAL_VAR(name) reference when the LLM lost track of
; what's bound. Same surface as the `lattice_bindings`
; MCP tool but reachable from inside a query.
Unknown FINAL_VAR markers surface a clear error rather than passing the literal text through:
<<<FINAL>>>FINAL_VAR(_99)<<<END>>>
→ "[FINAL_VAR error: unknown binding "_99". Available: _1, RESULTS]"
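The marker handling above can be sketched in a few lines. The marker syntax and the error text are copied from the examples in this README. The function name, the use of a `Map` for bindings, and returning `null` when no marker is present are assumptions, not the project's actual API.

```typescript
// Hypothetical helper; marker syntax and error wording follow the README examples.
const FINAL_RE = /<<<FINAL>>>([\s\S]*?)<<<END>>>/;
const FINAL_VAR_RE = /^FINAL_VAR\(([^)]+)\)$/;

function resolveFinal(response: string, bindings: Map<string, string>): string | null {
  const body = FINAL_RE.exec(response)?.[1]?.trim();
  if (body === undefined) return null; // no final marker in this turn

  const name = FINAL_VAR_RE.exec(body)?.[1];
  if (name === undefined) return body; // literal answer, nothing to resolve

  const value = bindings.get(name);
  if (value !== undefined) return value;

  // Unknown binding: report it instead of passing the literal text through.
  const available = Array.from(bindings.keys()).join(", ");
  return `[FINAL_VAR error: unknown binding "${name}". Available: ${available}]`;
}

const bindings = new Map([["_1", "42 matches"], ["RESULTS", "line a\nline b"]]);
console.log(resolveFinal("<<<FINAL>>>FINAL_VAR(_99)<<<END>>>", bindings));
```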
All optional. With none set, behavior is unchanged:
runRLM(query, file, {
maxTimeoutMs: 30_000, // wall-clock cap, propagates to children
maxChars: 100_000, // cumulative chars sent + received
maxErrors: 5, // consecutive parse/execution errors
compactionThresholdChars: 50_000, // summarize history when prompt grows past this
})
When a limit hits, the run terminates cleanly with a string of the form:
[aborted: timeout 32100ms of 30000ms]
Best partial answer:
<the most recent meaningful solver result>
The partial answer is always preserved when present — completed work is never silently lost on abort.
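A minimal sketch of how such an abort string could be assembled is shown below. Only the timeout wording comes from the example above; `checkTimeout` and `formatAbort` are hypothetical names, and the messages for the other limits are not shown in this README, so they are left out.

```typescript
// Hypothetical helpers; only the timeout wording is taken from the README.
function checkTimeout(elapsedMs: number, maxTimeoutMs?: number): string | null {
  if (maxTimeoutMs === undefined) return null; // option unset: no limit applies
  return elapsedMs > maxTimeoutMs
    ? `timeout ${elapsedMs}ms of ${maxTimeoutMs}ms`
    : null;
}

function formatAbort(reason: string, partial?: string): string {
  const head = `[aborted: ${reason}]`;
  // Keep completed work: append the latest meaningful result when there is one.
  return partial ? `${head}\nBest partial answer:\n${partial}` : head;
}

const abortReason = checkTimeout(32_100, 30_000);
if (abortReason !== null) {
  console.log(formatAbort(abortReason, "42 ERROR lines found so far"));
}
```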
The Lattice engine (src/logic/) processes Nucleus commands:
- lc-parser.ts: parses S-expressions into an AST
- type-inference.ts: validates types before execution
- constraint-resolver.ts: handles symbolic constraints like [Σ⚡μ]
- lc-solver.ts: executes commands against the document

Lattice uses miniKanren (a relational programming engine) for pattern classification and filtering operations.
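To illustrate the parsing stage, here is a small S-expression reader that turns a Nucleus-style command into a nested array. It is a sketch under simplifying assumptions (no escape handling inside strings, one expression per input) and is not the code in lc-parser.ts. The `Sexp` type and function names are hypothetical.

```typescript
// Hypothetical mini-reader for Nucleus-style S-expressions; not lc-parser.ts.
type Sexp = string | number | { str: string } | Sexp[];

function tokenize(src: string): string[] {
  // Alternatives: ; comment | ( | ) | "quoted string" | bare atom
  const tokens: string[] = src.match(/;[^\n]*|\(|\)|"(?:[^"\\]|\\.)*"|[^\s()";]+/g) ?? [];
  return tokens.filter((t) => t[0] !== ";"); // drop comments
}

function parse(src: string): Sexp {
  const tokens = tokenize(src);
  let pos = 0;

  function read(): Sexp {
    const tok = tokens[pos++];
    if (tok === undefined) throw new SyntaxError("unexpected end of input");
    if (tok === ")") throw new SyntaxError("unexpected )");
    if (tok === "(") {
      const list: Sexp[] = [];
      while (tokens[pos] !== ")") {
        if (pos >= tokens.length) throw new SyntaxError("missing )");
        list.push(read());
      }
      pos++; // consume ")"
      return list;
    }
    if (tok[0] === '"') return { str: tok.slice(1, -1) }; // string literal, quotes removed
    const n = Number(tok);
    return Number.isNaN(n) ? tok : n; // number if it parses, otherwise a symbol
  }

  const expr = read();
  if (pos !== tokens.length) throw new SyntaxError("trailing tokens");
  return expr;
}

console.log(JSON.stringify(parse('(filter RESULTS (lambda x (match x "timeout" 0)))')));
```

Wrapping string literals in `{ str }` keeps them distinct from symbols such as RESULTS, which a later validation stage would need in order to tell a binding reference from a search pattern.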
For large result sets, RLM uses a handle-based architecture with in-memory SQLite (src/persistence/) that achieves 97%+ token savings:
Traditional: LLM sees full array [15,000 tokens for 1000 results]
Handle-based: LLM sees stub [50 tokens: "$grep_error: Array(1000) [preview...]"]
How it works:
Example handles: $grep_error, $bm25_timeout, $filter_status.

Handle names are auto-generated from the Nucleus command: (grep "ERROR") produces $grep_error, (list_symbols "function") produces $list_symbols_function. Repeated commands get a numeric suffix ($grep_error_2, $grep_error_3).
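The naming rule can be approximated as "join the word-like parts of the command, lowercase them, and number the repeats". The sketch below reproduces the examples given here, but the real scheme is not documented in this README and likely differs for nested commands. `HandleNamer` and `nameFor` are hypothetical names.

```typescript
// Hypothetical naming scheme inferred from the README's examples only.
class HandleNamer {
  private counts = new Map<string, number>();

  nameFor(command: string): string {
    // Keep word-like pieces: (grep "ERROR") -> ["grep", "ERROR"]
    const words: string[] = command.match(/[A-Za-z0-9_]+/g) ?? [];
    const base = "$" + words.join("_").toLowerCase();
    const seen = (this.counts.get(base) ?? 0) + 1;
    this.counts.set(base, seen);
    // First use keeps the bare name; repeats get _2, _3, ...
    return seen === 1 ? base : `${base}_${seen}`;
  }
}

const namer = new HandleNamer();
console.log(namer.nameFor('(grep "ERROR")'));
console.log(namer.nameFor('(list_symbols "function")'));
console.log(namer.nameFor('(grep "ERROR")'));
```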
The Lattice engine doubles as a context memory for LLM agents. Instead of roundtripping large text blobs in every message, agents stash context server-side and carry only compact handle stubs:
Agent reads file, summarizes → lattice_memo "auth architecture"
→ $memo_auth_architecture: "auth architecture" (2.1KB, 50 lines)
20 messages later, needs it → lattice_expand $memo_auth_architecture
→ Full 50-line summary
Token math (30-message session, 3 source files stashed):
Memos persist across document loads (lattice_load clears query handles but keeps memos), support LRU eviction (100 memo cap, 10MB budget), and can be explicitly deleted when stale. No document needs to be loaded to use memos.
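The eviction behavior can be pictured as a least-recently-used store with two budgets. The sketch below is an assumption-heavy illustration, not the code in src/persistence/: the class and method names are hypothetical, and size is approximated by character count rather than true byte length. The defaults mirror the 100-memo and 10MB caps mentioned above.

```typescript
// Hypothetical memo store with LRU eviction; not the project's implementation.
class MemoStore {
  private memos = new Map<string, string>(); // Map iterates in insertion order
  private used = 0; // approximate size: character count, not exact bytes
  private maxCount: number;
  private maxSize: number;

  constructor(maxCount = 100, maxSize = 10 * 1024 * 1024) {
    this.maxCount = maxCount;
    this.maxSize = maxSize;
  }

  put(handle: string, text: string): void {
    this.delete(handle); // replacing a memo also refreshes its recency
    this.memos.set(handle, text);
    this.used += text.length;
    this.evict();
  }

  expand(handle: string): string | undefined {
    const text = this.memos.get(handle);
    if (text !== undefined) {
      // Re-insert so this memo becomes the most recently used.
      this.memos.delete(handle);
      this.memos.set(handle, text);
    }
    return text;
  }

  delete(handle: string): boolean {
    const text = this.memos.get(handle);
    if (text === undefined) return false;
    this.used -= text.length;
    return this.memos.delete(handle);
  }

  private evict(): void {
    // Drop least recently used memos until both budgets are satisfied.
    while (this.memos.size > this.maxCount || this.used > this.maxSize) {
      const oldest = this.memos.keys().next();
      if (oldest.done) break;
      this.delete(oldest.value);
    }
  }
}

const store = new MemoStore();
store.put("$memo_auth_architecture", "Auth uses session tokens stored server-side.");
console.log(store.expand("$memo_auth_architecture"));
```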
The LLM does reasoning, not code generation:
The LLM never writes JavaS