by kyegomez
A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.
# Add to your Claude Code skills
```shell
git clone https://github.com/kyegomez/OpenMythos
```
Disclaimer: OpenMythos is an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic or any of their proprietary systems.
OpenMythos is an open-source, theoretical implementation of the Claude Mythos model. It implements a Recurrent-Depth Transformer (RDT) with three stages: a Prelude (standard transformer blocks), a looped Recurrent Block (run up to max_loop_iters times), and a final Coda. Attention is switchable between MLA and GQA, and the feed-forward network is a sparse MoE with routed and shared experts, making it well suited for exploring compute-adaptive, depth-variable reasoning.
## Installation

```shell
pip install open-mythos
# or: uv pip install open-mythos
```
## Quickstart

```python
import torch

from open_mythos.main import OpenMythos, MythosConfig

attn_type = "mla"  # or "gqa"

base = {
    "vocab_size": 1000,
    "dim": 256,
    "n_heads": 8,
    "max_seq_len": 128,
    "max_loop_iters": 4,
    "prelude_layers": 1,
    "coda_layers": 1,
    "n_experts": 8,
    "n_shared_experts": 1,
    "n_experts_per_tok": 2,
    "expert_dim": 64,
    "lora_rank": 8,
    "attn_type": attn_type,
}

if attn_type == "gqa":
    cfg = MythosConfig(**base, n_kv_heads=2)
else:
    cfg = MythosConfig(
        **base,
        n_kv_heads=8,
        kv_lora_rank=32,
        q_lora_rank=64,
        qk_rope_head_dim=16,
        qk_nope_head_dim=16,
        v_head_dim=16,
    )

model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"\n[{attn_type.upper()}] Parameters: {total:,}")

ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)
print(f"[{attn_type.upper()}] Logits shape: {logits.shape}")

out = model.generate(ids, max_new_tokens=8, n_loops=8)
print(f"[{attn_type.upper()}] Generated shape: {out.shape}")

A = model.recurrent.injection.get_A()
print(
    f"[{attn_type.upper()}] Spectral radius ρ(A) max: {A.max().item():.4f} (must be < 1)"
)
```
## Model sizes

Pre-configured scales from 1B to 1T parameters:
```python
from open_mythos import (
    mythos_1b,
    mythos_3b,
    mythos_10b,
    mythos_50b,
    mythos_100b,
    mythos_500b,
    mythos_1t,
    OpenMythos,
)

cfg = mythos_3b()  # returns a MythosConfig
model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")
```
| Variant | dim | Experts | expert_dim | Loop iters | Context | Max output |
|---|---|---|---|---|---|---|
| mythos_1b | 2048 | 64 | 2048 | 16 | 4k | 4k |
| mythos_3b | 3072 | 64 | 4096 | 16 | 4k | 4k |
| mythos_10b | 4096 | 128 | 5632 | 24 | 8k | 4k |
| mythos_50b | 6144 | 256 | 9728 | 32 | 8k | 4k |
| mythos_100b | 8192 | 256 | 13568 | 32 | 1M | 128k |
| mythos_500b | 12288 | 512 | 23040 | 48 | 1M | 128k |
| mythos_1t | 16384 | 512 | 34560 | 64 | 1M | 128k |
## Training

The training script for the 3B model on FineWeb-Edu is at `training/3b_fine_web_edu.py`.

Single GPU:

```shell
python training/3b_fine_web_edu.py
```

Multi-GPU (auto-detects GPU count):

```shell
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py
```
Key design choices:
| Feature | Detail |
|---|---|
| Optimizer | AdamW |
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT by default, swap to sample-100BT or default for full run) |
| Tokenizer | openai/gpt-oss-20b via MythosTokenizer |
| Parallelism | PyTorch DDP via torchrun, sharded streaming dataset |
| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
| Schedule | Linear warmup (2000 steps) → cosine decay |
| Target | 30B tokens (~Chinchilla-adjusted for looped architecture) |
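The warmup-then-decay schedule from the table can be sketched as a standalone function. The step counts and learning rates below are illustrative defaults, not the actual values used by `training/3b_fine_web_edu.py`:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The LR peaks exactly at the end of warmup and decays smoothly to `min_lr` by `total_steps`; steps beyond the budget are clamped to `min_lr`.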
## Documentation

| Page | Description |
|---|---|
| docs/open_mythos.md | Full API reference for the OpenMythos class — constructor, forward, generate, all sub-modules, configuration reference, and usage examples |
| docs/datasets.md | Recommended training datasets with token budget guidance per model size |
## Architecture

Claude Mythos is suspected to be a Recurrent-Depth Transformer (RDT), also called a Looped Transformer (LT). Rather than stacking hundreds of unique layers, a subset of layers is recycled and run through multiple times per forward pass. Same weights. More loops. Deeper thinking.
This is not chain-of-thought. There is no intermediate token output. All of this reasoning happens silently, inside a single forward pass, in continuous latent space.
A looped transformer divides its layers into three functional blocks:
```
Input
  ↓
[Prelude P] — standard transformer layers, run once
  ↓
[Recurrent Block R] — looped T times
  ↑_______↓ (hidden state h updated each loop with input injection e)
  ↓
[Coda C] — standard transformer layers, run once
  ↓
Output
```
The recurrent block update rule at each loop step t:

```
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
```

Where:

- `h_t` is the hidden state after loop t
- `e` is the encoded input (from the Prelude), injected at every loop
- `A` and `B` are learned injection parameters

The injection of `e` at every step is what prevents the model from drifting: it keeps the original input signal alive throughout the entire recurrence depth.
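The update rule can be sketched in plain PyTorch. This is a minimal illustration, not the actual OpenMythos internals: the transformer contribution is stubbed out as a single linear layer, and `A` is kept diagonal for simplicity.

```python
import torch
import torch.nn as nn

class RecurrentBlockSketch(nn.Module):
    """Minimal sketch of h_{t+1} = A·h_t + B·e + Transformer(h_t, e)."""

    def __init__(self, dim):
        super().__init__()
        # Learned injection parameters; diagonal A initialized below 1 for stability
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.ones(dim))
        # Stand-in for the real looped transformer block
        self.block = nn.Linear(2 * dim, dim)

    def forward(self, e, n_loops):
        h = torch.zeros_like(e)  # initial hidden state
        for _ in range(n_loops):
            # e is re-injected at every step, anchoring the recurrence to the input
            h = self.A * h + self.B * e + self.block(torch.cat([h, e], dim=-1))
        return h
```

Because `e` is re-injected at every iteration, the input signal never decays out of the state, however many loops are run.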
The full implementation is in open_mythos/main.py. See the OpenMythos class reference for a detailed API walkthrough, configuration options, and usage examples.
Vanilla transformers fail to compose pieces of knowledge in combinations they have never seen during training. Looped transformers pass this test, with the ability emerging through a three-stage grokking process.
This is why Mythos feels qualitatively different from other models on novel questions — the capability phase-transitions in, rather than emerging gradually.
Train on 5-hop reasoning chains. Test on 10-hop. Vanilla transformer fails. Looped transformer succeeds — by running more inference-time loops. This maps directly to the observation that Mythos handles deeply compositional problems (multi-step math, long-horizon planning, layered arguments) without explicit chain-of-thought.
More loops at inference = deeper reasoning chains = harder problems solved.
Each loop iteration is the functional equivalent of one step of chain-of-thought, but operating in continuous latent space rather than token space. A looped model running T loops implicitly simulates T steps of CoT reasoning. This has been formally proven (Saunshi et al., 2025).
Furthermore, continuous latent thoughts — unlike discrete token outputs — can encode multiple alternative next steps simultaneously. This allows something closer to breadth-first search over the reasoning space, rather than a single committed reasoning path. The model is effectively exploring many possible directions inside each forward pass before converging.
A looped model with k layers run L times achieves the quality of a kL-layer non-looped model, with only k layers worth of parameters. For Mythos-scale deployments, this matters enormously:
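The depth-for-parameters trade can be stated as simple arithmetic (layer counts here are illustrative, not any specific Mythos configuration):

```python
def effective_depth(k_layers: int, n_loops: int) -> int:
    """A k-layer recurrent block looped L times behaves like a k*L-layer stack."""
    return k_layers * n_loops

def recurrent_param_fraction(n_loops: int) -> float:
    """Recurrent-block parameters stored, as a fraction of the equivalent unrolled stack."""
    return 1.0 / n_loops

# A 4-layer block looped 16 times matches a 64-layer stack in depth,
# while storing only 1/16 of that stack's recurrent-block parameters.
print(effective_depth(4, 16))           # 64
print(recurrent_param_fraction(16))     # 0.0625
```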
Training looped models is notoriously unstable. The dominant failure mode is the hidden state h_t growing unboundedly across loops.

The fix is to recast looping as a discrete linear time-invariant (LTI) dynamical system over the residual stream. Ignoring the nonlinear Transformer contribution, the recurrence becomes:

```
h_{t+1} = A·h_t + B·e
```

For this LTI system, stability requires the spectral radius ρ(A) to be strictly less than 1, which guarantees h_t stays bounded no matter how many loops are run.
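When ρ(A) < 1, the linear part of the recurrence not only stays bounded but converges geometrically to the fixed point h* = (I − A)⁻¹·B·e. A minimal numerical check, using a small diagonal A for simplicity:

```python
import torch

dim = 4
A = torch.diag(torch.full((dim,), 0.5))  # spectral radius 0.5 < 1 → stable
B = torch.eye(dim)
e = torch.randn(dim)

# Iterate the linear part of the looped update
h = torch.zeros(dim)
for _ in range(50):
    h = A @ h + B @ e

# Closed-form fixed point: h* = (I - A)^{-1} B e
h_star = torch.linalg.solve(torch.eye(dim) - A, B @ e)
print(torch.allclose(h, h_star, atol=1e-5))  # → True
```

With ρ(A) = 0.5, the distance to the fixed point halves every loop, so 50 iterations land well inside the tolerance; with ρ(A) ≥ 1 the same iteration would diverge.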