by kyegomez
A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.
# Add to your Claude Code skills
```shell
git clone https://github.com/kyegomez/OpenMythos
```
Disclaimer: OpenMythos is an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic or any of their proprietary systems.
OpenMythos is an open-source, theoretical implementation of the Claude Mythos model. It implements a Recurrent-Depth Transformer (RDT) with three stages: a Prelude (standard transformer blocks), a looped Recurrent Block (run up to max_loop_iters times), and a final Coda. Attention is switchable between MLA and GQA, and the feed-forward network is a sparse MoE with routed and shared experts, making it well suited for exploring compute-adaptive, depth-variable reasoning.
## Installation

```shell
pip install open-mythos
# or: uv pip install open-mythos
```
## Quickstart

```python
import torch

from open_mythos.main import OpenMythos, MythosConfig

attn_type = "mla"  # or "gqa"

base = {
    "vocab_size": 1000,
    "dim": 256,
    "n_heads": 8,
    "max_seq_len": 128,
    "max_loop_iters": 4,
    "prelude_layers": 1,
    "coda_layers": 1,
    "n_experts": 8,
    "n_shared_experts": 1,
    "n_experts_per_tok": 2,
    "expert_dim": 64,
    "lora_rank": 8,
    "attn_type": attn_type,
}

if attn_type == "gqa":
    cfg = MythosConfig(**base, n_kv_heads=2)
else:
    cfg = MythosConfig(
        **base,
        n_kv_heads=8,
        kv_lora_rank=32,
        q_lora_rank=64,
        qk_rope_head_dim=16,
        qk_nope_head_dim=16,
        v_head_dim=16,
    )

model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"\n[{attn_type.upper()}] Parameters: {total:,}")

ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)
print(f"[{attn_type.upper()}] Logits shape: {logits.shape}")

out = model.generate(ids, max_new_tokens=8, n_loops=8)
print(f"[{attn_type.upper()}] Generated shape: {out.shape}")

A = model.recurrent.injection.get_A()
print(
    f"[{attn_type.upper()}] Spectral radius ρ(A) max: {A.max().item():.4f} (must be < 1)"
)
```
## Model sizes

Pre-configured scales from 1B to 1T parameters:
```python
from open_mythos import (
    mythos_1b,
    mythos_3b,
    mythos_10b,
    mythos_50b,
    mythos_100b,
    mythos_500b,
    mythos_1t,
    OpenMythos,
)

cfg = mythos_3b()  # returns a MythosConfig
model = OpenMythos(cfg)
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total:,}")
```
| Variant | dim | Experts | expert_dim | Loop iters | Context | Max output |
|---|---|---|---|---|---|---|
| mythos_1b | 2048 | 64 | 2048 | 16 | 4k | 4k |
| mythos_3b | 3072 | 64 | 4096 | 16 | 4k | 4k |
| mythos_10b | 4096 | 128 | 5632 | 24 | 8k | 4k |
| mythos_50b | 6144 | 256 | 9728 | 32 | 8k | 4k |
| mythos_100b | 8192 | 256 | 13568 | 32 | 1M | 128k |
| mythos_500b | 12288 | 512 | 23040 | 48 | 1M | 128k |
| mythos_1t | 16384 | 512 | 34560 | 64 | 1M | 128k |
## Training

The training script for the 3B model on FineWeb-Edu is at `training/3b_fine_web_edu.py`.

Single GPU:

```shell
python training/3b_fine_web_edu.py
```

Multi-GPU (auto-detects GPU count):

```shell
torchrun --nproc_per_node=$(python -c "import torch; print(torch.cuda.device_count())") training/3b_fine_web_edu.py
```
Key design choices:
| Feature | Detail |
|---|---|
| Optimizer | AdamW |
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT by default, swap to sample-100BT or default for full run) |
| Tokenizer | openai/gpt-oss-20b via MythosTokenizer |
| Parallelism | PyTorch DDP via torchrun, sharded streaming dataset |
| Precision | bfloat16 on H100/A100, float16 + GradScaler on older GPUs |
| Schedule | Linear warmup (2000 steps) → cosine decay |
| Target | 30B tokens (~Chinchilla-adjusted for looped architecture) |
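The warmup-then-decay schedule from the table can be sketched as a standalone function. The step counts and learning rates below are illustrative defaults, not the actual values used by `training/3b_fine_web_edu.py`:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear ramp from ~0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The LR peaks exactly at the end of warmup and decays smoothly to `min_lr` by `total_steps`; steps beyond the budget are clamped to `min_lr`.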
## Documentation

| Page | Description |
|---|---|
| docs/open_mythos.md | Full API reference for the OpenMythos class — constructor, forward, generate, all sub-modules, configuration reference, and usage examples |
| docs/datasets.md | Recommended training datasets with token budget guidance per model size |
## Architecture

Claude Mythos is suspected to be a Recurrent-Depth Transformer (RDT), also called a Looped Transformer (LT). Rather than stacking hundreds of unique layers, a subset of layers is recycled and run through multiple times per forward pass. Same weights. More loops. Deeper thinking.
This is not chain-of-thought. There is no intermediate token output. All of this reasoning happens silently, inside a single forward pass, in continuous latent space.
A looped transformer divides its layers into three functional blocks:
```
Input
  ↓
[Prelude P] — standard transformer layers, run once
  ↓
[Recurrent Block R] — looped T times
  ↑_______↓ (hidden state h updated each loop with input injection e)
  ↓
[Coda C] — standard transformer layers, run once
  ↓
Output
```
The recurrent block update rule at each loop step t:

```
h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
```

Where:

- `h_t` is the hidden state after loop t
- `e` is the encoded input (from the Prelude), injected at every loop
- `A` and `B` are learned injection parameters

The injection of `e` at every step is what prevents the model from drifting: it keeps the original input signal alive throughout the entire recurrence depth.
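The update rule can be sketched in plain PyTorch. This is a minimal illustration, not the actual OpenMythos internals: the transformer contribution is stubbed out as a single linear layer, and `A` is kept diagonal for simplicity.

```python
import torch
import torch.nn as nn

class RecurrentBlockSketch(nn.Module):
    """Minimal sketch of h_{t+1} = A·h_t + B·e + Transformer(h_t, e)."""

    def __init__(self, dim):
        super().__init__()
        # Learned injection parameters; diagonal A initialized below 1 for stability
        self.A = nn.Parameter(torch.full((dim,), 0.9))
        self.B = nn.Parameter(torch.ones(dim))
        # Stand-in for the real looped transformer block
        self.block = nn.Linear(2 * dim, dim)

    def forward(self, e, n_loops):
        h = torch.zeros_like(e)  # initial hidden state
        for _ in range(n_loops):
            # e is re-injected at every step, anchoring the recurrence to the input
            h = self.A * h + self.B * e + self.block(torch.cat([h, e], dim=-1))
        return h
```

Because `e` is re-injected at every iteration, the input signal never decays out of the state, however many loops are run.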
The full implementation is in open_mythos/main.py. See the OpenMythos class reference for a detailed API walkthrough, configuration options, and usage examples.
Vanilla transformers fail to compose pieces of knowledge in combinations they have never seen during training. Looped transformers pass this test, with the ability emerging through a three-stage grokking process.
This is why Mythos feels qualitatively different from other models on novel questions — the capability phase-transitions in, rather than emerging gradually.
Train on 5-hop reasoning chains. Test on 10-hop. Vanilla transformer fails. Looped transformer succeeds — by running more inference-time loops. This maps directly to the observation that Mythos handles deeply compositional problems (multi-step math, long-horizon planning, layered arguments) without explicit chain-of-thought.
More loops at inference = deeper reasoning chains = harder problems solved.
Each loop iteration is the functional equivalent of one step of chain-of-thought, but operating in continuous latent space rather than token space. A looped model running T loops implicitly simulates T steps of CoT reasoning. This has been formally proven (Saunshi et al., 2025).
Furthermore, continuous latent thoughts — unlike discrete token outputs — can encode multiple alternative next steps simultaneously. This allows something closer to breadth-first search over the reasoning space, rather than a single committed reasoning path. The model is effectively exploring many possible directions inside each forward pass before converging.
A looped model with k layers run L times achieves the quality of a kL-layer non-looped model, with only k layers worth of parameters. For Mythos-scale deployments, this matters enormously:
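The depth-for-parameters trade can be stated as simple arithmetic (layer counts here are illustrative, not any specific Mythos configuration):

```python
def effective_depth(k_layers: int, n_loops: int) -> int:
    """A k-layer recurrent block looped L times behaves like a k*L-layer stack."""
    return k_layers * n_loops

def recurrent_param_fraction(n_loops: int) -> float:
    """Recurrent-block parameters stored, as a fraction of the equivalent unrolled stack."""
    return 1.0 / n_loops

# A 4-layer block looped 16 times matches a 64-layer stack in depth,
# while storing only 1/16 of that stack's recurrent-block parameters.
print(effective_depth(4, 16))           # 64
print(recurrent_param_fraction(16))     # 0.0625
```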
Training looped models is notoriously unstable. The dominant failure mode is the hidden state h_t growing unboundedly across loops.

The fix is to recast looping as a discrete linear time-invariant (LTI) dynamical system over the residual stream. Ignoring the nonlinear Transformer contribution, the recurrence becomes:

```
h_{t+1} = A·h_t + B·e
```

For this LTI system, stability requires the spectral radius ρ(A) to be strictly less than 1, which guarantees h_t stays bounded no matter how many loops are run.
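When ρ(A) < 1, the linear part of the recurrence not only stays bounded but converges geometrically to the fixed point h* = (I − A)⁻¹·B·e. A minimal numerical check, using a small diagonal A for simplicity:

```python
import torch

dim = 4
A = torch.diag(torch.full((dim,), 0.5))  # spectral radius 0.5 < 1 → stable
B = torch.eye(dim)
e = torch.randn(dim)

# Iterate the linear part of the looped update
h = torch.zeros(dim)
for _ in range(50):
    h = A @ h + B @ e

# Closed-form fixed point: h* = (I - A)^{-1} B e
h_star = torch.linalg.solve(torch.eye(dim) - A, B @ e)
print(torch.allclose(h, h_star, atol=1e-5))  # → True
```

With ρ(A) = 0.5, the distance to the fixed point halves every loop, so 50 iterations land well inside the tolerance; with ρ(A) ≥ 1 the same iteration would diverge.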