
OpenMythos v0.5.0 Code Review - Audit Report
OpenMythos collected thousands of GitHub stars and dominated AI discourse for a week. This is what happens when you actually read the code — and why the people who do always arrive too late to matter.

🔎 Executive Summary
OpenMythos is a theoretically sound, research-grade reconstruction of a Recurrent-Depth Transformer grounded in peer-reviewed literature. Core modules (LTIInjection, MLAttention, RecurrentBlock) are mathematically correct and paper-accurate.
Three production-blocking deficiencies prevent training as described: a Python-loop MoE dispatch that cannot scale to any real GPU workload, an absent router_bias update that will cause routing collapse, and a missing ACT ponder loss that prevents the halting mechanism from being learned.
The 770M-vs-1.3B efficiency claim is externally sourced from Parcae (Prairie et al., 2026) and has not been reproduced by this codebase.
1. Narrative vs. Technical Reality
OpenMythos has achieved viral traction through three interlocking narrative frames. Each frame contains a technically grounded kernel wrapped in a layer of framing that community discourse amplifies beyond what the source code supports.
1.1 The Three Narrative Drivers
Narrative Frame | Community Reception | Technical Reality |
| --- | --- | --- |
Claude Mythos reverse-engineering | David-vs-Goliath framing: one developer decoding a closed frontier model drives open-source sentiment | README disclaimer is explicit: "independent, community-driven theoretical reconstruction based solely on publicly available research and speculation." Not a leak. Not a distillation. A hypothesis in code. |
Silent Reasoning / Latent Thought | "AI that thinks in loops, not tokens" resonates with audiences unfamiliar with transformer internals; perceived as a step change in AI capability | Correctly grounded in Saunshi et al. (2025) and Geiping et al. (2025). The recurrent loop is real and correctly implemented. Whether Anthropic uses this architecture is unverified speculation. |
770M achieves 1.3B quality | Parameter-efficiency narrative drives GitHub stars from hardware-constrained developers; positions OpenMythos as democratizing AI | Figure cited verbatim from Parcae (Prairie et al., 2026). Not reproduced by this codebase. No trained checkpoint exists. No benchmark results are provided anywhere in the repository. |
1.2 README Claims vs. Source Code: Verified Checklist
README Claim | Verdict | Evidence |
| --- | --- | --- |
Prelude + Recurrent Block + Coda three-stage architecture | PASS | OpenMythos.__init__ in main.py confirms structure |
h_{t+1} = A*h_t + B*e + Transformer(h_t,e) update rule | PASS | LTIInjection.forward() implements this exactly |
rho(A) < 1 guaranteed by construction | PASS | A_discrete = exp(-exp(log_dt+log_A).clamp(-20,20)) ensures (0,1) range always |
Flash Attention 2 with transparent fallback | PASS | GQAttention auto-detects flash_attn; fallback to scaled dot-product |
MLA with compressed KV latent cache | PASS | MLAttention caches c_kv (kv_lora_rank) + k_rope only; full K/V not stored |
ACT halting per position | PASS | ACTHalting.forward() correct Graves remainder trick; bounded behavior |
Loop-index RoPE embedding | PASS | loop_index_embedding() injects sinusoidal signal into first dim//8 channels |
Depth-wise LoRA per iteration | PASS | LoRAAdapter: per-loop Embedding(max_loops, rank). Depth extrapolation via clamp. |
Depth extrapolation at inference | PASS | n_loops overridable at generate() time; works correctly |
Router bias for aux-loss-free load balancing | WARN | router_bias exists as nn.Buffer in main.py; update logic entirely absent from training script |
ACT ponder loss for halting training | FAIL | No ponder_loss term anywhere in the training script |
PyTorch FSDP multi-GPU training | PASS | 3b_fine_web_edu.py correctly uses FullyShardedDataParallel with FULL_SHARD |
README training table says DDP; code is FSDP | WARN | Documentation inconsistency. FSDP is more capable but differs from README claim. |
mythos_7b() variant | FAIL | Called in README usage example; does not exist in variants.py; ImportError on use |
torch==2.11.0 (pyproject.toml) | WARN | Version 2.11.0 does not exist on PyPI; pip install will fail from pyproject.toml |
770M achieves 1.3B quality | NOTE | Externally sourced from Parcae paper; not reproduced here |
openai/gpt-oss-20b tokenizer | WARN | Unverifiable on HuggingFace Hub as of April 2026 |
2. Architecture Implementation Audit
2.1 Module-Level Analysis
Module | Status | Paper Reference |
| --- | --- | --- |
RMSNorm | PASS | Zhang & Sennrich (2019) |
RoPE (precompute / apply) | PASS | LLaMA-3 (theta=500000) |
GQAttention | PASS | Ainslie et al. (2023) |
GQAttention edge case | WARN | Dao et al. (2023) |
MLAttention | PASS | DeepSeek-V2 (2024) |
MoEFFN routing logic | PASS | DeepSeek-V3 (2024) |
MoEFFN dispatch | FAIL | N/A |
router_bias update | FAIL | DeepSeek-V3 aux-loss-free |
LTIInjection | PASS | Prairie et al. (2026) |
ACTHalting logic | PASS | Graves (2016) |
ACT ponder loss | FAIL | Graves (2016) |
LoRAAdapter | PASS | Bae et al. (2024) |
RecurrentBlock | PASS | Composite |
RecurrentBlock KV cache | WARN | N/A |
OpenMythos full model | PASS | Composite |
2.2 Critical Production Blockers (P0 Required Before Training)
The routed expert dispatch in MoEFFN uses a nested Python for-loop over topk and n_experts. At mythos_1b (topk=4, n_experts=64) this executes up to 256 Python-level iterations per forward pass. At mythos_50b (n_experts=256), the ceiling is 1,024 iterations. This is not a micro-optimization issue — it is a fundamental throughput ceiling that renders GPU acceleration ineffective.
```python
# current implementation (research-grade only)
for i in range(self.topk):
    for eid in range(self.n_experts):
        mask = expert_ids == eid
        if not mask.any():
            continue
        out[mask] += token_scores[mask] * self.routed_experts[eid](flat[mask])
```
Required fix: replace with scatter_add over a batched expert tensor, or a Triton kernel following the pattern in Megablocks or vLLM's MoE implementation.
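A minimal sketch of what a scatter-add dispatch can look like (illustrative only: the helper and variable names below are assumed, this is not the MoEFFN source, and a production path would still use Megablocks or a Triton kernel as noted above):

```python
import torch
import torch.nn as nn

def moe_dispatch(flat, expert_ids, token_scores, experts):
    """Vectorized routed-expert dispatch sketch: sort token-expert pairs by
    expert id, run each expert once over a contiguous slice, then scatter-add
    the weighted outputs back. flat: [T, D]; expert_ids/token_scores: [T, topk]."""
    T, topk = expert_ids.shape
    out = torch.zeros_like(flat)
    flat_ids = expert_ids.reshape(-1)                        # [T * topk]
    order = torch.argsort(flat_ids)                          # group pairs by expert
    token_idx = torch.arange(T, device=flat.device).repeat_interleave(topk)[order]
    scores = token_scores.reshape(-1)[order]
    counts = torch.bincount(flat_ids, minlength=len(experts)).tolist()
    start = 0
    for eid, n in enumerate(counts):                         # one step per active expert
        if n:
            idx = token_idx[start:start + n]
            out.index_add_(0, idx, experts[eid](flat[idx]) * scores[start:start + n, None])
        start += n
    return out
```

This removes the topk x n_experts inner product of Python iterations: each expert runs exactly once per forward, and the write-back is a single `index_add_` (scatter-add) per expert.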
The router_bias is declared as an nn.Buffer in main.py (gradient-free, as intended by DeepSeek-V3). However, the training script 3b_fine_web_edu.py contains zero code to update it. DeepSeek-V3's aux-loss-free routing requires a per-step bias adjustment to maintain load balance:
```python
# required per-step update (entirely absent from 3b_fine_web_edu.py)
expert_load = topk_idx_counts / total_tokens   # fraction routed to each expert
target_load = 1.0 / self.n_experts
bias_delta = lr_bias * torch.sign(target_load - expert_load)
moe.router_bias += bias_delta
```
Without this, the router will collapse to a subset of experts within thousands of training steps, making the MoE architecture functionally equivalent to a much smaller dense model.
ACTHalting correctly implements Graves (2016) remainder logic for inference. However, without a ponder loss term in the training objective, the halting probabilities receive no gradient signal and cannot learn when to stop:
```python
# ponder loss required per Graves (2016) Section 3 (absent from training script)
ponder_loss = beta * accumulated_loop_weights.mean()
total_loss = lm_cross_entropy_loss + ponder_loss
```
In practice, without this loss the model will either run max_loop_iters for all inputs, or halt at a constant loop depth determined by initialization, providing none of the variable-compute benefits the README describes.
3. moda.py: Parallel Experimental Architecture
moda.py (1,063 lines) implements Mixture-of-Depths Attention (arXiv 2603.15619) with DeepSeekMoE. It shares zero imports with main.py and functions as a completely independent codebase bundled within the same package.
Dimension | main.py | moda.py |
| --- | --- | --- |
Load balancing | router_bias, aux-loss-free (DeepSeek-V3) | balance_loss (DeepSeekMoE section 3.3) — incompatible approach |
RoPE implementation | Complex phasor (view_as_complex) | cos/sin decomposition (RotaryEmbedding class) — same result, different code path |
Attention mechanism | GQA or MLA (switchable via cfg.attn_type) | MoDA: causal KV + cross-layer depth KV in single softmax |
Integration with main.py | N/A | None — zero import links, no shared interface contract |
Independently runnable | Yes | Yes (moda_example.py smoke test passes) |
Training support | 3b_fine_web_edu.py (FSDP) | No training script provided |
Diagnostic: moda.py is a parallel research experiment, not a component of the primary architecture. Evolution_Action_Plan.md references 'MLA architecture consideration' but no integration code exists. The two architectures define conflicting load balancing schemes and would require substantial interface design work to unify.
4. Training Script Audit
Component | Status | Notes |
| --- | --- | --- |
FSDP configuration | PASS | FULL_SHARD + TransformerBlock wrapping. Correct use of FSDP.clip_grad_norm_ for distributed global norm. |
Mixed precision | PASS | BF16 for H100/A100; float16 + GradScaler fallback for older GPUs. Correctly handled. |
Cosine LR schedule | PASS | Linear warmup (2000 steps) -> cosine decay. Standard and correct. |
Gradient accumulation | PASS | no_sync context prevents redundant all-reduce during accumulation steps. |
Checkpoint save/load | PASS | FSDP.state_dict_type context for distributed state gathering. Resume logic correct. |
DataLoader sharding | PASS | FineWeb-Edu streaming with per-rank shard splitting. |
router_bias update | FAIL | Entirely absent. Routing collapse will occur under sustained training. |
ACT ponder loss | FAIL | Entirely absent. Halting mechanism receives no gradient signal. |
Tokenizer model ID | WARN | openai/gpt-oss-20b unverifiable on HuggingFace Hub as of April 2026. |
README says DDP; code is FSDP | WARN | Documentation mismatch. Actual parallelism is FSDP, which differs from README table. |
5. Variants and Dependencies
Variant | dim | Loops | Experts | Context | Memory Risk |
| --- | --- | --- | --- | --- | --- |
mythos_1b | 2048 | 16 | 64 | 4K | None |
mythos_3b | 3072 | 16 | 64 | 4K | None |
mythos_10b | 4096 | 24 | 128 | 8K | Low |
mythos_50b | 6144 | 32 | 256 | 8K | Medium |
mythos_100b | 8192 | 32 | 256 | 1M | [FAIL] 16GB+ RoPE buffer at init |
mythos_500b | 12288 | 48 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
mythos_1t | 16384 | 64 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
Note: mythos_100b/500b/1t set max_seq_len=1,000,000. precompute_rope_freqs allocates this as an eager register_buffer, creating a ~16GB tensor at initialization that transfers to GPU on model.to(device). Lazy or on-the-fly RoPE computation is required before these variants are usable.
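One way to make the large variants initializable is to build the rotary phasors per call instead of eagerly. A minimal sketch of the lazy approach (class and method names are hypothetical, not the repository's `precompute_rope_freqs` API):

```python
import torch

class LazyRoPE:
    """On-the-fly RoPE frequency computation: the only persistent state is a
    head_dim/2 inverse-frequency vector, so nothing scales with max_seq_len.
    Phasors for the positions actually seen are built per call rather than
    living in a multi-gigabyte eager buffer."""
    def __init__(self, head_dim: int, theta: float = 500000.0):
        self.inv_freq = 1.0 / theta ** (torch.arange(0, head_dim, 2).float() / head_dim)

    def phasors(self, positions: torch.Tensor) -> torch.Tensor:
        # [len(positions), head_dim // 2] unit-magnitude complex phasors
        angles = positions.float()[:, None] * self.inv_freq.to(positions.device)[None, :]
        return torch.polar(torch.ones_like(angles), angles)
```

A decode step at position p then calls `phasors(torch.tensor([p]))`; a caller can also cache a window of recent positions if rebuild cost matters.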
Package | Status | Issue |
| --- | --- | --- |
torch>=2.1.0 (requirements.txt) | PASS | Correct lower bound |
torch>=2.11.0 (training/requirements.txt) | WARN | Version 2.11.0 does not exist on PyPI |
torch==2.11.0 (pyproject.toml) | WARN | Hard pin to non-existent version; pip install from pyproject.toml will fail |
transformers>=4.40.0 | PASS | Reasonable lower bound |
datasets>=2.18.0 | PASS | Correct for FineWeb-Edu streaming |
flash-attn>=2.8.3 | NOTE | Optional, CUDA-only. Fallback path functions without it. |
pytest>=8.1.1 | PASS | Test dependencies correctly separated |
6. Applicability to General AI Reasoning Development
Setting aside the Claude Mythos narrative entirely, this section evaluates OpenMythos as a research engineering artifact with potential applicability to developers building general-purpose reasoning systems. The assessment is based on extractability, correctness, and adoption cost of individual components.
6.1 High-Value Extractable Components
LTIInjection: Contractive State Management for Iterative Systems
LTIInjection is the strongest standalone contribution in the codebase. The parameterization of A in log-space with an exponential map guarantees spectral radius < 1 by construction — independent of learning rate or gradient noise. This pattern is directly applicable to:
- Agent loop architectures maintaining state across tool calls or environment steps
- Iterative refinement pipelines (draft-critique-revise) where state explosion is a failure mode
- Any persistent memory system requiring formal stability guarantees rather than empirical tuning
The key property: stability is architectural, not hyperparameter-dependent. Gradient clipping and norm regularization provide weaker, conditional guarantees. LTIInjection's guarantee holds at lr=1000, verified by the test suite.
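The pattern is small enough to sketch in full. The module below is an illustrative reimplementation of the idea with assumed names, not the actual LTIInjection code:

```python
import torch
import torch.nn as nn

class ContractiveState(nn.Module):
    """Log-space parameterization sketch: exp(-exp(x)) maps any real
    parameter into (0, 1), so the diagonal transition is contractive no
    matter what values the optimizer drives log_A and log_dt to."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_A = nn.Parameter(torch.zeros(dim))
        self.log_dt = nn.Parameter(torch.full((dim,), -3.0))

    def decay(self) -> torch.Tensor:
        # always in (0, 1): the clamp only guards exp overflow/underflow
        return torch.exp(-torch.exp((self.log_dt + self.log_A).clamp(-20, 20)))

    def forward(self, h: torch.Tensor, update: torch.Tensor) -> torch.Tensor:
        return self.decay() * h + update
```

Because `decay()` lies in (0, 1) by construction, repeated application with bounded updates converges to a bounded fixed point, which is exactly the "stability is architectural" property described above.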
MLAttention: KV Cache Compression for Long-Context Reasoning
MLAttention's compression of the KV cache to a low-rank latent (kv_lora_rank) is a validated technique from DeepSeek-V2. For long-context reasoning systems, the memory reduction is meaningful: instead of caching full K and V tensors, only the compressed latent c_kv and positional k_rope are stored per token. The implementation is paper-accurate, test-verified, and extractable with minimal modification. Developers building custom inference engines can use this module directly.
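The savings are easy to estimate with back-of-envelope arithmetic. The helper below is an illustrative formula with assumed dimension names, not a measurement from this codebase:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim,
                   kv_lora_rank, rope_dim, dtype_bytes=2):
    """Compare full K/V caching against MLA-style caching of the compressed
    latent c_kv plus the decoupled RoPE key, per the scheme described above."""
    full = seq_len * n_layers * 2 * n_heads * head_dim * dtype_bytes
    mla = seq_len * n_layers * (kv_lora_rank + rope_dim) * dtype_bytes
    return full, mla
```

At hypothetical dimensions of 32 layers, 32 heads of size 128, `kv_lora_rank=512`, `rope_dim=64`, and a 4K context in BF16, this gives 2 GiB for full K/V versus 144 MiB for the latent cache, roughly a 14x reduction.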
RecurrentBlock + ACT: Variable Compute Allocation
The RecurrentBlock with ACT halting provides a clean mechanism for allocating variable compute per input within a single batch. For reasoning systems, this directly maps to a useful property: simple queries terminate early, multi-step reasoning problems run more loops. The mechanism is correctly implemented.
Practical adoption requires one addition: the ponder loss in the training objective (approximately 5 lines following Graves 2016, Section 3). Once added, this module provides a trainable, differentiable compute-budget controller without requiring architectural changes.
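For readers unfamiliar with the Graves mechanism, the sketch below shows where the accumulated loop weights penalized by the ponder loss come from. `step_fn` and `halt_fn` are hypothetical stand-ins for the recurrent block and halting head; this is not the ACTHalting source:

```python
import torch

def act_loop(step_fn, halt_fn, h, max_loops: int, eps: float = 0.01):
    """Per-position ACT accumulation (Graves 2016 remainder trick).
    Returns the halting-weighted state and the ponder quantity N + R
    that the missing ponder loss term would penalize."""
    B, T, _ = h.shape
    acc_p = torch.zeros(B, T)        # cumulative halting probability
    remainders = torch.zeros(B, T)
    n_updates = torch.zeros(B, T)
    running = torch.ones(B, T)
    out = torch.zeros_like(h)
    for i in range(max_loops):
        h = step_fn(h)
        p = halt_fn(h).squeeze(-1)
        crossed = (acc_p + p > 1 - eps).float()
        if i == max_loops - 1:
            crossed = torch.ones_like(crossed)   # force-halt at the loop budget
        new_halted = crossed * running
        still = running - new_halted
        # remainder trick: halting positions get weight 1 - acc_p, not p
        w = p * still + (1 - acc_p) * new_halted
        remainders = remainders + (1 - acc_p) * new_halted
        n_updates = n_updates + running
        acc_p = acc_p + w
        out = out + w.unsqueeze(-1) * h
        running = still
        if not running.any():
            break
    return out, n_updates + remainders
```

Taking `(n_updates + remainders).mean()` as the ponder term and scaling it by beta reproduces the 5-line fix described above: gradient pressure toward fewer loops, opposed by the language-modeling loss.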
LoRAAdapter: Low-Overhead Loop Differentiation
The per-loop scale Embedding(max_loops, rank) breaks the symmetry of weight-tied iterations at negligible parameter cost (128 scalars at rank=8, max_loops=16). Any architecture using weight sharing or parameter-efficient fine-tuning can adapt this pattern to enable loop-differentiated behavior without full layer duplication.
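A minimal sketch of the pattern (a hypothetical module patterned on the description above, not the LoRAAdapter source):

```python
import torch
import torch.nn as nn

class LoopLoRA(nn.Module):
    """Per-loop symmetry breaking: one rank-r scale vector per loop index
    lets weight-tied iterations behave depth-dependently for only
    max_loops * rank extra parameters (128 at rank=8, max_loops=16)."""
    def __init__(self, dim: int, rank: int = 8, max_loops: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        self.loop_scale = nn.Embedding(max_loops, rank)
        self.max_loops = max_loops

    def forward(self, x: torch.Tensor, loop_idx: int) -> torch.Tensor:
        # clamping the index is what enables depth extrapolation past max_loops
        idx = min(loop_idx, self.max_loops - 1)
        scale = self.loop_scale(torch.tensor(idx, device=x.device))
        return x + self.up(self.down(x) * scale)
```

Any loop index beyond `max_loops - 1` reuses the last learned scale, mirroring the clamp-based depth extrapolation the audit verified in the checklist.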
6.2 Components Requiring Substantial Work Before Use
Component | Required Work | Effort Estimate |
| --- | --- | --- |
MoEFFN dispatch | Replace nested Python loop with scatter_add or Triton kernel | High: requires CUDA kernel development or Megablocks/vLLM integration |
router_bias training | Add per-step bias update to training loop | Low: ~15 lines following DeepSeek-V3 paper Algorithm 1 |
ACT ponder loss | Add ponder_loss term to training objective | Low: ~5 lines following Graves (2016) Section 3 |
Large-variant RoPE | Replace eager register_buffer with lazy/on-the-fly computation | Medium: refactor precompute_rope_freqs call sites |
Tokenizer | Replace openai/gpt-oss-20b with verifiable model ID | Trivial: one-line change |
moda.py integration | Define interface contract if MoDA is to replace MLA | High: architectural redesign required |
6.3 Recommended Adoption Pattern
For developers who want to leverage specific components without inheriting unresolved issues:
- Use LTIInjection as-is. No modifications required. Stable under all tested conditions.
- Use MLAttention as-is for KV cache compression. Test coverage is sufficient to validate correctness.
- Use RecurrentBlock after adding ponder_loss to training objective. All other logic is correct.
- Do not use MoEFFN in any GPU-scale context until the dispatch loop is rewritten.
- Do not attempt to integrate moda.py with main.py without an explicit interface design phase.
- Replace the tokenizer model ID before running any test that instantiates MythosTokenizer.
7. Final Assessment
Dimension | Status | Assessment |
| --- | --- | --- |
Theoretical accuracy | PASS | RDT, LTI, ACT, MoE, MLA all implemented with paper fidelity across all cited references |
Code quality | PASS | Type hints, docstrings, structured module separation throughout. Above-average for academic codebase. |
Test coverage | PASS | RoPE: 9 invariants. LTI: 4 stability checks. Full model: 8 integration tests. Meaningful coverage. |
Smoke test (small config) | PASS | CPU smoke test passes with dim=256 tiny config |
Training correctness | FAIL | router_bias update and ACT ponder loss both absent. Model will not train as described. |
Production readiness | FAIL | MoE Python loop dispatch is a fundamental throughput barrier for any GPU workload |
Large variant viability | WARN | 100b/500b/1t: ~16GB RoPE buffer allocated at initialization. OOM before first forward pass. |
Documentation accuracy | WARN | mythos_7b() absent; DDP/FSDP mismatch; torch==2.11.0 non-existent on PyPI |
moda.py integration | FAIL | No integration path with main.py. Treat as an independent codebase. |
Tokenizer availability | WARN | openai/gpt-oss-20b unverifiable on HuggingFace Hub as of April 2026 |
💡 What the project is: A high-quality, research-grade implementation of recurrent-depth transformer concepts from peer-reviewed literature. Core modules are mathematically correct and individually verifiable. Test coverage is above average for an academic codebase. The LTIInjection module in particular represents a clean, extractable contribution.

What the project is not: A production-ready model, a verified reconstruction of any Anthropic system, or a codebase that can be trained as the README describes without three non-trivial engineering fixes. The 770M-vs-1.3B efficiency claim is a citation, not a result.

💡 The narrative gap: Community discourse consumes this project as an open-source rebellion against closed AI, a technically grounded narrative frame that substantially overstates what the codebase delivers. The gap between 'algorithmic enthusiasm' and technical reality is measurable and documented in this report. Both the narrative value and the technical limitations are real. They are not the same thing.
7.1 Priority Action Items
Priority | Action | Complexity | Impact |
| --- | --- | --- | --- |
P0 | Implement router_bias per-step update in training loop | Low (~15 lines) | Prevents routing collapse under sustained training |
P0 | Add ACT ponder loss to training objective | Low (~5 lines) | Enables halting mechanism to be learned |
P0 | Replace MoE dispatch Python loop with scatter_add | High (kernel rewrite) | Required for any GPU-scale training throughput |
P1 | Fix torch==2.11.0 pin to >=2.1.0 in pyproject.toml | Trivial | Unblocks pip install from pyproject.toml |
P1 | Add mythos_7b() variant or remove README reference | Trivial | Eliminates ImportError on README example |
P1 | Replace openai/gpt-oss-20b with verifiable tokenizer ID | Trivial | Unblocks all tokenizer tests |
P2 | Implement lazy RoPE for 100b+ variants | Medium | Required for large-variant initialization |
P2 | Define moda.py integration interface or document as standalone | Medium | Clarifies package architecture |
P3 | Correct README parallelism description: DDP -> FSDP | Trivial | Documentation accuracy |
Flamehaven Internal Document | Protocol Re-Genesis Audit v2.0 (EN) | April 24, 2026
Next Step
If your AI system works in demos but still feels fragile, start here.
Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.
Direct founder contact · Response within 1-2 business days