OpenMythos v0.5.0 Code Review - Audit Report


OpenMythos collected thousands of GitHub stars and dominated AI discourse for a week. This is what happens when you actually read the code — and why the people who do always arrive too late to matter.


🔎 Executive Summary

OpenMythos is a theoretically sound, research-grade reconstruction of a Recurrent-Depth Transformer grounded in peer-reviewed literature. Core modules (LTIInjection, MLAttention, RecurrentBlock) are mathematically correct and paper-accurate.
Three production-blocking deficiencies prevent training as described: a Python-loop MoE dispatch that cannot scale to any real GPU workload, an absent router_bias update that will cause routing collapse, and a missing ACT ponder loss that prevents the halting mechanism from being learned.
The 770M-vs-1.3B efficiency claim is externally sourced from Parcae (Prairie et al., 2026) and has not been reproduced by this codebase.

1. Narrative vs. Technical Reality

OpenMythos has achieved viral traction through three interlocking narrative frames. Each frame contains a technically grounded kernel wrapped in a layer of framing that community discourse amplifies beyond what the source code supports.

1.1 The Three Narrative Drivers

| Narrative Frame | Community Reception | Technical Reality |
| --- | --- | --- |
| Claude Mythos reverse-engineering | David-vs-Goliath framing: one developer decoding a closed frontier model drives open-source sentiment | README disclaimer is explicit: "independent, community-driven theoretical reconstruction based solely on publicly available research and speculation." Not a leak. Not a distillation. A hypothesis in code. |
| Silent Reasoning / Latent Thought | "AI that thinks in loops, not tokens" resonates with audiences unfamiliar with transformer internals; perceived as a step change in AI capability | Correctly grounded in Saunshi et al. (2025) and Geiping et al. (2025). The recurrent loop is real and correctly implemented. Whether Anthropic uses this architecture is unverified speculation. |
| 770M achieves 1.3B quality | Parameter-efficiency narrative drives GitHub stars from hardware-constrained developers; positions OpenMythos as democratizing AI | Figure cited verbatim from Parcae (Prairie et al., 2026). Not reproduced by this codebase. No trained checkpoint exists. No benchmark results are provided anywhere in the repository. |

1.2 README Claims vs. Source Code: Verified Checklist

| README Claim | Verdict | Evidence |
| --- | --- | --- |
| Prelude + Recurrent Block + Coda three-stage architecture | PASS | `OpenMythos.__init__` in `main.py` confirms structure |
| `h_{t+1} = A*h_t + B*e + Transformer(h_t, e)` update rule | PASS | `LTIInjection.forward()` implements this exactly |
| `rho(A) < 1` guaranteed by construction | PASS | `A_discrete = exp(-exp(log_dt + log_A).clamp(-20, 20))` ensures the (0, 1) range always |
| Flash Attention 2 with transparent fallback | PASS | `GQAttention` auto-detects `flash_attn`; falls back to scaled dot-product |
| MLA with compressed KV latent cache | PASS | `MLAttention` caches `c_kv` (`kv_lora_rank`) + `k_rope` only; full K/V not stored |
| ACT halting per position | PASS | `ACTHalting.forward()` implements the Graves remainder trick correctly; bounded behavior |
| Loop-index RoPE embedding | PASS | `loop_index_embedding()` injects a sinusoidal signal into the first `dim // 8` channels |
| Depth-wise LoRA per iteration | PASS | `LoRAAdapter`: per-loop `Embedding(max_loops, rank)`; depth extrapolation via clamp |
| Depth extrapolation at inference | PASS | `n_loops` overridable at `generate()` time; works correctly |
| Router bias for aux-loss-free load balancing | WARN | `router_bias` exists as an `nn.Buffer` in `main.py`; update logic entirely absent from the training script |
| ACT ponder loss for halting training | FAIL | No `ponder_loss` term anywhere in the training script |
| PyTorch FSDP multi-GPU training | PASS | `3b_fine_web_edu.py` correctly uses `FullyShardedDataParallel` with `FULL_SHARD` |
| README training table says DDP; code is FSDP | WARN | Documentation inconsistency. FSDP is more capable but differs from the README claim. |
| `mythos_7b()` variant | FAIL | Called in the README usage example; does not exist in `variants.py`; `ImportError` on use |
| `torch==2.11.0` (`pyproject.toml`) | WARN | Version 2.11.0 does not exist on PyPI; `pip install` will fail from `pyproject.toml` |
| 770M achieves 1.3B quality | NOTE | Externally sourced from the Parcae paper; not reproduced here |
| `openai/gpt-oss-20b` tokenizer | WARN | Unverifiable on the HuggingFace Hub as of April 2026 |

2. Architecture Implementation Audit

2.1 Module-Level Analysis

| Module | Status | Paper Reference |
| --- | --- | --- |
| RMSNorm | PASS | Zhang & Sennrich (2019) |
| RoPE (precompute / apply) | PASS | LLaMA-3 (`theta=500000`) |
| GQAttention | PASS | Ainslie et al. (2023) |
| GQAttention edge case | WARN | Dao et al. (2023) |
| MLAttention | PASS | DeepSeek-V2 (2024) |
| MoEFFN routing logic | PASS | DeepSeek-V3 (2024) |
| MoEFFN dispatch | FAIL | N/A |
| router_bias update | FAIL | DeepSeek-V3 aux-loss-free |
| LTIInjection | PASS | Prairie et al. (2026) |
| ACTHalting logic | PASS | Graves (2016) |
| ACT ponder loss | FAIL | Graves (2016) |
| LoRAAdapter | PASS | Bae et al. (2024) |
| RecurrentBlock | PASS | Composite |
| RecurrentBlock KV cache | WARN | N/A |
| OpenMythos full model | PASS | Composite |

2.2 Critical Production Blockers (P0 Required Before Training)

The routed expert dispatch in MoEFFN uses a nested Python for-loop over topk and n_experts. At mythos_1b (topk=4, n_experts=64) this executes up to 256 Python-level iterations per forward pass. At mythos_50b (n_experts=256), the ceiling is 1,024 iterations. This is not a micro-optimization issue — it is a fundamental throughput ceiling that renders GPU acceleration ineffective.
```python
# current implementation (research-grade only)
for i in range(self.topk):
    for eid in range(self.n_experts):
        mask = expert_ids == eid
        if not mask.any():
            continue
        out[mask] += token_scores[mask] * self.routed_experts[eid](flat[mask])
```
Required fix: replace with scatter_add over a batched expert tensor, or a Triton kernel following the pattern in Megablocks or vLLM's MoE implementation.
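One possible shape for that fix, sketched below with illustrative names (this is not the repo's code): sort token-slot pairs by expert id, run one batched call per active expert, and accumulate with `index_add_`. This is the grouped-dispatch pattern that Megablocks-style kernels build on; a Triton kernel would fuse these steps further.

```python
import torch

def moe_dispatch_vectorized(flat, expert_ids, token_scores, experts):
    """Grouped top-k expert dispatch (sketch; names are illustrative).

    flat:         (T, d) flattened token activations
    expert_ids:   (T, k) expert index per token per top-k slot
    token_scores: (T, k) router weight per slot
    experts:      list of per-expert modules, each mapping (n, d) -> (n, d)
    """
    T, d = flat.shape
    k = expert_ids.shape[1]
    out = torch.zeros_like(flat)
    # Sort flattened token-slot pairs by expert id so every expert
    # processes one contiguous batch instead of topk*n_experts mask scans.
    flat_ids = expert_ids.reshape(-1)
    order = torch.argsort(flat_ids)
    token_idx = torch.arange(T, device=flat.device).repeat_interleave(k)[order]
    sorted_ids = flat_ids[order]
    sorted_scores = token_scores.reshape(-1)[order]
    counts = torch.bincount(sorted_ids, minlength=len(experts))
    start = 0
    for eid, n in enumerate(counts.tolist()):
        if n == 0:
            continue
        sel = token_idx[start:start + n]
        # One batched expert call per *active* expert, scattered back additively.
        y = experts[eid](flat[sel]) * sorted_scores[start:start + n, None]
        out.index_add_(0, sel, y)
        start += n
    return out
```

The remaining per-expert loop runs at most once per active expert and wraps a single batched matmul, which is the throughput profile the audit asks for; the naive version instead pays `topk * n_experts` Python-level mask scans regardless of load.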
The router_bias is declared as an nn.Buffer in main.py (gradient-free, as intended by DeepSeek-V3). However, the training script 3b_fine_web_edu.py contains zero code to update it. DeepSeek-V3's aux-loss-free routing requires a per-step bias adjustment to maintain load balance:
```python
# required per-step update (entirely absent from 3b_fine_web_edu.py)
expert_load = topk_idx_counts / total_tokens   # fraction routed to each expert
target_load = 1.0 / self.n_experts
bias_delta = lr_bias * torch.sign(target_load - expert_load)
moe.router_bias += bias_delta
```
Without this, the router will collapse to a subset of experts within thousands of training steps, making the MoE architecture functionally equivalent to a much smaller dense model.
ACTHalting correctly implements Graves (2016) remainder logic for inference. However, without a ponder loss term in the training objective, the halting probabilities receive no gradient signal and cannot learn when to stop:
```python
# ponder loss required per Graves (2016), Section 3 (absent from training script)
ponder_loss = beta * accumulated_loop_weights.mean()
total_loss = lm_cross_entropy_loss + ponder_loss
```
In practice, without this loss the model will either run max_loop_iters for all inputs, or halt at a constant loop depth determined by initialization, providing none of the variable-compute benefits the README describes.

3. moda.py: Parallel Experimental Architecture

moda.py (1,063 lines) implements Mixture-of-Depths Attention (arXiv 2603.15619) with DeepSeekMoE. It shares zero imports with main.py and functions as a completely independent codebase bundled within the same package.
| Dimension | main.py | moda.py |
| --- | --- | --- |
| Load balancing | `router_bias`, aux-loss-free (DeepSeek-V3) | `balance_loss` (DeepSeekMoE, Section 3.3); incompatible approach |
| RoPE implementation | Complex phasor (`view_as_complex`) | cos/sin decomposition (`RotaryEmbedding` class); same result, different code path |
| Attention mechanism | GQA or MLA (switchable via `cfg.attn_type`) | MoDA: causal KV + cross-layer depth KV in a single softmax |
| Integration with main.py | N/A | None; zero import links, no shared interface contract |
| Independently runnable | Yes | Yes (`moda_example.py` smoke test passes) |
| Training support | `3b_fine_web_edu.py` (FSDP) | No training script provided |
Diagnostic: moda.py is a parallel research experiment, not a component of the primary architecture. Evolution_Action_Plan.md references 'MLA architecture consideration' but no integration code exists. The two architectures define conflicting load balancing schemes and would require substantial interface design work to unify.

4. Training Script Audit

| Component | Status | Notes |
| --- | --- | --- |
| FSDP configuration | PASS | `FULL_SHARD` + TransformerBlock wrapping. Correct use of `FSDP.clip_grad_norm_` for the distributed global norm. |
| Mixed precision | PASS | BF16 for H100/A100; float16 + `GradScaler` fallback for older GPUs. Correctly handled. |
| Cosine LR schedule | PASS | Linear warmup (2,000 steps) then cosine decay. Standard and correct. |
| Gradient accumulation | PASS | `no_sync` context prevents redundant all-reduce during accumulation steps. |
| Checkpoint save/load | PASS | `FSDP.state_dict_type` context for distributed state gathering. Resume logic correct. |
| DataLoader sharding | PASS | FineWeb-Edu streaming with per-rank shard splitting. |
| router_bias update | FAIL | Entirely absent. Routing collapse will occur under sustained training. |
| ACT ponder loss | FAIL | Entirely absent. The halting mechanism receives no gradient signal. |
| Tokenizer model ID | WARN | `openai/gpt-oss-20b` unverifiable on the HuggingFace Hub as of April 2026. |
| README says DDP; code is FSDP | WARN | Documentation mismatch. Actual parallelism is FSDP, which differs from the README table. |
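The `no_sync` accumulation pattern credited above can be sketched as follows. The helper names are illustrative, not taken from `3b_fine_web_edu.py`, and the fallback to a no-op context (for modules without `no_sync`, i.e. non-distributed runs) is an addition for self-containment:

```python
import contextlib
import torch

def micro_batches_backward(model, loss_fn, micro_batches):
    """Accumulate gradients over micro-batches, syncing only on the last one.

    FSDP/DDP modules expose no_sync(); inside it, backward() skips the
    gradient all-reduce, so communication happens once per optimizer step.
    """
    no_sync = getattr(model, "no_sync", contextlib.nullcontext)
    for i, (x, y) in enumerate(micro_batches):
        last = i == len(micro_batches) - 1
        with (contextlib.nullcontext() if last else no_sync()):
            # Scale each loss so the accumulated gradient matches one
            # full-batch step (assumes equal-sized micro-batches).
            loss = loss_fn(model(x), y) / len(micro_batches)
            loss.backward()
```

After the loop, the caller clips gradients and steps the optimizer exactly as in a non-accumulated step.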

5. Variants and Dependencies

| Variant | dim | Loops | Experts | Context | Memory Risk |
| --- | --- | --- | --- | --- | --- |
| mythos_1b | 2048 | 16 | 64 | 4K | None |
| mythos_3b | 3072 | 16 | 64 | 4K | None |
| mythos_10b | 4096 | 24 | 128 | 8K | Low |
| mythos_50b | 6144 | 32 | 256 | 8K | Medium |
| mythos_100b | 8192 | 32 | 256 | 1M | [FAIL] 16GB+ RoPE buffer at init |
| mythos_500b | 12288 | 48 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
| mythos_1t | 16384 | 64 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
Note: mythos_100b/500b/1t set max_seq_len=1,000,000. precompute_rope_freqs allocates this as an eager register_buffer, creating a ~16GB tensor at initialization that transfers to GPU on model.to(device). Lazy or on-the-fly RoPE computation is required before these variants are usable.
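A minimal sketch of the lazy alternative, with illustrative names (not the repo's API): compute the RoPE tables on demand and cache only up to the longest sequence actually seen, so a 1M `max_seq_len` config costs nothing at initialization.

```python
import torch

class LazyRoPE(torch.nn.Module):
    """Compute RoPE cos/sin tables on demand (sketch; names are illustrative).

    Only the (head_dim/2,) inverse-frequency vector is allocated eagerly;
    the (seq_len, head_dim/2) tables grow lazily with observed sequence length.
    """

    def __init__(self, head_dim: int, theta: float = 500_000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self._cached_len = 0
        self._cos = self._sin = None

    def forward(self, seq_len: int):
        if seq_len > self._cached_len:
            # Extend the cache only when a longer sequence actually arrives.
            t = torch.arange(seq_len, device=self.inv_freq.device).float()
            freqs = torch.outer(t, self.inv_freq)
            self._cos, self._sin = freqs.cos(), freqs.sin()
            self._cached_len = seq_len
        return self._cos[:seq_len], self._sin[:seq_len]
```

With this pattern, memory scales with the longest batch seen rather than with the configured `max_seq_len`, which is what makes the 100b+ variants initializable.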
| Package | Status | Issue |
| --- | --- | --- |
| `torch>=2.1.0` (requirements.txt) | PASS | Correct lower bound |
| `torch>=2.11.0` (training/requirements.txt) | WARN | Version 2.11.0 does not exist on PyPI |
| `torch==2.11.0` (pyproject.toml) | WARN | Hard pin to a non-existent version; `pip install` from `pyproject.toml` will fail |
| `transformers>=4.40.0` | PASS | Reasonable lower bound |
| `datasets>=2.18.0` | PASS | Correct for FineWeb-Edu streaming |
| `flash-attn>=2.8.3` | NOTE | Optional, CUDA-only. The fallback path functions without it. |
| `pytest>=8.1.1` | PASS | Test dependencies correctly separated |

6. Applicability to General AI Reasoning Development

Setting aside the Claude Mythos narrative entirely, this section evaluates OpenMythos as a research engineering artifact with potential applicability to developers building general-purpose reasoning systems. The assessment is based on extractability, correctness, and adoption cost of individual components.

6.1 High-Value Extractable Components

LTIInjection: Contractive State Management for Iterative Systems
LTIInjection is the strongest standalone contribution in the codebase. The parameterization of A in log-space with an exponential map guarantees spectral radius < 1 by construction — independent of learning rate or gradient noise. This pattern is directly applicable to:
  • Agent loop architectures maintaining state across tool calls or environment steps
  • Iterative refinement pipelines (draft-critique-revise) where state explosion is a failure mode
  • Any persistent memory system requiring formal stability guarantees rather than empirical tuning
The key property: stability is architectural, not hyperparameter-dependent. Gradient clipping and norm regularization provide weaker, conditional guarantees. LTIInjection's guarantee holds at lr=1000, verified by the test suite.
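The property is easy to demonstrate in isolation. The sketch below uses the construction quoted in the Section 1.2 checklist, with the clamp applied to the log-sum (one numerically safe placement; variable names are illustrative, not the repo's): no matter how large the "learned" parameters become, the decay factor stays below 1 and the linear part of the recurrence cannot diverge.

```python
import torch

torch.manual_seed(0)
# Wildly out-of-scale "learned" parameters: 100x normal magnitude.
log_A = torch.randn(8, dtype=torch.float64) * 100.0
log_dt = torch.randn(8, dtype=torch.float64) * 100.0
# exp(-exp(x)) with x clamped: mathematically in (0, 1); floating point can
# underflow the lower end to exactly 0 at the clamp boundary, never reach 1.
A = torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))

# The linear part of the update h <- A*h + e stays bounded under iteration.
h = torch.randn(8, dtype=torch.float64)
e = torch.randn(8, dtype=torch.float64)
for _ in range(10_000):
    h = A * h + e
```

Compare this with an unconstrained diagonal `A`: a single entry drifting above 1 during training would grow the state geometrically in loop depth, which is exactly the failure mode the log-space map rules out by construction.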
MLAttention: KV Cache Compression for Long-Context Reasoning
MLAttention's compression of the KV cache to a low-rank latent (kv_lora_rank) is a validated technique from DeepSeek-V2. For long-context reasoning systems, the memory reduction is meaningful: instead of caching full K and V tensors, only the compressed latent c_kv and positional k_rope are stored per token. The implementation is paper-accurate, test-verified, and extractable with minimal modification. Developers building custom inference engines can use this module directly.
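A back-of-the-envelope comparison shows why this matters at long context. The numbers below are illustrative assumptions for a 4096-dim model, not configs taken from the repo:

```python
# Per-token, per-layer cache width: a conventional cache stores full K and V;
# MLA stores only the compressed latent c_kv plus the decoupled RoPE key.
dim, kv_lora_rank, rope_head_dim = 4096, 512, 64   # assumed sizes

full_kv_per_token = 2 * dim                        # K + V values
mla_per_token = kv_lora_rank + rope_head_dim       # c_kv + k_rope values

ratio = full_kv_per_token / mla_per_token          # roughly 14x fewer cached values
```

At 128K context the same ratio applies to the whole cache, which is the difference between a cache that fits on one accelerator and one that does not.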
RecurrentBlock + ACT: Variable Compute Allocation
The RecurrentBlock with ACT halting provides a clean mechanism for allocating variable compute per input within a single batch. For reasoning systems, this directly maps to a useful property: simple queries terminate early, multi-step reasoning problems run more loops. The mechanism is correctly implemented.
Practical adoption requires one addition: the ponder loss in the training objective (approximately 5 lines following Graves 2016, Section 3). Once added, this module provides a trainable, differentiable compute-budget controller without requiring architectural changes.
LoRAAdapter: Low-Overhead Loop Differentiation
The per-loop scale Embedding(max_loops, rank) breaks the symmetry of weight-tied iterations at negligible parameter cost (128 scalars at rank=8, max_loops=16). Any architecture using weight sharing or parameter-efficient fine-tuning can adapt this pattern to enable loop-differentiated behavior without full layer duplication.
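A minimal sketch of the pattern, with hypothetical names (not the repo's class), including the clamp-based depth extrapolation mentioned in the checklist:

```python
import torch

class PerLoopLoRAScale(torch.nn.Module):
    """One rank-sized scale vector per loop index (sketch; names illustrative).

    Breaks the symmetry of weight-tied iterations at negligible cost:
    max_loops * rank parameters total (128 at the defaults below).
    """

    def __init__(self, max_loops: int = 16, rank: int = 8):
        super().__init__()
        self.max_loops = max_loops
        self.scale = torch.nn.Embedding(max_loops, rank)

    def forward(self, loop_idx: int) -> torch.Tensor:
        # Depth extrapolation: indices past the trained depth reuse the last row.
        idx = min(loop_idx, self.max_loops - 1)
        return self.scale(torch.tensor(idx))
```

The returned vector would scale a shared LoRA update, so each iteration of the weight-tied block gets its own low-rank perturbation without duplicating any layer.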

6.2 Components Requiring Substantial Work Before Use

| Component | Required Work | Effort Estimate |
| --- | --- | --- |
| MoEFFN dispatch | Replace the nested Python loop with `scatter_add` or a Triton kernel | High: requires CUDA kernel development or Megablocks/vLLM integration |
| router_bias training | Add a per-step bias update to the training loop | Low: ~15 lines following DeepSeek-V3 Algorithm 1 |
| ACT ponder loss | Add a `ponder_loss` term to the training objective | Low: ~5 lines following Graves (2016), Section 3 |
| Large-variant RoPE | Replace the eager `register_buffer` with lazy/on-the-fly computation | Medium: refactor `precompute_rope_freqs` call sites |
| Tokenizer | Replace `openai/gpt-oss-20b` with a verifiable model ID | Trivial: one-line change |
| moda.py integration | Define an interface contract if MoDA is to replace MLA | High: architectural redesign required |

6.3 Recommended Adoption Pattern

For developers who want to leverage specific components without inheriting unresolved issues:
  • Use LTIInjection as-is. No modifications required. Stable under all tested conditions.
  • Use MLAttention as-is for KV cache compression. Test coverage is sufficient to validate correctness.
  • Use RecurrentBlock after adding ponder_loss to training objective. All other logic is correct.
  • Do not use MoEFFN in any GPU-scale context until the dispatch loop is rewritten.
  • Do not attempt to integrate moda.py with main.py without an explicit interface design phase.
  • Replace the tokenizer model ID before running any test that instantiates MythosTokenizer.

7. Final Assessment

| Dimension | Status | Assessment |
| --- | --- | --- |
| Theoretical accuracy | PASS | RDT, LTI, ACT, MoE, MLA all implemented with paper fidelity across all cited references |
| Code quality | PASS | Type hints, docstrings, structured module separation throughout. Above average for an academic codebase. |
| Test coverage | PASS | RoPE: 9 invariants. LTI: 4 stability checks. Full model: 8 integration tests. Meaningful coverage. |
| Smoke test (small config) | PASS | CPU smoke test passes with a `dim=256` tiny config |
| Training correctness | FAIL | `router_bias` update and ACT ponder loss both absent. The model will not train as described. |
| Production readiness | FAIL | The MoE Python-loop dispatch is a fundamental throughput barrier for any GPU workload |
| Large variant viability | WARN | 100b/500b/1t: a ~16GB RoPE buffer is allocated at initialization. OOM before the first forward pass. |
| Documentation accuracy | WARN | `mythos_7b()` absent; DDP/FSDP mismatch; `torch==2.11.0` non-existent on PyPI |
| moda.py integration | FAIL | No integration path with `main.py`. Treat as an independent codebase. |
| Tokenizer availability | WARN | `openai/gpt-oss-20b` unverifiable on the HuggingFace Hub as of April 2026 |
What the project is: A high-quality, research-grade implementation of recurrent-depth transformer concepts from peer-reviewed literature. Core modules are mathematically correct and individually verifiable. Test coverage is above average for an academic codebase. The LTIInjection module in particular represents a clean, extractable contribution.
What the project is not: A production-ready model, a verified reconstruction of any Anthropic system, or a codebase that can be trained as the README describes without three non-trivial engineering fixes. The 770M-vs-1.3B efficiency claim is a citation, not a result.
The narrative gap: Community discourse consumes this project as an open-source rebellion against closed AI, a technically grounded narrative frame that substantially overstates what the codebase delivers. The gap between "algorithmic enthusiasm" and technical reality is measurable and documented in this report. Both the narrative value and the technical limitations are real. They are not the same thing.

7.1 Priority Action Items

| Priority | Action | Complexity | Impact |
| --- | --- | --- | --- |
| P0 | Implement the `router_bias` per-step update in the training loop | Low (~15 lines) | Prevents routing collapse under sustained training |
| P0 | Add the ACT ponder loss to the training objective | Low (~5 lines) | Enables the halting mechanism to be learned |
| P0 | Replace the MoE dispatch Python loop with `scatter_add` | High (kernel rewrite) | Required for any GPU-scale training throughput |
| P1 | Fix the `torch==2.11.0` pin to `>=2.1.0` in `pyproject.toml` | Trivial | Unblocks `pip install` from `pyproject.toml` |
| P1 | Add a `mythos_7b()` variant or remove the README reference | Trivial | Eliminates the `ImportError` on the README example |
| P1 | Replace `openai/gpt-oss-20b` with a verifiable tokenizer ID | Trivial | Unblocks all tokenizer tests |
| P2 | Implement lazy RoPE for 100b+ variants | Medium | Required for large-variant initialization |
| P2 | Define a moda.py integration interface or document it as standalone | Medium | Clarifies the package architecture |
| P3 | Correct the README parallelism description: DDP -> FSDP | Trivial | Documentation accuracy |
Flamehaven Internal Document | Protocol Re-Genesis Audit v2.0 (EN) | April 24, 2026

