OpenMythos v0.5.0 Code Review - Audit Report


OpenMythos collected thousands of GitHub stars and dominated AI discourse for a week. This is what happens when you actually read the code — and why the people who do always arrive too late to matter.


🔎 Executive Summary

OpenMythos is a theoretically sound, research-grade reconstruction of a Recurrent-Depth Transformer grounded in peer-reviewed literature. Core modules (LTIInjection, MLAttention, RecurrentBlock) are mathematically correct and paper-accurate.
Three production-blocking deficiencies prevent training as described: a Python-loop MoE dispatch that cannot scale to any real GPU workload, an absent router_bias update that will cause routing collapse, and a missing ACT ponder loss that prevents the halting mechanism from being learned.
The 770M-vs-1.3B efficiency claim is externally sourced from Parcae (Prairie et al., 2026) and has not been reproduced by this codebase.

1. Narrative vs. Technical Reality

OpenMythos has achieved viral traction through three interlocking narrative frames. Each frame contains a technically grounded kernel wrapped in a layer of framing that community discourse amplifies beyond what the source code supports.

1.1 The Three Narrative Drivers

| Narrative Frame | Community Reception | Technical Reality |
| --- | --- | --- |
| Claude Mythos reverse-engineering | David-vs-Goliath framing: one developer decoding a closed frontier model drives open-source sentiment | README disclaimer is explicit: "independent, community-driven theoretical reconstruction based solely on publicly available research and speculation." Not a leak. Not a distillation. A hypothesis in code. |
| Silent Reasoning / Latent Thought | "AI that thinks in loops, not tokens" resonates with audiences unfamiliar with transformer internals; perceived as a step change in AI capability | Correctly grounded in Saunshi et al. (2025) and Geiping et al. (2025). The recurrent loop is real and correctly implemented. Whether Anthropic uses this architecture is unverified speculation. |
| 770M achieves 1.3B quality | Parameter-efficiency narrative drives GitHub stars from hardware-constrained developers; positions OpenMythos as democratizing AI | Figure cited verbatim from Parcae (Prairie et al., 2026). Not reproduced by this codebase. No trained checkpoint exists. No benchmark results are provided anywhere in the repository. |

1.2 README Claims vs. Source Code: Verified Checklist

| README Claim | Verdict | Evidence |
| --- | --- | --- |
| Prelude + Recurrent Block + Coda three-stage architecture | PASS | `OpenMythos.__init__` in `main.py` confirms structure |
| `h_{t+1} = A*h_t + B*e + Transformer(h_t, e)` update rule | PASS | `LTIInjection.forward()` implements this exactly |
| `rho(A) < 1` guaranteed by construction | PASS | `A_discrete = exp(-exp(log_dt + log_A).clamp(-20, 20))` ensures the (0, 1) range always |
| Flash Attention 2 with transparent fallback | PASS | `GQAttention` auto-detects `flash_attn`; falls back to scaled dot-product |
| MLA with compressed KV latent cache | PASS | `MLAttention` caches `c_kv` (`kv_lora_rank`) + `k_rope` only; full K/V not stored |
| ACT halting per position | PASS | `ACTHalting.forward()` implements the Graves remainder trick correctly; bounded behavior |
| Loop-index RoPE embedding | PASS | `loop_index_embedding()` injects a sinusoidal signal into the first `dim // 8` channels |
| Depth-wise LoRA per iteration | PASS | `LoRAAdapter`: per-loop `Embedding(max_loops, rank)`; depth extrapolation via clamp |
| Depth extrapolation at inference | PASS | `n_loops` overridable at `generate()` time; works correctly |
| Router bias for aux-loss-free load balancing | WARN | `router_bias` exists as an `nn.Buffer` in `main.py`; update logic entirely absent from the training script |
| ACT ponder loss for halting training | FAIL | No `ponder_loss` term anywhere in the training script |
| PyTorch FSDP multi-GPU training | PASS | `3b_fine_web_edu.py` correctly uses `FullyShardedDataParallel` with `FULL_SHARD` |
| README training table says DDP; code is FSDP | WARN | Documentation inconsistency. FSDP is more capable but differs from the README claim. |
| `mythos_7b()` variant | FAIL | Called in the README usage example; does not exist in `variants.py`; `ImportError` on use |
| `torch==2.11.0` (`pyproject.toml`) | WARN | Version 2.11.0 does not exist on PyPI; `pip install` will fail from `pyproject.toml` |
| 770M achieves 1.3B quality | NOTE | Externally sourced from the Parcae paper; not reproduced here |
| `openai/gpt-oss-20b` tokenizer | WARN | Unverifiable on the HuggingFace Hub as of April 2026 |

2. Architecture Implementation Audit

2.1 Module-Level Analysis

| Module | Status | Paper Reference |
| --- | --- | --- |
| RMSNorm | PASS | Zhang & Sennrich (2019) |
| RoPE (precompute / apply) | PASS | LLaMA-3 (`theta=500000`) |
| GQAttention | PASS | Ainslie et al. (2023) |
| GQAttention edge case | WARN | Dao et al. (2023) |
| MLAttention | PASS | DeepSeek-V2 (2024) |
| MoEFFN routing logic | PASS | DeepSeek-V3 (2024) |
| MoEFFN dispatch | FAIL | N/A |
| router_bias update | FAIL | DeepSeek-V3 aux-loss-free |
| LTIInjection | PASS | Prairie et al. (2026) |
| ACTHalting logic | PASS | Graves (2016) |
| ACT ponder loss | FAIL | Graves (2016) |
| LoRAAdapter | PASS | Bae et al. (2024) |
| RecurrentBlock | PASS | Composite |
| RecurrentBlock KV cache | WARN | N/A |
| OpenMythos full model | PASS | Composite |

2.2 Critical Production Blockers (P0 Required Before Training)

The routed expert dispatch in MoEFFN uses a nested Python for-loop over topk and n_experts. At mythos_1b (topk=4, n_experts=64) this executes up to 256 Python-level iterations per forward pass. At mythos_50b (n_experts=256), the ceiling is 1,024 iterations. This is not a micro-optimization issue — it is a fundamental throughput ceiling that renders GPU acceleration ineffective.
```python
# current implementation (research-grade only)
for i in range(self.topk):
    for eid in range(self.n_experts):
        mask = expert_ids == eid
        if not mask.any():
            continue
        out[mask] += token_scores[mask] * self.routed_experts[eid](flat[mask])
```
Required fix: replace with scatter_add over a batched expert tensor, or a Triton kernel following the pattern in Megablocks or vLLM's MoE implementation.
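One possible shape for that fix, sketched below with illustrative names (this is not the repo's code): sort token-slot pairs by expert id, run one batched call per active expert, and accumulate with `index_add_`. This is the grouped-dispatch pattern that Megablocks-style kernels build on; a Triton kernel would fuse these steps further.

```python
import torch

def moe_dispatch_vectorized(flat, expert_ids, token_scores, experts):
    """Grouped top-k expert dispatch (sketch; names are illustrative).

    flat:         (T, d) flattened token activations
    expert_ids:   (T, k) expert index per token per top-k slot
    token_scores: (T, k) router weight per slot
    experts:      list of per-expert modules, each mapping (n, d) -> (n, d)
    """
    T, d = flat.shape
    k = expert_ids.shape[1]
    out = torch.zeros_like(flat)
    # Sort flattened token-slot pairs by expert id so every expert
    # processes one contiguous batch instead of topk*n_experts mask scans.
    flat_ids = expert_ids.reshape(-1)
    order = torch.argsort(flat_ids)
    token_idx = torch.arange(T, device=flat.device).repeat_interleave(k)[order]
    sorted_ids = flat_ids[order]
    sorted_scores = token_scores.reshape(-1)[order]
    counts = torch.bincount(sorted_ids, minlength=len(experts))
    start = 0
    for eid, n in enumerate(counts.tolist()):
        if n == 0:
            continue
        sel = token_idx[start:start + n]
        # One batched expert call per *active* expert, scattered back additively.
        y = experts[eid](flat[sel]) * sorted_scores[start:start + n, None]
        out.index_add_(0, sel, y)
        start += n
    return out
```

The remaining per-expert loop runs at most once per active expert and wraps a single batched matmul, which is the throughput profile the audit asks for; the naive version instead pays `topk * n_experts` Python-level mask scans regardless of load.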
The router_bias is declared as an nn.Buffer in main.py (gradient-free, as intended by DeepSeek-V3). However, the training script 3b_fine_web_edu.py contains zero code to update it. DeepSeek-V3's aux-loss-free routing requires a per-step bias adjustment to maintain load balance:
```python
# required per-step update (entirely absent from 3b_fine_web_edu.py)
expert_load = topk_idx_counts / total_tokens   # fraction routed to each expert
target_load = 1.0 / self.n_experts
bias_delta = lr_bias * torch.sign(target_load - expert_load)
moe.router_bias += bias_delta
```
Without this, the router will collapse to a subset of experts within thousands of training steps, making the MoE architecture functionally equivalent to a much smaller dense model.
ACTHalting correctly implements Graves (2016) remainder logic for inference. However, without a ponder loss term in the training objective, the halting probabilities receive no gradient signal and cannot learn when to stop:
```python
# ponder loss required per Graves (2016), Section 3 (absent from training script)
ponder_loss = beta * accumulated_loop_weights.mean()
total_loss = lm_cross_entropy_loss + ponder_loss
```
In practice, without this loss the model will either run max_loop_iters for all inputs, or halt at a constant loop depth determined by initialization, providing none of the variable-compute benefits the README describes.

3. moda.py: Parallel Experimental Architecture

moda.py (1,063 lines) implements Mixture-of-Depths Attention (arXiv 2603.15619) with DeepSeekMoE. It shares zero imports with main.py and functions as a completely independent codebase bundled within the same package.
| Dimension | main.py | moda.py |
| --- | --- | --- |
| Load balancing | `router_bias`, aux-loss-free (DeepSeek-V3) | `balance_loss` (DeepSeekMoE, Section 3.3); incompatible approach |
| RoPE implementation | Complex phasor (`view_as_complex`) | cos/sin decomposition (`RotaryEmbedding` class); same result, different code path |
| Attention mechanism | GQA or MLA (switchable via `cfg.attn_type`) | MoDA: causal KV + cross-layer depth KV in a single softmax |
| Integration with main.py | N/A | None; zero import links, no shared interface contract |
| Independently runnable | Yes | Yes (`moda_example.py` smoke test passes) |
| Training support | `3b_fine_web_edu.py` (FSDP) | No training script provided |
Diagnostic: moda.py is a parallel research experiment, not a component of the primary architecture. Evolution_Action_Plan.md references 'MLA architecture consideration' but no integration code exists. The two architectures define conflicting load balancing schemes and would require substantial interface design work to unify.

4. Training Script Audit

| Component | Status | Notes |
| --- | --- | --- |
| FSDP configuration | PASS | `FULL_SHARD` + TransformerBlock wrapping. Correct use of `FSDP.clip_grad_norm_` for the distributed global norm. |
| Mixed precision | PASS | BF16 for H100/A100; float16 + `GradScaler` fallback for older GPUs. Correctly handled. |
| Cosine LR schedule | PASS | Linear warmup (2,000 steps) then cosine decay. Standard and correct. |
| Gradient accumulation | PASS | `no_sync` context prevents redundant all-reduce during accumulation steps. |
| Checkpoint save/load | PASS | `FSDP.state_dict_type` context for distributed state gathering. Resume logic correct. |
| DataLoader sharding | PASS | FineWeb-Edu streaming with per-rank shard splitting. |
| router_bias update | FAIL | Entirely absent. Routing collapse will occur under sustained training. |
| ACT ponder loss | FAIL | Entirely absent. The halting mechanism receives no gradient signal. |
| Tokenizer model ID | WARN | `openai/gpt-oss-20b` unverifiable on the HuggingFace Hub as of April 2026. |
| README says DDP; code is FSDP | WARN | Documentation mismatch. Actual parallelism is FSDP, which differs from the README table. |
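The `no_sync` accumulation pattern credited above can be sketched as follows. The helper names are illustrative, not taken from `3b_fine_web_edu.py`, and the fallback to a no-op context (for modules without `no_sync`, i.e. non-distributed runs) is an addition for self-containment:

```python
import contextlib
import torch

def micro_batches_backward(model, loss_fn, micro_batches):
    """Accumulate gradients over micro-batches, syncing only on the last one.

    FSDP/DDP modules expose no_sync(); inside it, backward() skips the
    gradient all-reduce, so communication happens once per optimizer step.
    """
    no_sync = getattr(model, "no_sync", contextlib.nullcontext)
    for i, (x, y) in enumerate(micro_batches):
        last = i == len(micro_batches) - 1
        with (contextlib.nullcontext() if last else no_sync()):
            # Scale each loss so the accumulated gradient matches one
            # full-batch step (assumes equal-sized micro-batches).
            loss = loss_fn(model(x), y) / len(micro_batches)
            loss.backward()
```

After the loop, the caller clips gradients and steps the optimizer exactly as in a non-accumulated step.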

5. Variants and Dependencies

| Variant | dim | Loops | Experts | Context | Memory Risk |
| --- | --- | --- | --- | --- | --- |
| mythos_1b | 2048 | 16 | 64 | 4K | None |
| mythos_3b | 3072 | 16 | 64 | 4K | None |
| mythos_10b | 4096 | 24 | 128 | 8K | Low |
| mythos_50b | 6144 | 32 | 256 | 8K | Medium |
| mythos_100b | 8192 | 32 | 256 | 1M | [FAIL] 16GB+ RoPE buffer at init |
| mythos_500b | 12288 | 48 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
| mythos_1t | 16384 | 64 | 512 | 1M | [FAIL] 16GB+ RoPE buffer at init |
Note: mythos_100b/500b/1t set max_seq_len=1,000,000. precompute_rope_freqs allocates this as an eager register_buffer, creating a ~16GB tensor at initialization that transfers to GPU on model.to(device). Lazy or on-the-fly RoPE computation is required before these variants are usable.
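A minimal sketch of the lazy alternative, with illustrative names (not the repo's API): compute the RoPE tables on demand and cache only up to the longest sequence actually seen, so a 1M `max_seq_len` config costs nothing at initialization.

```python
import torch

class LazyRoPE(torch.nn.Module):
    """Compute RoPE cos/sin tables on demand (sketch; names are illustrative).

    Only the (head_dim/2,) inverse-frequency vector is allocated eagerly;
    the (seq_len, head_dim/2) tables grow lazily with observed sequence length.
    """

    def __init__(self, head_dim: int, theta: float = 500_000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)
        self._cached_len = 0
        self._cos = self._sin = None

    def forward(self, seq_len: int):
        if seq_len > self._cached_len:
            # Extend the cache only when a longer sequence actually arrives.
            t = torch.arange(seq_len, device=self.inv_freq.device).float()
            freqs = torch.outer(t, self.inv_freq)
            self._cos, self._sin = freqs.cos(), freqs.sin()
            self._cached_len = seq_len
        return self._cos[:seq_len], self._sin[:seq_len]
```

With this pattern, memory scales with the longest batch seen rather than with the configured `max_seq_len`, which is what makes the 100b+ variants initializable.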
| Package | Status | Issue |
| --- | --- | --- |
| `torch>=2.1.0` (requirements.txt) | PASS | Correct lower bound |
| `torch>=2.11.0` (training/requirements.txt) | WARN | Version 2.11.0 does not exist on PyPI |
| `torch==2.11.0` (pyproject.toml) | WARN | Hard pin to a non-existent version; `pip install` from `pyproject.toml` will fail |
| `transformers>=4.40.0` | PASS | Reasonable lower bound |
| `datasets>=2.18.0` | PASS | Correct for FineWeb-Edu streaming |
| `flash-attn>=2.8.3` | NOTE | Optional, CUDA-only. The fallback path functions without it. |
| `pytest>=8.1.1` | PASS | Test dependencies correctly separated |

6. Applicability to General AI Reasoning Development

Setting aside the Claude Mythos narrative entirely, this section evaluates OpenMythos as a research engineering artifact with potential applicability to developers building general-purpose reasoning systems. The assessment is based on extractability, correctness, and adoption cost of individual components.

6.1 High-Value Extractable Components

LTIInjection: Contractive State Management for Iterative Systems
LTIInjection is the strongest standalone contribution in the codebase. The parameterization of A in log-space with an exponential map guarantees spectral radius < 1 by construction — independent of learning rate or gradient noise. This pattern is directly applicable to:
  • Agent loop architectures maintaining state across tool calls or environment steps
  • Iterative refinement pipelines (draft-critique-revise) where state explosion is a failure mode
  • Any persistent memory system requiring formal stability guarantees rather than empirical tuning
The key property: stability is architectural, not hyperparameter-dependent. Gradient clipping and norm regularization provide weaker, conditional guarantees. LTIInjection's guarantee holds at lr=1000, verified by the test suite.
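The property is easy to demonstrate in isolation. The sketch below uses the construction quoted in the Section 1.2 checklist, with the clamp applied to the log-sum (one numerically safe placement; variable names are illustrative, not the repo's): no matter how large the "learned" parameters become, the decay factor stays below 1 and the linear part of the recurrence cannot diverge.

```python
import torch

torch.manual_seed(0)
# Wildly out-of-scale "learned" parameters: 100x normal magnitude.
log_A = torch.randn(8, dtype=torch.float64) * 100.0
log_dt = torch.randn(8, dtype=torch.float64) * 100.0
# exp(-exp(x)) with x clamped: mathematically in (0, 1); floating point can
# underflow the lower end to exactly 0 at the clamp boundary, never reach 1.
A = torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))

# The linear part of the update h <- A*h + e stays bounded under iteration.
h = torch.randn(8, dtype=torch.float64)
e = torch.randn(8, dtype=torch.float64)
for _ in range(10_000):
    h = A * h + e
```

Compare this with an unconstrained diagonal `A`: a single entry drifting above 1 during training would grow the state geometrically in loop depth, which is exactly the failure mode the log-space map rules out by construction.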
MLAttention: KV Cache Compression for Long-Context Reasoning
MLAttention's compression of the KV cache to a low-rank latent (kv_lora_rank) is a validated technique from DeepSeek-V2. For long-context reasoning systems, the memory reduction is meaningful: instead of caching full K and V tensors, only the compressed latent c_kv and positional k_rope are stored per token. The implementation is paper-accurate, test-verified, and extractable with minimal modification. Developers building custom inference engines can use this module directly.
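A back-of-the-envelope comparison shows why this matters at long context. The numbers below are illustrative assumptions for a 4096-dim model, not configs taken from the repo:

```python
# Per-token, per-layer cache width: a conventional cache stores full K and V;
# MLA stores only the compressed latent c_kv plus the decoupled RoPE key.
dim, kv_lora_rank, rope_head_dim = 4096, 512, 64   # assumed sizes

full_kv_per_token = 2 * dim                        # K + V values
mla_per_token = kv_lora_rank + rope_head_dim       # c_kv + k_rope values

ratio = full_kv_per_token / mla_per_token          # roughly 14x fewer cached values
```

At 128K context the same ratio applies to the whole cache, which is the difference between a cache that fits on one accelerator and one that does not.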
RecurrentBlock + ACT: Variable Compute Allocation
The RecurrentBlock with ACT halting provides a clean mechanism for allocating variable compute per input within a single batch. For reasoning systems, this directly maps to a useful property: simple queries terminate early, multi-step reasoning problems run more loops. The mechanism is correctly implemented.
Practical adoption requires one addition: the ponder loss in the training objective (approximately 5 lines following Graves 2016, Section 3). Once added, this module provides a trainable, differentiable compute-budget controller without requiring architectural changes.
LoRAAdapter: Low-Overhead Loop Differentiation
The per-loop scale Embedding(max_loops, rank) breaks the symmetry of weight-tied iterations at negligible parameter cost (128 scalars at rank=8, max_loops=16). Any architecture using weight sharing or parameter-efficient fine-tuning can adapt this pattern to enable loop-differentiated behavior without full layer duplication.
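A minimal sketch of the pattern, with hypothetical names (not the repo's class), including the clamp-based depth extrapolation mentioned in the checklist:

```python
import torch

class PerLoopLoRAScale(torch.nn.Module):
    """One rank-sized scale vector per loop index (sketch; names illustrative).

    Breaks the symmetry of weight-tied iterations at negligible cost:
    max_loops * rank parameters total (128 at the defaults below).
    """

    def __init__(self, max_loops: int = 16, rank: int = 8):
        super().__init__()
        self.max_loops = max_loops
        self.scale = torch.nn.Embedding(max_loops, rank)

    def forward(self, loop_idx: int) -> torch.Tensor:
        # Depth extrapolation: indices past the trained depth reuse the last row.
        idx = min(loop_idx, self.max_loops - 1)
        return self.scale(torch.tensor(idx))
```

The returned vector would scale a shared LoRA update, so each iteration of the weight-tied block gets its own low-rank perturbation without duplicating any layer.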

6.2 Components Requiring Substantial Work Before Use

| Component | Required Work | Effort Estimate |
| --- | --- | --- |
| MoEFFN dispatch | Replace the nested Python loop with `scatter_add` or a Triton kernel | High: requires CUDA kernel development or Megablocks/vLLM integration |
| router_bias training | Add a per-step bias update to the training loop | Low: ~15 lines following DeepSeek-V3 Algorithm 1 |
| ACT ponder loss | Add a `ponder_loss` term to the training objective | Low: ~5 lines following Graves (2016), Section 3 |
| Large-variant RoPE | Replace the eager `register_buffer` with lazy/on-the-fly computation | Medium: refactor `precompute_rope_freqs` call sites |
| Tokenizer | Replace `openai/gpt-oss-20b` with a verifiable model ID | Trivial: one-line change |
| moda.py integration | Define an interface contract if MoDA is to replace MLA | High: architectural redesign required |

6.3 Recommended Adoption Pattern

For developers who want to leverage specific components without inheriting unresolved issues:
  • Use LTIInjection as-is. No modifications required. Stable under all tested conditions.
  • Use MLAttention as-is for KV cache compression. Test coverage is sufficient to validate correctness.
  • Use RecurrentBlock after adding ponder_loss to training objective. All other logic is correct.
  • Do not use MoEFFN in any GPU-scale context until the dispatch loop is rewritten.
  • Do not attempt to integrate moda.py with main.py without an explicit interface design phase.
  • Replace the tokenizer model ID before running any test that instantiates MythosTokenizer.

7. Final Assessment

| Dimension | Status | Assessment |
| --- | --- | --- |
| Theoretical accuracy | PASS | RDT, LTI, ACT, MoE, MLA all implemented with paper fidelity across all cited references |
| Code quality | PASS | Type hints, docstrings, structured module separation throughout. Above average for an academic codebase. |
| Test coverage | PASS | RoPE: 9 invariants. LTI: 4 stability checks. Full model: 8 integration tests. Meaningful coverage. |
| Smoke test (small config) | PASS | CPU smoke test passes with a `dim=256` tiny config |
| Training correctness | FAIL | `router_bias` update and ACT ponder loss both absent. The model will not train as described. |
| Production readiness | FAIL | The MoE Python-loop dispatch is a fundamental throughput barrier for any GPU workload |
| Large variant viability | WARN | 100b/500b/1t: a ~16GB RoPE buffer is allocated at initialization. OOM before the first forward pass. |
| Documentation accuracy | WARN | `mythos_7b()` absent; DDP/FSDP mismatch; `torch==2.11.0` non-existent on PyPI |
| moda.py integration | FAIL | No integration path with `main.py`. Treat as an independent codebase. |
| Tokenizer availability | WARN | `openai/gpt-oss-20b` unverifiable on the HuggingFace Hub as of April 2026 |
What the project is: A high-quality, research-grade implementation of recurrent-depth transformer concepts from peer-reviewed literature. Core modules are mathematically correct and individually verifiable. Test coverage is above average for an academic codebase. The LTIInjection module in particular represents a clean, extractable contribution.
What the project is not: A production-ready model, a verified reconstruction of any Anthropic system, or a codebase that can be trained as the README describes without three non-trivial engineering fixes. The 770M-vs-1.3B efficiency claim is a citation, not a result.
The narrative gap: Community discourse consumes this project as an open-source rebellion against closed AI, a technically grounded narrative frame that substantially overstates what the codebase delivers. The gap between "algorithmic enthusiasm" and technical reality is measurable and documented in this report. Both the narrative value and the technical limitations are real. They are not the same thing.

7.1 Priority Action Items

| Priority | Action | Complexity | Impact |
| --- | --- | --- | --- |
| P0 | Implement the `router_bias` per-step update in the training loop | Low (~15 lines) | Prevents routing collapse under sustained training |
| P0 | Add the ACT ponder loss to the training objective | Low (~5 lines) | Enables the halting mechanism to be learned |
| P0 | Replace the MoE dispatch Python loop with `scatter_add` | High (kernel rewrite) | Required for any GPU-scale training throughput |
| P1 | Fix the `torch==2.11.0` pin to `>=2.1.0` in `pyproject.toml` | Trivial | Unblocks `pip install` from `pyproject.toml` |
| P1 | Add a `mythos_7b()` variant or remove the README reference | Trivial | Eliminates the `ImportError` on the README example |
| P1 | Replace `openai/gpt-oss-20b` with a verifiable tokenizer ID | Trivial | Unblocks all tokenizer tests |
| P2 | Implement lazy RoPE for 100b+ variants | Medium | Required for large-variant initialization |
| P2 | Define a moda.py integration interface or document it as standalone | Medium | Clarifies the package architecture |
| P3 | Correct the README parallelism description: DDP -> FSDP | Trivial | Documentation accuracy |
Flamehaven Internal Document | Protocol Re-Genesis Audit v2.0 (EN) | April 24, 2026

