
We Made a High-Formality, Fake Physics Slop Artifact - QSOT (Quantum State Over Time) Compiler
A post-mortem on QSOT Compiler v1.2.3, a high-formality AI-generated scientific software artifact that looked rigorous but failed core reproducibility and claim-validation checks.

A Post-Mortem on AI-Native Paper Implementation Gone Wrong
"The most dangerous form of scientific fraud is not the one that looks wrong. It is the one that looks right."
Part 1. The Record

On December 23, 2025, we registered a software release on Zenodo.
DOI: 10.5281/zenodo.18035432
The release notes were thorough. Categorized. [Pub], [UI], [Viz], [AI], [Exp], [Tool], [Doc], [Valid]. Added, Changed, Fixed: the full register of legitimate software engineering. Two companion papers were written in LaTeX: Paper A targeting Computer Physics Communications, Paper B targeting Physical Review A. A project page was constructed in the academic NeurIPS paper-template style, complete with structured data (@type: ScholarlyArticle), author affiliations, a BibTeX block, and a DOI badge.
We believed this was a success.
Six months later, after approximately fifty governance and code experiments logged in the Flamehaven Verification Ledger, with multiple tools in production, we looked back at QSOT Compiler v1.2.3 through the lens of what we now know.
The verdict is unambiguous.
It was High-Formality Slop.
This essay is the formal post-mortem. It is not an exercise in self-flagellation. It is a structural analysis of how this artifact came to exist and why it was convincing even to its own authors.
The failure pattern is also not unique to us: it is the defining challenge of the current moment in AI-native software development.
Part 2. Defining the Term: What Is High-Formality Slop?

The word "slop" in the context of AI-generated software has become a loose colloquialism. We need a precise definition before we can use it to learn anything.
Slop, in the software context, refers to code or documentation that satisfies all surface-level quality signals while failing to satisfy the underlying epistemic contract it claims to fulfill.
This definition has a spectrum. Not all slop is equivalent in danger.
- Level 1 — Empty Placeholder Slop: Functions with `pass` bodies and grandiose names. The easiest to detect; any static analyzer catches it. A junior developer would not ship this.
- Level 2 — Executable-But-Hollow Slop: Code that runs, produces output, passes tests, but whose outputs are disconnected from their stated meaning. The function signature is correct. The types are correct. The test is green. But the logic is a non sequitur relative to the claim. This is the most common product of LLM-assisted development.
- Level 3 — Physically-Plausible-But-Unjustified Slop: Mathematical formulas that are dimensionally consistent, computationally stable, and produce smooth curves, but whose derivation rests on an analogy rather than a theorem. The code looks like physics. It produces numbers that look like physics results. But no physicist signed off on the derivation, because there was no derivation.
- Level 4 — Mixed-Truth Slop (The Hardest): A codebase where some components are genuinely correct and some are fabricated, and the two are woven together so tightly that separating them requires domain expertise at every layer. This is the most dangerous level, not because it is more wrong than Level 1, but because it is harder to falsify. The presence of true components provides legitimacy cover for false ones.
QSOT Compiler v1.2.3 was primarily Levels 2 and 3, with Level 4 characteristics. The axiomatic mathematics underlying QSOT — the Linearity and Conditionability axioms formalized in Lie and Fullwood's Unique multipartite extension of quantum states over time — are real. The Kraus operator formalism is standard quantum information theory [1]. The Transfer Tensor Method is a real algorithm [2].
These are Level 0: genuine foundations. But the implementation that sat on top of those foundations was hollow, and the paper we wrote around it extended genuine concepts into unjustified territory.
High-Formality Slop is specifically Level 2–4 slop dressed in the formal register of scientific publication: LaTeX, DOIs, peer-review response documents, reference lists, and GitHub-hosted datasets. The formality is not decorative. It is load-bearing for the deception, including self-deception.
Part 3. The Ambition That Preceded the Understanding
The QSOT project began with a legitimate scientific stimulus. Research associated with UNIST (Ulsan National Institute of Science and Technology), later circulated as Unique multipartite extension of quantum states over time, proposed treating temporal evolution not as a parameterization of quantum states but as a quantum state itself, specifically a multipartite state over time.
The framework introduced two axioms: Linearity in Initial State and Quantum Conditionability, from which a uniqueness theorem follows, connecting QSOT representations to Kirkwood-Dirac type quasiprobability distributions.
This is the external scientific anchor for the essay: the claim that QSOT has a legitimate mathematical source refers specifically to Lie and Fullwood's QSOT work, not to our implementation.
The mathematics is non-trivial and genuinely important. The QSOT framework positions time within quantum mechanics as a first-class object rather than an external parameter — a conceptually significant shift connected to temporal quantum-state formalisms [3, 4] and broader questions about causal structure in quantum theory [5].
We encountered this paper and made a planning document. That document contains the following sentences, which we wrote with full sincerity:
"Structure compatibility: 0.8–0.9 (axiom-based verification, Gate/Verifier structure)"
"Execution compatibility: 0.7–0.8 (modular design, easy ASDP integration)"
"Accessibility: not 'intuition' but mathematical uniqueness proof → optimal for TOE automation pipeline"
Read those sentences carefully. We assigned numerical scores to compatibility and accessibility as if we had a calibrated instrument for measuring the distance from "paper" to "working implementation." Those numbers were invented. They felt credible to us only because we wrote them in the format of a technical assessment.
What we were really feeling was not measurement.
It was discovery pressure.
We were not simply trying to implement a paper. We were imagining a pipeline in which papers became code, code exposed missing links, and those missing links became the next mathematical or scientific result. The TOE framing intensified this. A difficult physics paper no longer felt like something to study slowly; it felt like a component waiting to be connected into a larger machine.
That was the dangerous step.
It was like reading one difficult book by Nietzsche, understanding some of it, and then mistaking the intensity of the encounter for philosophical authority — as if comprehension at the level of reading confers capacity at the level of extension.
The middle step had not actually been completed.
That missing step mattered because it changed how we interpreted our own progress. Instead of recognizing the gap between reading a paper and being able to extend it, we began treating that gap as if it had already been crossed.
We only recognized the pattern later, after building governance systems specifically designed to expose this class of failure.
Part 4. The Narrative Gap
One of the more useful lenses came from the HIERARCHICAL CONTEXT & FLOW AUDIT (HCFA) framework — a governance tool we developed subsequently, precisely because failures like this one needed a structured diagnostic.
HCFA does not ask whether a document sounds convincing. It asks whether transitions remain justified across multiple scales: word to sentence, sentence to paragraph, paragraph to paragraph, section to section, and section to chapter. In practice, this kind of audit often reveals where a narrative has advanced faster than the underlying evidence.
Viewed through that lens, the planning document contained a structural discontinuity. The opening sections described a real paper and a real theorem. The later sections described implementation phases, integration pathways, and future scientific outputs.
The connective layer between those two states, which is the implementation gap analysis, was largely absent. Consequently, the roadmap appeared continuous only because the language was continuous, while the reasoning chain was not.
From that point forward, two distinct cognitive hazards emerged. They reinforced each other, but they were not the same mistake.
1. Planning-Document Colonization
This is the first cognitive hazard of ambitious AI-native development: the planning document colonizes the epistemic space that should be occupied by the implementation gap analysis. When you write a roadmap that says "Phase 2: Temporal Depth & Relativity — Q1 2025 milestone," the act of writing it creates a false sense that the path from here to there has been evaluated. It has not been evaluated. It has been narrated.
HCFA-style audits are useful here because they force a simple question: what paragraph, section, or artifact actually bridges the current state to the claimed future state? If no such bridge exists, the roadmap is functioning as a narrative device rather than an engineering assessment.
2. TOE-Induced Scale Blindness
The second cognitive hazard is specific to "Theory of Everything" framing. TOE is a phrase that suspends normal epistemic hygiene. When a project is positioned as contributing to a framework that unifies quantum mechanics and general relativity, the question "but does this specific function actually implement what it claims?" feels disproportionately small. The telescope ambition makes individual screws invisible.
This failure mode also becomes easier to detect when examined hierarchically. At the chapter level, the project appears aligned with a grand objective. At the section level, the milestones appear coherent. At the paragraph level, however, individual claims often lack direct evidentiary support. The larger the ambition, the easier it becomes for local verification failures to disappear inside global narratives.
We were not irrational. We were operating exactly as capable people operate when genuine intellectual excitement combines with AI acceleration, weak external friction, and the absence of a hard responsibility structure. This is not an excuse. It is the accurate diagnosis.
Part 5. The Anatomy of Failure — Evidence From Our Own Artifacts
We shipped the project with a full artifact suite: `gate_report.json`, `kd_quasiprob.json`, `memory_report.json`, `entanglement_report.json`, `trace.jsonl`, `raw_data.csv`, two LaTeX papers, an academic project page, a third-party code review response, and a scientific publication guide. These artifacts are the record. Let us read them honestly.
5.1 The KD Quasiprobability: The Core Claim Was Empty

The Kirkwood-Dirac quasiprobability distribution is central to the QSOT theoretical framework. Its negativity is often treated as an operational signature of non-classicality in settings involving incompatible observables. The entire project is, at its theoretical core, about detecting KD negativity.
This is the claim that should have forced implementation discipline: if KD negativity is central to the framework, then an empty `entries: []` artifact cannot be treated as a scientific output.
The `kd_quasiprob.json` artifact contained exactly that: an empty `entries: []` list, generated by code whose own source comment marked it as a placeholder.
The KD distribution was never computed. The core physics claim of the project, which connects to the Yunger Halpern et al. quasiprobability framework [7], was a placeholder with a comment that explicitly says "for visuals." There is no ambiguity here. The central scientific deliverable was a dummy value.
5.2 The Axiom Gate: The Test That Did Not Test
The gate report shows the Linearity check passing with a deviation of $2.2 \times 10^{-16}$.
The deviation is machine epsilon. Numerically impeccable. The Linearity axiom check (Axiom 1) is implemented correctly: it performs 16 Monte Carlo trials on random density matrices and verifies that the evolution acts linearly on mixtures of initial states.
This is Level 0: correct.
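A linearity check of this kind can be sketched as follows. This is a minimal illustration, not the shipped gate: the function names, the amplitude-damping channel, and the damping value 0.3 are assumptions chosen for the example; only the 16 trials on random density matrices come from the description above.

```python
import numpy as np

def random_density_matrix(rng, dim=2):
    """Sample a random density matrix (Hermitian, positive, trace 1)."""
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = a @ a.conj().T
    return rho / np.trace(rho).real

def amplitude_damping(rho, gamma):
    """Apply a single-qubit amplitude-damping channel via its Kraus operators."""
    k0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    k1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    return k0 @ rho @ k0.conj().T + k1 @ rho @ k1.conj().T

def max_linearity_deviation(channel, trials=16, seed=0):
    """Largest entrywise gap between channel(mixture) and mixture of channel outputs."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        rho1, rho2 = random_density_matrix(rng), random_density_matrix(rng)
        p = rng.uniform()
        lhs = channel(p * rho1 + (1 - p) * rho2)
        rhs = p * channel(rho1) + (1 - p) * channel(rho2)
        worst = max(worst, float(np.max(np.abs(lhs - rhs))))
    return worst

# Hypothetical channel and damping value, chosen only for illustration.
deviation = max_linearity_deviation(lambda r: amplitude_damping(r, gamma=0.3))
print(f"max deviation: {deviation:.2e}")
```

The check reads its inputs: swapping in a nonlinear map such as `lambda r: r @ r` produces a deviation far above rounding error, which is what makes a pass informative.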
The Conditionability axiom check (Axiom 2), however, does not test what it claims to test.
The function accepts `rhos`, which are the actual evolved quantum states from the simulation, and then never uses them. It substitutes a fixed maximally mixed state and checks trace preservation on that. The test has zero sensitivity to what the simulation actually produced. It would pass identically if the simulation had output garbage. This is a canonical example of Level 2 slop: executable, green-lit, meaningless.
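The pattern is easy to reproduce. The sketch below is a hypothetical reconstruction from the description above, not the shipped code: the function name, return keys, and tolerance are assumptions. It shows a check that returns the same verdict for a valid trajectory and for garbage.

```python
import numpy as np

def hollow_conditionability_check(rhos, tol=1e-12):
    """Hypothetical reconstruction of the failure pattern, not the shipped code.

    `rhos` is accepted but never read; the check runs on a fixed
    maximally mixed state instead.
    """
    fixed = np.eye(2) / 2.0  # maximally mixed single-qubit state
    deviation = abs(float(np.trace(fixed)) - 1.0)
    return {"pass": bool(deviation < tol), "deviation": deviation}

valid_states = [np.array([[0.5, 0.5], [0.5, 0.5]], dtype=complex)]
garbage_states = [np.full((2, 2), np.nan), np.array([[7.0, -3.0], [2.0, 9.0]])]

# Both calls return the same verdict, because the input never enters the computation.
print(hollow_conditionability_check(valid_states))
print(hollow_conditionability_check(garbage_states))
```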
5.3 The Memory Kernel: Zeros All the Way Down
The Transfer Tensor Method is a genuine framework for characterizing non-Markovian memory in open quantum processes. The paper explicitly claims TTM implementation. The memory profile is a flat zero vector across all time lags.
The TTM reference [2] supports the legitimacy of TTM as a framework; it does not validate our specific zero-valued implementation.
One could argue: the channels used in this run are Markovian, so zero is the correct answer. This argument is available. But we never stated that boundary explicitly, not in the code, not in the documentation, and not in the paper.
A reader examining the artifact has no way to distinguish "TTM correctly measured zero non-Markovianity" from "TTM was not meaningfully implemented." The absence of a claim boundary makes the measurement indistinguishable from a default.
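The difference between a measured zero and a default zero can be made concrete. The sketch below assumes a single-qubit amplitude-damping model and the standard transfer-tensor recursion $T_n = E_n - \sum_{m=1}^{n-1} T_{n-m} E_m$. The function names, the damping values, and the oscillating "revival" trajectory are hypothetical choices for illustration, not the shipped implementation.

```python
import numpy as np

def amplitude_damping_superop(gamma):
    """Superoperator of amplitude damping acting on row-major vec(rho)."""
    k0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    k1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    return sum(np.kron(k, k.conj()) for k in (k0, k1))

def transfer_tensors(maps):
    """Given dynamical maps E_1..E_N, return transfer tensors T_1..T_N."""
    tensors = []
    for n, e_n in enumerate(maps, start=1):
        t_n = e_n.copy()
        for m in range(1, n):
            # tensors[n - m - 1] is T_{n-m}; maps[m - 1] is E_m
            t_n = t_n - tensors[n - m - 1] @ maps[m - 1]
        tensors.append(t_n)
    return tensors

def memory_profile(maps):
    """Frobenius norms of T_2..T_N; all zero means no memory beyond one step."""
    return [float(np.linalg.norm(t)) for t in transfer_tensors(maps)[1:]]

steps = range(1, 6)
e1 = amplitude_damping_superop(0.1)

# Markovian case: the maps form a semigroup, E_k = E_1^k.
markov_maps = [np.linalg.matrix_power(e1, k) for k in steps]
# Assumed non-Markovian case: damping that oscillates, so E_2 != E_1^2.
revival_maps = [amplitude_damping_superop(np.sin(0.3 * k) ** 2) for k in steps]

markov_profile = memory_profile(markov_maps)
revival_profile = memory_profile(revival_maps)
print("markovian:", markov_profile)
print("revival:  ", revival_profile)
```

For the semigroup trajectory the profile vanishes up to rounding, so a flat zero vector can be a correct measurement. The same code on the oscillating trajectory gives nonzero entries, and it is that contrast, stated explicitly, that turns a zero into a result.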
5.4 The Figure Versus the Data: The Critical Inconsistency

This is the most structurally important finding.
Figure 1 in the PRA companion paper shows a beautiful result: quantum coherence (blue circles, left axis) smoothly decaying from 0.99 to near 0 as observer velocity sweeps from $\beta = 0$ to $\beta = 0.99$, while memory backflow (red crosses, right axis) rises from near 0 to 0.20. The "Causal Horizon" region is shaded. The figure is compelling. The paper's Table 1 cites specific values with Monte Carlo uncertainty estimates:
| $\beta$ | $C_{l_1}$ | $\mathcal{N}$ |
| --- | --- | --- |
| 0.0000 | 0.9876 ± σ | 0.0012 ± σ |
| 0.5211 | 0.6430 ± σ | 0.0292 ± σ |
| 0.8858 | 0.0898 ± σ | 0.1218 ± σ |
| 0.9900 | 0.0098 ± σ | 0.1835 ± σ |
In the `raw_data.csv` shipped as the reproducible dataset for this figure, every single row is zero. 20 velocity points. Zero entanglement. Zero non-Markovianity.

The repository-side `entanglement.png` output did not reproduce the manuscript’s beta-sweep figure. Despite the filename, it showed a separate L1-coherence-over-time plot, not the same velocity-sweep result used in the paper.
The reproducibility break is therefore not simply numerical. The manuscript figure, the distributed `raw_data.csv`, and the repository-side `entanglement.png` output do not describe the same reproducible path. The numbers in Table 1 do not come from `raw_data.csv`, and the repository-side plot follows a different L1-coherence-over-time output path. Whatever produced the manuscript figure was not bound to the distributed data artifact in a way an external reader could reproduce.
This is not a bug. It is not a calibration issue.
It is the signature of a pipeline where the "paper figures" and the "actual software output" were generated by entirely separate code paths: the former crafted to look like a result, the latter representing what the system actually computes.
The underlying reason is technically discernible: `raw_data.csv` stores Logarithmic Negativity, which is a bipartite entanglement measure. A single-qubit system, evolving under local Kraus operators, cannot be entangled with itself. LogNeg = 0 is mathematically correct for this architecture. The beautiful decay curve in Fig. 1 was generated using $l_1$-norm coherence, a single-qubit superposition measure, not entanglement. The paper then describes Figure 1 in a section titled "Relativistic Degradation," treating the coherence decay as if it represents the "quantum resource" identified in the QSOT theoretical framework. The measures were swapped, the figure was built separately from the data file, and the paper was written around the figure.
Artifact trace for this claim:
- Paper figure: `Fig1_Relativistic_decay.png`, the figure used in the PRA companion manuscript.
- Paper table: Table 1 in the PRA companion manuscript, which reports nonzero coherence and memory-backflow values across the $\beta$ sweep.
- Distributed data: `raw_data.csv`, which contains `velocity`, `entanglement`, and `non_markovianity` columns, with zero values across the sweep.
- Repository output: `entanglement.png`, generated by the runnable pipeline, but a separate L1-coherence-over-time plot, not the manuscript's beta-sweep Figure 1.
- Failure condition: the manuscript figure cannot be regenerated from the distributed data artifact without a separate, undocumented figure-generation path.
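The measure swap behind this trace can be checked numerically. The sketch below uses assumed example states (a single-qubit superposition, a product state with an ancilla, and a Bell state) and hypothetical helper names. It shows that $l_1$-norm coherence can be maximal where logarithmic negativity is zero, so one quantity cannot stand in for the other.

```python
import numpy as np

def l1_coherence(rho):
    """Sum of absolute values of the off-diagonal elements."""
    return float(np.sum(np.abs(rho)) - np.sum(np.abs(np.diag(rho))))

def log_negativity(rho_ab):
    """Logarithmic negativity of a two-qubit state via partial transpose on B."""
    r = rho_ab.reshape(2, 2, 2, 2)                   # indices: a, b, a', b'
    rho_pt = r.transpose(0, 3, 2, 1).reshape(4, 4)   # swap b <-> b'
    trace_norm = np.sum(np.abs(np.linalg.eigvalsh(rho_pt)))
    return float(np.log2(trace_norm))

plus = np.array([[0.5, 0.5], [0.5, 0.5]])    # single-qubit superposition
zero = np.array([[1.0, 0.0], [0.0, 0.0]])    # ancilla in |0><0|
product = np.kron(plus, zero)                # coherent but unentangled

bell = np.zeros((4, 4))
bell[np.ix_([0, 3], [0, 3])] = 0.5           # (|00> + |11>) / sqrt(2)

print("coherence of |+>:       ", l1_coherence(plus))
print("LogNeg of product state:", log_negativity(product))
print("LogNeg of Bell state:   ", log_negativity(bell))
```

Logarithmic negativity is only defined once a bipartition exists. For a single qubit there is nothing to partition, which is why the zeros in the data file and the curve in the figure describe different quantities.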
5.5 The Relativistic Boost: Level 3 Physics Fabrication
The core physical model of the project, and the one explicitly cited as the contribution connecting to special relativity, is the boosted damping formula implemented in `boost_damping_channel`.
That equation is treated here as the disputed object: a phenomenological amplitude-damping ansatz, not a result derived from relativistic quantum field theory.
The CPC paper claims: "Equation (1) is derived for amplitude-damping channels." This is the sentence that should stop a physicist. The time-dilation argument leading to this equation goes: if the damping parameter grows exponentially toward 1 in time (exponential decay), and if the observer's proper time is dilated by the Lorentz factor $1/\sqrt{1-\beta^2}$, then the boosted damping parameter is obtained by substituting the dilated time into that exponential.
This derivation is valid only for amplitude-damping channels with exponential Lindblad decay rates, under the assumption that the Lindblad equation maintains its form under the Lorentz boost — which requires that the noise coupling to the bath be invariant in the boosted frame.
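One way to write out the time-dilation argument described above is sketched here. It is a reconstruction from the prose only. The symbols $\gamma$ (rest-frame damping parameter), $\Gamma$ (decay rate), $\Lambda$ (Lorentz factor), and $\gamma'$ (boosted damping parameter) are notation assumed for this sketch, and the manuscript's Equation (1) may differ in form.

```latex
\begin{align}
\gamma(t) &= 1 - e^{-\Gamma t}
  && \text{(rest-frame damping, exponential decay)} \\
t &\;\to\; \Lambda\, t, \qquad \Lambda = \frac{1}{\sqrt{1-\beta^{2}}}
  && \text{(assumed time-dilation substitution)} \\
\gamma'(\beta) &= 1 - e^{-\Gamma \Lambda t} = 1 - \bigl(1-\gamma\bigr)^{\Lambda}
  && \text{(resulting ansatz)}
\end{align}
```

Under this reading, $\gamma'(0) = \gamma$, and $\gamma' \to 1$ as $\beta \to 1$ for any $0 < \gamma < 1$. The second line is the step that carries the whole argument: it is a substitution, not a consequence of a Lorentz-covariant treatment of the system-bath coupling.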
Relativistic quantum field theory does not generally support this assumption. Moving accelerated detectors interact with the Unruh thermal bath; inertial boosts in quantum field theory generate Bogoliubov transformations between particle modes [8]. The ad hoc time-dilation substitution in the Lindblad parameter does not follow from any rigorous Lorentz-covariant treatment of open quantum systems.
The paper's citation of Peres and Terno (2004) does not validate this formula. Peres and Terno analyze relativistic quantum information broadly, including Lorentz transformations, entropy, and the role of quantum field theory, but not the relativistic deformation of Lindblad decay parameters.
Alsing et al. (2006) analyzes entanglement degradation for Dirac fields in non-inertial frames, which is closer to the kind of quantum-field-theoretic treatment we did not implement.
Those references do not support our formula. They show the kind of relativistic quantum-information and quantum-field-theoretic treatment that would have been required. Our implementation was much narrower: a time-dilation-style substitution applied to an amplitude-damping parameter.
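The formula produces smooth, physically intuitive results. It passes through the correct limits: the unboosted damping parameter is recovered at $\beta = 0$ (rest frame), and the damping parameter approaches 1 as $\beta \to 1$ (infinite time dilation causes complete decoherence).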
It is Level 3 slop: mathematically coherent, physically unjustified, and dressed in the language of derivation.
Part 6. The Architecture of Self-Deception
Five mechanisms combined to produce the failure. They are worth naming precisely because they are not unique to us.
Mechanism 1: The Green Test as Epistemic Closure

When `pytest` exits with 28 passed, 0 failed, 82% coverage, a cognitive gate closes. The test suite becomes a proxy for correctness. But a test suite built by the same LLM session that built the code it tests has correlated failure modes: both the code and the test can be wrong in the same direction. The `check_axiom2_conditionability` test was green because the test checked whether the function returned `pass: True`, not whether the function was validating the right thing. Green tests are necessary but not sufficient. What matters is whether the tests have adversarial coverage: do they test the cases where the code would fail if the underlying logic were wrong?
Mechanism 2: Formal Register as Legitimacy Signal

The `SCIENTIFIC_PUBLICATION_GUIDE.md` file, shipped as part of the repository, contains the following section:
"Peer Review Preparation — Common Reviewer Questions: (3) 'Is your code reproducible?' — Answer: Hash-chained trace, seed-based RNG, version pinning"
This is a document written to prepare for peer review of work that had not yet been subjected to peer review. It anticipated the questions of experts and pre-formulated the answers. The effect of this document on us was to simulate the experience of having passed review.
The form of scholarly rigor was present. The substance was not. Every element of formality (LaTeX, BibTeX, Zenodo DOI, `SCIENTIFIC_PUBLICATION_GUIDE.md`, `CODE_REVIEW_RESPONSE.md`) was generated in the same LLM-assisted pipeline that generated the code. The review response answered a review that did not exist.
Mechanism 3: The Information Void Fills Silently

Research papers are not implementation manuals. They often define the mathematical object, theorem, or experimental claim, but leave many engineering choices implicit because those choices are outside the paper’s main purpose. The problem was not that the QSOT paper failed us. The problem was that we treated those unstated implementation choices as if an LLM could safely fill them in.
When HRPO (Hybrid Reasoning Policy Optimization, arXiv:2505.18454v2) was implemented as a runnable system — a later Flamehaven project that is documented in [dev.to post, Jan 2026] — the gap between the paper's described algorithm and a production-runnable codebase required dozens of implementation decisions. Each decision was made. Each was plausible. Some were wrong in ways only discovered later.
The QSOT project had a larger gap. The theoretical paper provided axioms, a uniqueness theorem, and a connection to KD distributions. It did not specify: how to compute KD distributions numerically; how to implement the relativistic channel boost from first principles; how to implement a Transfer Tensor Method with appropriate spectral truncation; what initial states and channel parameterizations make physically meaningful test cases.
LLMs fill information voids. This is their core competency. They produce plausible, syntactically correct, type-consistent implementations of underspecified requirements. The implementation of `boost_damping_channel` is an exact illustration: given the task "implement relativistic correction to quantum channel damping," the LLM produced a formula that is dimensionally consistent, physically intuitive, and derivable by analogy to classical time dilation. The analogy is not a theorem. But it generates a smooth curve, passes tests, and matches the qualitative expectation.
The information void was not filled with noise. It was filled with confident, coherent fabrication that was indistinguishable from correct implementation without domain expertise.
Mechanism 4: The Two-Figure Problem

The worst moment in the audit was not reading the code.
It was opening the data.
We had been looking at Figure 1 as if it were one of the strongest parts of the project. It looked like a real scientific result. The curve was smooth. The velocity sweep was intuitive. The causal-horizon shading made the story feel complete. Quantum coherence decreased as observer velocity approached relativistic limits. Memory backflow increased. The picture looked like physics.
Then we opened `raw_data.csv`.
Every row was zero.
Not noisy. Not weak. Not inconclusive.
Zero.
Twenty velocity points. Zero entanglement. Zero non-Markovianity. The actual data distributed with the repository could not reproduce the figure in the paper. It did not approximate the figure. It did not disagree slightly. It contradicted it completely.
That was the moment the project stopped being “possibly overstated” and became something else.
We had not merely failed to explain a result clearly. We had allowed two different pipelines to exist under one scientific story. One pipeline generated the paper figure. Another generated the reproducible artifact. The paper followed the figure. The repository shipped the data. They were not the same reality.
The technical reason can be explained. `raw_data.csv` stored Logarithmic Negativity, a bipartite entanglement measure. A single qubit evolving under local Kraus operators cannot be entangled with itself, so zero was mathematically unsurprising. The paper figure, however, was based on $l_1$-norm coherence, a single-qubit superposition measure. In other words, we had treated a coherence curve as if it supported a broader quantum-resource narrative while the reproducible entanglement data said nothing of the kind.
This is one of the most dangerous AI-native research failure modes: the figure generation, data generation, and manuscript writing can happen as separate fluent tasks, each locally coherent, none forced to reconcile with the others.
The final paper then inherits the appearance of unity.
But unity of language is not unity of evidence.
Mechanism 5: Ambition-Scale Mismatch

The project framing was "Theory of Everything pipeline." The implementation was: five Kraus channels applied sequentially to a density matrix, with a fixed initial state. The ratio between the claim and the machinery is not simply humble; it is structurally incoherent. No implementation of five 2×2 matrix operations can contribute to a Theory of Everything framework, regardless of how sophisticated the surrounding nomenclature is.
The mismatch was invisible to us at the time because we were reasoning about the project at the level of its ambition, not at the level of its implementation.
Part 7. We Are Not Alone — The Structural Problem in AI-Native Research Implementation
Our failure was specific in its artifacts, but not specific in its structure.
The DOI, the false KD artifact, the disconnected figure pipeline, the invented compatibility numbers, and the green tests that did not test the right thing were ours. But the underlying condition was larger than us. AI-native research implementation now makes it possible to generate code, tests, figures, documentation, citations, reproducibility claims, and publication-style wrappers faster than the claims inside them can be verified.
That is why this post-mortem cannot stop at QSOT.
If the failure had been only personal incompetence, the lesson would be simple: do not do what we did. But the more uncomfortable lesson is that the tooling environment now makes this failure pattern easy for competent people, teams, and agents to reproduce. The same fluency that helps us implement papers also helps us hide the gaps between implementation and understanding.

DeepCode is one of the most significant current demonstrations of the Paper2Code paradigm at scale. It presents a fully autonomous framework for document-to-codebase synthesis, using source compression, structured indexing, retrieval-augmented knowledge injection, and closed-loop error correction.
Its reported PaperBench-style performance is important not only because it shows how far automated reproduction has come, but because it clarifies the remaining risk: even strong Paper2Code systems still operate inside the gap between what papers specify and what implementations must decide.
We do not dispute the significance of that work. We observe what it reveals.
A system can outperform many surface-level reproduction baselines while still leaving the deeper epistemic problem unresolved. On research papers targeting cutting-edge results, the failure mode we experienced remains available: the information void is filled by confident, coherent, physically plausible fabrication.
The DeepCode architecture includes serious engineering responses to the information void problem. They reduce the gap. They do not eliminate it.
The fundamental issue is not engineering alone. It is epistemic. When a paper says "we implemented the Lorentz-boosted Kraus channel," the implementation requires decisions that the paper does not specify. An agent, however sophisticated, must make those decisions.
If the agent has no domain-level falsification mechanism, that is, an oracle that can distinguish "correct relativistic quantum information treatment" from "plausible analogical formula," the decisions will be made by plausibility, not by correctness.
The same pattern appears in our earlier Paper2Code work. When implementing HRPO (arXiv:2505.18454v2) as a production system [dev.to, Jan 2026], the gap between the paper's described GRPO variant and the actual implementation required resolving underdetermined hyperparameters, undefined initialization schemes, and implicit assumptions about the training distribution.
Each resolution was defensible. Correctness could only be established by comparing training dynamics against reported baselines, a comparison that required running the system for days, not hours. The high-formality artifact was present well before the correctness was established.
This is the structural condition of current AI-native research implementation:
The speed at which formal artifacts can be produced, including papers, tests, DOIs, benchmarks, figures, and reports, has outpaced the speed at which correctness can be verified.
In traditional software development, the lag between "it compiles" and "it is correct" is bridged by code review, staged deployment, and domain expert oversight.
In AI-native research implementation, the lag is compressed by the fluency of LLM output, which produces not just code but all the surrounding artifacts that signal correctness, such as test suites, documentation, comparison tables, and literature reviews, simultaneously. The result is a coherent artifact that has the epistemic structure of verified work without having undergone verification.
Part 8. What Is Real in the QSOT Framework
This is the part that matters most, and the part that makes Level 4 slop so dangerous: not everything was wrong.
What is genuinely correct:
- QSOT axioms are real science. The Linearity and Conditionability axioms, together with the uniqueness theorem connecting them to temporal quantum-state structure, are genuine mathematical results in Lie and Fullwood's QSOT work. Their connection to Kirkwood-Dirac type quasiprobability is also part of that result.
- The quantum-channel machinery is mostly correct. Kraus operators, CPTP evolution, and the $2 \times 2$ density-matrix simulations are implemented using standard quantum information methods [1].
- Axiom 1 actually works. The Linearity verification is correctly implemented and produces machine-precision agreement ($2.2 \times 10^{-16}$ deviation).
- TTM is a real algorithm. The Transfer Tensor Method itself is legitimate [2], and the implementation follows its basic structure, even if the chosen system makes the output uninformative.
- The coherence curve is mathematically consistent with the implemented model. Given the `boost_damping_channel` formula, the reported L1-coherence decay is computed consistently with the implemented dynamics. For the broader resource-theoretic meaning of L1 coherence, see Baumgratz, Cramer, and Plenio.
- The audit trail is real. The hash-chained `trace.jsonl` logging system functions as a tamper-evident record.
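For readers unfamiliar with the idea, a hash-chained log can be sketched in a few lines. This is a generic illustration of the technique, not the format of `trace.jsonl`: the field names, the genesis value, and the example events are assumptions.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain

def append_entry(chain, payload):
    """Append a record whose hash covers the previous hash and its own payload."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
    chain.append({"prev": prev_hash, "payload": payload, "hash": digest})

def verify_chain(chain):
    """Recompute every hash; any edited payload or broken link fails the check."""
    prev_hash = GENESIS
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

trace = []
append_entry(trace, {"step": 1, "event": "channel_applied"})
append_entry(trace, {"step": 2, "event": "gate_checked"})
print(verify_chain(trace))   # True: untouched chain

trace[0]["payload"]["event"] = "edited_after_the_fact"
print(verify_chain(trace))   # False: the stored hash no longer matches
```

A log like this makes later edits detectable. It says nothing about whether the values that were logged mean what the surrounding paper says they mean.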
What is fabricated:
- KD negativity was never computed. The core QSOT observable was replaced with a mock artifact.
- The memory-kernel claim is unsupported. The implementation does not produce meaningful non-Markovian TTM analysis.
- The relativistic boost is not derived physics. The formula follows from analogy and phenomenological reasoning, not Lorentz-covariant quantum channel theory.
- The paper's Table 1 is not reproduced by the distributed dataset. `raw_data.csv` contains zeros throughout.
The important point is that the correct and fabricated components are interwoven. The artifact is convincing precisely because many parts are genuinely correct. The axiom verification works. The coherence calculations work. The audit logging works. Those true components provide legitimacy cover for the parts that do not.
This is the defining property of High-Formality Slop: it is not uniformly false. It is selectively hollow, in exactly the places where hollowness is hardest to detect.
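For contrast with the mock artifact, the sketch below shows what computing a Kirkwood-Dirac negativity can look like in the simplest textbook setting: a single qubit and two fixed orthonormal bases. This is the standard two-basis KD distribution, not the temporal QSOT construction from Lie and Fullwood, and the function names are hypothetical.

```python
import math


def inner(u, v):
    """<u|v> for vectors given as sequences of (complex) numbers."""
    return sum(complex(x).conjugate() * y for x, y in zip(u, v))


def apply(rho, v):
    """Matrix-vector product rho|v> for a nested-list matrix."""
    return [sum(row[c] * v[c] for c in range(len(v))) for row in rho]


def kd_distribution(rho, basis_a, basis_b):
    """Q[i][j] = <b_j|a_i> <a_i|rho|b_j>; the entries sum to Tr(rho)."""
    return [[inner(b, a) * inner(a, apply(rho, b)) for b in basis_b]
            for a in basis_a]


def kd_negativity(q):
    """Sum of |Q_ij| minus 1; zero when Q is a genuine probability table."""
    return sum(abs(x) for row in q for x in row) - 1.0


def pure_qubit(theta):
    """Density matrix of cos(theta)|0> + sin(theta)|1>."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * c, c * s], [s * c, s * s]]


COMPUTATIONAL = [(1.0, 0.0), (0.0, 1.0)]
HADAMARD = [(1 / math.sqrt(2), 1 / math.sqrt(2)),
            (1 / math.sqrt(2), -1 / math.sqrt(2))]
```

With these bases, the state |0> gives zero negativity, while a state rotated by pi/8 gives a strictly positive value. A pipeline that reports identical output for both inputs is not computing this quantity.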
Part 9. Toward a Framework for Legitimate AI-Native Research Implementation
The failure is diagnosable. It has a treatment. But the treatment cannot be a generic checklist floating above the failure. Every rule below comes from one place where our own artifact broke.

1. Claim Boundary as a First-Class Artifact
This rule comes from the relativistic boost formula.
A claim boundary document for boost_damping_channel would have required us to write the sentence we avoided: "The derivation is valid only as a phenomenological amplitude-damping ansatz under an assumed proper-time scaling of Lindblad rates. It is not derived from Lorentz-covariant quantum field theory."
Writing that sentence would have changed the paper. It would have forced us to decide whether we were making a physics claim or presenting a phenomenological model. QSOT Compiler v1.2.3 did not have that boundary.
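One way to make such a boundary a first-class artifact is to store it as data and validate it before any result is interpreted. The following is a sketch only: the ClaimBoundary fields, the status vocabulary, and the validation rules are assumptions for illustration, not the schema used in the project.

```python
from dataclasses import dataclass

# Assumed status vocabulary; a real project would define its own.
ALLOWED_STATUS = {"derived", "phenomenological", "mock"}


@dataclass(frozen=True)
class ClaimBoundary:
    component: str
    status: str
    valid_under: tuple      # assumptions the result depends on
    not_claimed: tuple      # statements the component does NOT support
    derivation_ref: str = ""  # required only when status is "derived"


def boundary_problems(boundary):
    """Return a list of human-readable problems; empty means acceptable."""
    problems = []
    if boundary.status not in ALLOWED_STATUS:
        problems.append(f"unknown status: {boundary.status}")
    if boundary.status == "derived" and not boundary.derivation_ref:
        problems.append("status 'derived' requires a derivation reference")
    if not boundary.not_claimed:
        problems.append("boundary must state at least one thing it does not claim")
    return problems


# The boundary sentence quoted above, expressed as data.
BOOST = ClaimBoundary(
    component="boost_damping_channel",
    status="phenomenological",
    valid_under=("assumed proper-time scaling of Lindblad rates",),
    not_claimed=("derivation from Lorentz-covariant quantum field theory",),
)
```

The design choice is that "derived" is the expensive label: it cannot be set without pointing at a derivation, while "phenomenological" is cheap to declare and must be declared explicitly.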
2. The Adversarial Test Mandate
This rule comes from check_axiom2_conditionability.
The function accepted rhos and ignored them. A test suite existed. The test was green. The artifact looked validated. But the only adversarial test that mattered was absent: does the function fail when the supplied trajectory violates the claimed condition?
That is the rule now. For any function that makes a physics claim, at least one test must be written to fail in the direction the implementation is most likely to fake. If a conditionability check can pass while ignoring the evolved states, the test does not test conditionability. It tests our willingness to accept a label.
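The shape of such an adversarial test can be sketched in a few lines. The example below does not implement the Conditionability axiom. It uses unit trace as a deliberately simple stand-in property, and fake_check, trace_check, and is_input_sensitive are hypothetical names. What matters is the harness: a check earns trust only if it passes a valid trajectory and rejects a corrupted one.

```python
def fake_check(rhos):
    """Stand-in for a validator that accepts the trajectory and ignores it."""
    return True


def trace_check(rhos, tol=1e-9):
    """Stand-in property: every density matrix in the trajectory has trace 1."""
    for rho in rhos:
        trace = sum(rho[i][i] for i in range(len(rho)))
        if abs(trace - 1.0) > tol:
            return False
    return True


def is_input_sensitive(check, good_trajectory, bad_trajectory):
    """A check is meaningful only if it separates valid from violating input."""
    return check(good_trajectory) is True and check(bad_trajectory) is False


# A valid two-step qubit trajectory, and one whose second state is corrupted.
GOOD = [[[1.0, 0.0], [0.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]]
BAD = [[[1.0, 0.0], [0.0, 0.0]], [[0.7, 0.0], [0.0, 0.7]]]
```

Here is_input_sensitive(fake_check, GOOD, BAD) is False even though fake_check(GOOD) is True, which is exactly the situation a green test suite can hide.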
3. Figure-Data Pipeline Integrity
This rule comes from Figure 1.
The paper figure and the reproducible dataset described two different realities. One showed a smooth relativistic decay story. The other was zero all the way down. No publication figure should survive unless it can be regenerated from the distributed data by a named script in the repository.
This is not a cosmetic reproducibility rule. It is a self-deception rule. The moment a figure pipeline and a data pipeline separate, language begins to stitch them back together even when the evidence does not.
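A pipeline check in this spirit can be small. The sketch below assumes the plotted series and the distributed series are both available as CSV text with a shared column name. The column name "coherence" and all function names are hypothetical, and the real raw_data.csv layout is not assumed.

```python
import csv
import hashlib
import io


def load_column(csv_text, column):
    """Read one numeric column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def is_degenerate(values, tol=0.0):
    """True for an empty series or one that is zero everywhere."""
    return len(values) == 0 or all(abs(v) <= tol for v in values)


def fingerprint(values, digits=12):
    """Stable hash of a numeric series at a fixed printed precision."""
    text = ",".join(f"{v:.{digits}e}" for v in values)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def figure_matches_data(plotted, distributed):
    """The figure survives only if the shipped data is real and identical."""
    if is_degenerate(distributed):
        return False
    return fingerprint(plotted) == fingerprint(distributed)
```

Run as a release gate, a check like this fails loudly when the shipped dataset is all zeros, regardless of how plausible the figure looks.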
4. Measure-Claim Alignment
This rule comes from the entanglement/coherence swap.
Logarithmic Negativity, L1-norm coherence, and heuristic memory backflow are not interchangeable just because they can all be placed under a phrase like "quantum resources." Each measure answers a different physical question. A single-qubit coherence curve cannot silently stand in for a bipartite entanglement result.
The rule is simple: every measure must name the physical question it answers. If the measure changes, the claim must change with it.
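As an illustration of the rule, the sketch below pairs a real measure (the L1-norm of coherence, the sum of absolute off-diagonal density-matrix entries) with a small registry that records which question each measure answers. The registry structure, the question wording, and claim_is_supported are assumptions made for this example.

```python
def l1_coherence(rho):
    """L1-norm coherence: sum of |rho_ij| over off-diagonal entries."""
    n = len(rho)
    return sum(abs(rho[i][j]) for i in range(n) for j in range(n) if i != j)


# Hypothetical registry: each measure names its question and its requirements.
MEASURES = {
    "l1_coherence": {
        "question": "How much superposition does the state carry in a fixed basis?",
        "supports": {"coherence"},
        "min_subsystems": 1,
    },
    "logarithmic_negativity": {
        "question": "How entangled is the state across a bipartition?",
        "supports": {"entanglement"},
        "min_subsystems": 2,
    },
}


def claim_is_supported(measure, claim, n_subsystems):
    """A claim stands only if the measure answers it on a large enough system."""
    spec = MEASURES[measure]
    return claim in spec["supports"] and n_subsystems >= spec["min_subsystems"]
```

Under this registry a single-qubit coherence curve can support a coherence claim, but claim_is_supported("l1_coherence", "entanglement", 1) is False, so the swap has to be made explicitly or not at all.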
5. The Governance Layer
This rule comes from the whole artifact.
The Flamehaven Verification Ledger, consisting of the approximately fifty experiments in governance tooling, AI audit infrastructure, and claim-boundary enforcement that followed QSOT v1.2.3, was not a response to abstract concerns about AI quality. It was a response to this failure: the false KD artifact, the green conditionability test, the two-figure split, the boost ansatz, and the formal register that made all of it feel coherent.
The core insight driving that work is that in AI-native development, the primary human contribution has shifted from production to acceptance. We do not merely write the code; we inspect, refuse, correct, and bear final responsibility.
This means the human's primary tool is not only an editor or a compiler but also a governance framework that makes the gap between claimed and actual behavior visible.
Part 10. What Comes Next: QSOT V2 and the Ongoing Repair

The current QSOT V2 work exists because of this failure.
It is not a defense of QSOT v1.2.3. It is an ongoing attempt to separate what was real from what was falsely claimed, and to rebuild only on the parts that can survive explicit verification.
At the time of writing, the repair process is still underway.
The repair is not one change. It is a change in posture.
We are discarding the claim that code execution proves a new physical principle, the implied claim that KD negativity has been computed when it has not, and the habit of treating polished figures or formal reports as substitutes for reproducible data. We are also discarding the practice of presenting phenomenological channel assumptions as derived physics, and the habit of letting a single orchestration file accumulate scientific, numerical, and rhetorical authority.
What remains is narrower but stronger: the real QSOT axiom structure, standard quantum-channel machinery, artifact-first reporting, and the governance lesson that claims must be bounded before results are interpreted.
The current V2 work adds a Phase 0 Temporal-State Axiom Contract layer, required axiom checks, machine-readable claim boundaries, stress tests across multiple physical assumptions, and a phase-separated runner architecture. The purpose is not to prove the Time-as-State idea true. The purpose is to make clear what the system computes, what it merely simulates, and what it has no right to claim.
None of this makes the Time-as-State idea true.
What it may do, if completed successfully, is make our handling of the idea less dangerous.
This is the difference we failed to understand in v1.2.3: sophistication is not the opposite of slop. QSOT v1.2.3 was already sophisticated. The opposite of slop is traceable humility — knowing exactly what a system computes, what it merely simulates, and what it has no right to claim.
Conclusion
On December 23, 2025, a twelve-month ambition was compressed into six weeks of AI-assisted implementation, deposited on Zenodo with a DOI, and called done.
QSOT Compiler v1.2.3 had correct mathematics at its foundation (the QSOT axioms), correct numerical implementation of some components (Linearity verification, Kraus evolution, L1 coherence), and fundamental hollowness at its core: the KD quasiprobability was never computed; the relativistic boost formula was an analogy, not a derivation; the published figures could not be reproduced from the distributed data; the Conditionability (Axiom 2) verification did not test the axiom it claimed to test.
The artifact passed every surface-level quality signal: CI/CD green, 82% test coverage, Zenodo DOI, LaTeX papers, Docker deployment, hash-chained audit trail. It failed the one test that matters: can an independent expert reproduce the core claimed results from the distributed code and data? The answer is no.
We call this High-Formality Slop because the formality is not accidental. It is what LLMs are especially good at producing.
Formal register, structured documentation, complete test scaffolding, and referenced citations are not separate from the hollow implementation. They are often produced by the same fluent process that generated the implementation in the first place.
The danger is not that the slop is unrecognizable. The danger is that it is too recognizable. It looks exactly like legitimate scientific software. That is why it is convincing, and why detecting it requires deliberate adversarial examination rather than casual reading.
The DOI 10.5281/zenodo.18035432 remains archived. We do not retract it. We document it here, as the record against which subsequent work is measured.
The Flamehaven Verification Ledger began the day we recognized what we had built. Approximately fifty experiments later, the tooling to detect and prevent this class of failure is in production. The lesson cost us six months. We are writing it down so it does not cost others the same.
References
[1] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information, Cambridge University Press (2000; 10th Anniversary Edition, 2010). [Standard reference for Kraus operator formalism, CPTP maps, and axioms of quantum channels.]
[2] F. A. Pollock, C. Rodríguez-Rosario, T. Frauenheim, M. Paternostro, K. Modi, Non-Markovian quantum processes: Complete framework and efficient characterization, Phys. Rev. A 97, 012127 (2018). [Transfer Tensor Method / process-tensor framework for non-Markovian memory characterization.]
[3] S. H. Lie and J. Fullwood, Unique multipartite extension of quantum states over time, arXiv:2410.22630 (2024). [The direct QSOT source for Linearity in Initial State, Quantum Conditionability, uniqueness, and Kirkwood-Dirac type quasiprobability connection.]
[4] J. Cotler, C.-M. Jian, X.-L. Qi, F. Wilczek, Superdensity Operators for Spacetime Quantum Mechanics, JHEP 09 (2018) 093. [Related temporal/spacetime quantum-state formalism.]
[5] O. Oreshkov, F. Costa, Č. Brukner, Quantum correlations with no causal order, Nature Communications 3, 1092 (2012). [Framework for indefinite causal structure and correlations without predefined causal order.]
[6] M. Lostaglio, A. Belenchia, A. Levy, S. Hernández-Gómez, N. Fabbri, S. Gherardini, Kirkwood-Dirac quasiprobability approach to the statistics of incompatible observables, arXiv:2206.11783 (2022). [Kirkwood-Dirac quasiprobability and incompatible observables.]
[7] N. Yunger Halpern, B. Swingle, J. Dressel, The quasiprobability behind the out-of-time-ordered correlator, Phys. Rev. A 97, 042105 (2018). [Connection between Kirkwood-Dirac-type quasiprobability and OTOC physics.]
[8] S. Takagi, Vacuum Noise and Stress Induced by Uniform Acceleration, Progress of Theoretical Physics Supplement 88, 1–142 (1986). [Standard reference for vacuum noise, acceleration, and Hawking-Unruh effects.]
[9] A. Peres and D. R. Terno, Quantum information and relativity theory, Reviews of Modern Physics 76, 93 (2004). [Survey of relativistic quantum information; Lorentz transformations, entropy, and QFT constraints.]
[10] P. M. Alsing, I. Fuentes-Schuller, R. B. Mann, T. E. Tessier, Entanglement of Dirac fields in non-inertial frames, Phys. Rev. A 74, 032326 (2006). [QFT treatment of entanglement degradation for non-inertial observers.]
[11] Z. Li, Z. Li, Z. Guo, X. Ren, C. Huang, DeepCode: Open Agentic Coding, arXiv:2512.07921 (2025). [Paper2Code / document-to-codebase synthesis at scale.]
[12] T. Baumgratz, M. Cramer, M. B. Plenio, Quantifying Coherence, Phys. Rev. Lett. 113, 140401 (2014). [The L1-norm coherence measure used to interpret the coherence curve.]