We Built AI Verification Infrastructure. Then It Found Our Blind Spots.

Prologue: The Mirage & The Pivot

In the summer of 2024, our team began with an ambition that felt almost impossible from the outside: to use frontier AI for drug discovery, theoretical physics, and problems that seemed unreachable. We imagined systems that could help reason about spacetime geometries in a Theory of Everything framework and propose novel small-molecule bindings.

Within the first month, the boundary became clear. The models were powerful, but their claims were not something we could responsibly stand behind unless the path from input to result could be inspected. We were facing fluent equations, confident structural predictions, and polished scientific language without custody paths we could independently walk.

So we made a choice. We could continue producing evocative claims we could not verify, or we could build the infrastructure that would force our own work through the same scrutiny we would apply to anyone else’s.

We stopped. We built.

For the last two years, we pursued this work without revenue. The original ambition was larger than the system we are publishing now, but the work became more precise. We began turning model-generated reasoning into executable Python harnesses, custody records, and governance gates before we had stable language for what to call them.

The practical question became narrower: can scientific claims expose the path by which they were generated, and can biomolecular AI systems be evaluated for transparency before their outputs are treated as evidence?

We shared our progress publicly on platforms like dev.to, LinkedIn, Substack, Medium, and coderlegion, typically one to three times a week. Most of those posts did not travel far. They were part research log, part signal flare, and part attempt to decide what should remain protected as core IP and what should be released as open-source verification infrastructure.

Over time, the answer became clearer: protect the sensitive core, but publish the verification surfaces, audit artifacts, and reproducible paths wherever possible.

In the last few weeks, we intentionally reached out to more than four hundred researchers, engineers, and domain experts on LinkedIn. That was not passive networking, and it was not a sales campaign. After nearly two years of building mostly in silence, we wanted to put the work in front of people who were qualified to challenge it.

The fact that many accepted the connection did not validate the work, but it felt like the first small signal that the questions might matter outside our own room. The real purpose was to expose the ledger, the failures, and the audit paths to people who could correct us, pressure-test the system, and teach us where the assumptions were still weak.

We are not claiming to have solved drug discovery or the Theory of Everything. We are publishing the raw verification materials because we believe even a small piece of reliable infrastructure may help others working near those frontiers.

What the Flamehaven Verification Ledger Is

That choice — to verify before we believe — became the whole of what we built. Everything in the Flamehaven Verification Ledger answers one question:

Can this claim show the path by which it came to exist?

A mathematical or scientific claim earns operational trust here only once the path behind it can be inspected, reproduced, or challenged.

A biomolecular AI pipeline is trusted only once its disagreements are surfaced, not smoothed over.

A bioscience repository is cleared only once its safety is checked line by line, with no model in the loop.

Verification is not a feature here — it is the spine.

So the ledger is public and inspectable by design. Each run is published as a bounded artifact — inputs, a score, a report, a traceable custody path, and (when it happens) a failure note.

Not a product, not a stack of papers: the layer where a claim becomes checkable.

It holds three verification lanes today — plus a fourth surface we are beginning to open up:

EQA — Equation-to-Artifact. Turns mathematical and physical results into runnable, deterministic Python checks (e.g. string-theory β-functions at 200-bit precision; the OpenAI Erdős reproduction).

BAV — Biomolecular AI Validation. Judges whether a biomolecular AI pipeline — multi-model structure prediction, consensus, and governance — deserves trust, treating model disagreement as a safety signal rather than noise. (e.g. EXP-031, where AlphaFold3, AlphaFold2, Chai-1 and Boltz-2 disagreed on a 52-aa fold and the pipeline abstained; EXP-005, a truthful-null that rejected three Upadacitinib carriers in under two hours).

BSC — Bioscience Compliance. Our open-source scanner STEM-BIO-AI runs local, zero-execution safety and compliance scans of bioscience repositories and maps findings to the MIT AI Risk Repository and EU AI Act — deterministic, with no LLM in the scoring path. (e.g. yorkeccak/bio → T1 Quarantine, score 48; Runchuan-BU/BioClaw → T2 Caution, score 60).

Open resources (rolling out). The reusable pieces behind the lanes — verification frameworks, semantic models, and the working templates we build from — shared with the community as each is cleared for public release.

[Flamehaven Verification Ledger]

1. What Two Years Produced — and What It Cost Our Beliefs

This is the part that matters, and it is not a pitch. It is the record. Two years of runs left a body of results: some held up under scrutiny, and some broke beliefs we were attached to.

A verification engine earns its credibility from both — the results that survived show it produces something real; the failures show it will tell us when we are wrong. Publishing both halves, instead of only the flattering one, is our position.

Success 1: Geometry Overruled the Narrative

The same EQA lane that exposed a Weyl-curvature blind spot also gets something deeper exactly right: it judges a background by its geometry, not by the words attached to it. TOE-TEST-0004 is a 3-3-4 adversarial matrix built to pit narrative against topology.

Green–Schwarz anomaly cancellation requires the gravitational and gauge Pontryagin densities to match. The engine computes a residual, and a background is anomaly-free only when that residual vanishes: $\mathrm{tr}(R \wedge R) - \mathrm{tr}(F \wedge F) = 0$.

Two adversarial cases make the point:

A "scary" background that should fail, but passes. A Schwarzschild black hole carrying a single-plane gauge field sounds like an anomaly waiting to happen. The engine returns PASS: Schwarzschild is Ricci-flat and a single-plane field is topologically trivial ($F \wedge F = 0$), so the residual is exactly zero.

A "safe-sounding" background that should pass, but fails. Flat 10-D Minkowski space sounds harmless; with a two-plane gauge field it returns FAIL, because $\mathrm{tr}(F \wedge F) \neq 0$ while the gravitational density is zero. Without curvature there is nothing to cancel the gauge anomaly.

This is the thesis in miniature: the ledger does not reward a background for sounding safe, nor punish one for sounding dangerous.

Full report: eqa/archive/reports/TOE-TEST-0004.md
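The two adversarial cases can be reproduced in a toy form. The sketch below is not the EQA harness: it assumes an abelian gauge field given as constant strengths on coordinate planes, and it takes the gravitational density as an input (zero for both backgrounds, as the report states) rather than computing it from a metric. The function names are hypothetical.

```python
from itertools import combinations


def permutation_sign(indices):
    """Sign of the permutation that sorts a tuple of distinct indices."""
    sign = 1
    for i in range(len(indices)):
        for j in range(i + 1, len(indices)):
            if indices[i] > indices[j]:
                sign = -sign
    return sign


def wedge_square(two_form):
    """Nonzero components of F ^ F for F = sum of f_ab dx^a ^ dx^b (a < b)."""
    components = {}
    for (plane_p, f), (plane_q, g) in combinations(two_form.items(), 2):
        if set(plane_p) & set(plane_q):
            continue  # a repeated coordinate index makes the wedge vanish
        indices = plane_p + plane_q
        key = tuple(sorted(indices))
        # Each unordered pair of planes appears twice in F ^ F, hence the 2.
        value = 2 * f * g * permutation_sign(indices)
        components[key] = components.get(key, 0) + value
    return {k: v for k, v in components.items() if v != 0}


def gs_residual(gravitational_density, gauge_two_form):
    """Gravitational density minus gauge density, component by component."""
    gauge_density = wedge_square(gauge_two_form)
    keys = set(gravitational_density) | set(gauge_density)
    residual = {
        k: gravitational_density.get(k, 0) - gauge_density.get(k, 0) for k in keys
    }
    return {k: v for k, v in residual.items() if v != 0}


def verdict(residual):
    return "PASS" if not residual else "FAIL"


# Assumption: both backgrounds contribute no gravitational density.
NO_GRAVITATIONAL_DENSITY = {}

schwarzschild_single_plane = gs_residual(NO_GRAVITATIONAL_DENSITY, {(0, 1): 1})
flat_two_plane = gs_residual(NO_GRAVITATIONAL_DENSITY, {(0, 1): 1, (2, 3): 1})

print("single-plane field:", verdict(schwarzschild_single_plane))
print("two-plane field:", verdict(flat_two_plane))
```

A single plane leaves no pair of planes to wedge together, so the gauge density is empty and nothing needs cancelling. Two disjoint planes produce a nonzero four-form component that a zero gravitational density cannot offset.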

Success 2: Calibration Is Not Understanding

The most dangerous AI output is not the one that says "I don't know" — it is the one that is beautifully calibrated and still does not understand what it is looking at. EXP-028 was built to catch exactly that confusion in our own pipeline.

On the surface, the model looked excellent:

Brier score improved from 0.204 to 0.0056 after calibration (our gate: ≤ 0.01).

AUC = 1.0 — perfect discrimination.

By standard calibration and discrimination metrics, this is a model you would trust. But the cross-domain honesty gate refused it:

SR9 ≈ 0.26 (gate ≥ 0.80) — the reasoning did not hold up across chemistry, genomics, and proteomics.

DI2 ≈ 0.61 (gate ≤ 0.20) — high internal contradiction across inference steps.

So the pipeline did the rare thing: it returned "I cannot resolve this" and abstained, rather than let a well-calibrated number masquerade as comprehension.

Calibration measures whether confidence is honest about frequency; it says nothing about whether the model understands. Catching that gap — passing the honesty test by failing loudly — is the result.

(SR9 = Scientific Resonance, cross-domain consistency; DI2 = Dimensional Integrity, reasoning drift. Both advisory, not externally validated.)
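The gate logic described above can be sketched in a few lines. The thresholds are the ones quoted in this section; the Brier score is the standard mean squared error between predicted probabilities and binary outcomes. The function names and the ACCEPT / ABSTAIN labels are illustrative assumptions, not the pipeline's actual identifiers, and SR9 and DI2 are taken as given inputs because their definitions are internal.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    total = sum((p - y) ** 2 for p, y in zip(probabilities, outcomes))
    return total / len(probabilities)


# Gate thresholds as quoted in the text: (direction, limit).
GATES = {
    "brier": ("max", 0.01),  # calibration gate: at most 0.01
    "sr9": ("min", 0.80),    # cross-domain consistency: at least 0.80
    "di2": ("max", 0.20),    # reasoning drift: at most 0.20
}


def evaluate(metrics):
    """Return (decision, failed_gates); a single failed gate forces abstention."""
    failed = []
    for name, (direction, limit) in GATES.items():
        value = metrics[name]
        passed = value <= limit if direction == "max" else value >= limit
        if not passed:
            failed.append(name)
    return ("ABSTAIN" if failed else "ACCEPT", failed)


# EXP-028 values reported above: calibration passes, both honesty gates fail.
decision, failed_gates = evaluate({"brier": 0.0056, "sr9": 0.26, "di2": 0.61})
print(decision, failed_gates)
```

The design point is that the gates are conjunctive: a passing calibration number cannot buy back a failing consistency number.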

Success 3: A Repository Scored Against a Legal Traceability Rubric, Deterministically

The third result is the least glamorous and the most reusable: a real bioscience repository scored against external regulation by code anyone can re-run — no model in the loop, no judgment call we can hide.

Our open-source scanner STEM-BIO-AI (pip install stem-ai) reads observable signals and grades three rubrics:

S₁ README/intent evidence

S₂ repo-local consistency

S₃ code/bio responsibility.

They combine into a deterministic, weighted raw score, with a penalty when hard-coded credentials are found.

The part that gives the score teeth is not the average but the hard cap. A repository cannot earn a high tier by gaming the rubric: clinical-adjacency without an explicit disclaimer caps the score, and a T0 hard-floor caps it harder.

The final score maps to a fixed triage tier, such as T1 Quarantine or T2 Caution.

For yorkeccak/bio, with no credential penalty and no cap triggered, the final score is 48, which maps to T1 Quarantine.

Every rubric signal is cross-referenced to the MIT AI Risk Repository and EU AI Act Article 12.
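The weighted score, hard caps, and tier mapping described above have roughly the following shape. This is a sketch only: every weight, penalty, cap, and tier boundary below is a hypothetical placeholder, not the scanner's real configuration. The boundaries were chosen merely so that the two published scores (48 and 60) land in the tiers reported for them, and the rubric inputs in the example are invented.

```python
# All constants are hypothetical placeholders, not STEM-BIO-AI's real values.
WEIGHTS_PERCENT = (40, 30, 30)  # assumed weights for S1, S2, S3
CREDENTIAL_PENALTY = 10         # assumed deduction for hard-coded credentials
CLINICAL_CAP = 50               # assumed cap: clinical-adjacent, no disclaimer
T0_FLOOR_CAP = 30               # assumed, stricter T0 hard-floor cap

# Assumed tier boundaries (exclusive upper bounds), lowest tier first.
TIERS = ((40, "T0"), (55, "T1 Quarantine"), (70, "T2 Caution"))
TOP_TIER = "T3 or above"


def raw_score(s1, s2, s3, hardcoded_credentials=False):
    """Weighted combination of the three rubric scores, minus any penalty."""
    weighted = sum(w * s for w, s in zip(WEIGHTS_PERCENT, (s1, s2, s3))) / 100
    if hardcoded_credentials:
        weighted -= CREDENTIAL_PENALTY
    return max(weighted, 0.0)


def final_score(raw, clinical_without_disclaimer=False, t0_floor=False):
    """Apply the hard caps; a cap can only lower a score, never raise it."""
    score = raw
    if clinical_without_disclaimer:
        score = min(score, CLINICAL_CAP)
    if t0_floor:
        score = min(score, T0_FLOOR_CAP)
    return score


def tier(score):
    """Map a final score to a fixed triage tier."""
    for upper_bound, name in TIERS:
        if score < upper_bound:
            return name
    return TOP_TIER


# Invented rubric inputs, used only to show the cap overriding the average.
uncapped = final_score(raw_score(70, 60, 50))
capped = final_score(raw_score(70, 60, 50), clinical_without_disclaimer=True)
print(uncapped, tier(uncapped))
print(capped, tier(capped))
```

Because the caps are applied with `min` after the weighted sum, no combination of rubric scores can lift a capped repository into a higher tier.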

The point is not the single number — it is how the number is reached: a multi-dimensional read rather than one checklist score (README intent, repo-local consistency, dependency safety, exception handling, data provenance, clinical disclaimers).

This is layered with AST-level code quantification (ast.parse over the source — structure, not string-matching) and the clinical-adjacency hard cap as governance on top.

And it does all of this with zero runtime dependencies, no network, no GPU, and no model in the loop — nothing in the target repo is ever executed — so the entire audit runs on a modest laptop in seconds and reproduces from repository state.
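To illustrate what "structure, not string-matching" means in practice, here is a minimal sketch that counts a few structural signals with the standard-library `ast` module. The choice of signals and the function name are assumptions for illustration; the scanner's real rule set is not reproduced here. The source is parsed, never imported or executed.

```python
import ast


def count_signals(source):
    """Count structural signals in Python source without executing it."""
    tree = ast.parse(source)
    counts = {"functions": 0, "imports": 0, "bare_excepts": 0}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            counts["functions"] += 1
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            counts["imports"] += 1
        elif isinstance(node, ast.ExceptHandler) and node.type is None:
            # A handler with no exception type is a bare `except:`.
            counts["bare_excepts"] += 1
    return counts


SAMPLE = '''
import os

def load(path):
    try:
        return open(path).read()
    except:
        return None

# the text "except:" inside this comment is not a handler
NOTE = "except: inside a string is not a handler either"
'''

signals = count_signals(SAMPLE)
print(signals)
```

A substring search for `except:` would report three hits in this sample; the syntax tree reports one, because comments and string literals are not exception handlers.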

Failure 0: The Synthetic Data We Caught in Our Own Ledger

The most important failure was not in physics or biology. It was in our own record. An early build of the EQA dashboard shipped a "51-run calibration registry" that read as authoritative — until an external reviewer showed it was procedurally generated:

Fabricated primes — "split primes" computed as p = 101 + 4n, so p = 105 (which is 3 × 5 × 7) was labelled prime.

Fabricated structure — field degrees generated as 2^(2 + n mod 4).

Fabricated provenance & verdicts — [synthetic] hash labels, and at least one run showing a FAILED check sitting right beside a PASS verdict.

It was hallucinated scaffolding, not computation.

We did three things, in public. We deleted the entire synthetic registry. We replaced it with the real TOE-TEST foundational runs (0001–0051), importing the actual reports verbatim with only local workspace paths sanitized. And we added a deterministic synthetic_marker detector to the CI gate so a [synthetic] tag can never re-enter the published ledger.
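Both checks are small enough to sketch. The first reproduces the fabricated-prime defect: the formula p = 101 + 4n yields 105 at n = 1, and a trial-division test rejects it. The second is a minimal marker scan in the spirit of the synthetic_marker detector; the function names and the case-insensitive regular expression are assumptions, not the CI gate's actual code.

```python
import re

SYNTHETIC_MARKER = re.compile(r"\[synthetic\]", re.IGNORECASE)


def is_prime(n):
    """Trial division; adequate for small values like these."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def find_synthetic_markers(text):
    """Return (line_number, line) pairs for lines carrying a [synthetic] tag."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SYNTHETIC_MARKER.search(line)
    ]


# The fabricated registry generated "split primes" as p = 101 + 4n.
generated = [101 + 4 * n for n in range(4)]
not_prime = [p for p in generated if not is_prime(p)]
print("generated:", generated, "composite:", not_prime)

# Invented ledger text used only to exercise the detector.
ledger_text = "run-001 hash=ab12\nrun-002 hash=[synthetic]\nrun-003 hash=cd34"
hits = find_synthetic_markers(ledger_text)
print("markers:", hits)
```

In a CI gate, a non-empty `hits` list would translate into a non-zero exit code so the build fails before publication.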

This is the failure we are most willing to show, because it is exactly what the system was built to catch — including when the fabricator was us.

The replacement was not merely "honest emptiness." The real archive (TOE-TEST-0001–0051) carries the foundational physics and verification-methodology runs.

One of them, TOE-TEST-0004 ("Topology and Geometry Do Not Lie," graded S, 10/10), is the cleanest statement of this project's entire thesis. Its 3-3-4 adversarial matrix is built from cases where the narrative is wrong and the geometry is right:

The Unexpected PASS: A Schwarzschild-plus-gauge-field background that "should" anomaly-fail instead passes, because it is Ricci-flat and the gauge field is topologically trivial.

The Confident FAIL: A flat, "safe-sounding" background fails because Green–Schwarz cancellation requires curvature to offset a nonzero gauge density.

The ledger does not argue that geometry beats narrative; it runs the cases and shows it.

Failure 1: One-Loop Weyl Curvature Blindness (EQA Lane · TOE-TEST-0001 / T09)

The EQA (Equation-to-Artifact) engine evaluates spacetime geometries under the one-loop worldsheet β-functions of the non-linear σ-model, whose vanishing serves as the equations of motion for superstring backgrounds.

We stress-tested this engine by inputting a "Planck-scale spacetime foam Schwarzschild metric," expecting a rapid gate failure due to extreme curvature near the horizon. Instead, the solver returned a clean PASS.

A Schwarzschild black hole is a vacuum solution to the Einstein field equations, meaning its Ricci tensor vanishes identically everywhere outside the central singularity. Because the one-loop worldsheet β-functions couple directly to the Ricci tensor $R_{\mu\nu}$ and not to the Weyl curvature tensor $C_{\mu\nu\rho\sigma}$, all EQA metric residuals vanished.

The solver returned zero residuals and a PASS, completely ignoring the developer's descriptive prompt. This demonstrated a narrow but important form of narrative immunity: the engine executes only coordinate mathematics and remains deaf to linguistic hype.

However, it exposed a major physical boundary of our EQA engine: the one-loop solver is entirely blind to Weyl/tidal curvature. The physical curvature near a Planck-scale horizon is dictated by coordinate-invariant Weyl curvature, measured by the Kretschmann scalar, which for Schwarzschild is $K = R_{\mu\nu\rho\sigma} R^{\mu\nu\rho\sigma} = 48 G^2 M^2 / (c^4 r^6)$.

This Weyl/tidal curvature scales as $M^{-4}$ at the horizon, so it reaches Planck scale for a Planck-mass black hole, and it diverges at the singularity, even though the Ricci tensor stays identically zero. Under such curvature, a one-loop Ricci-based acceptance gate should not be treated as sufficient, yet the engine passed it.
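The size of the blind spot is easy to quantify. For Schwarzschild the Kretschmann scalar is the standard result K = 48 M² / r⁶ in geometric units (G = c = 1), and the horizon sits at r = 2M. The sketch below uses exact fractions; the function names are hypothetical and this is not part of the EQA harness.

```python
from fractions import Fraction


def kretschmann(mass, radius):
    """Schwarzschild Kretschmann scalar K = 48 M^2 / r^6 (G = c = 1)."""
    mass, radius = Fraction(mass), Fraction(radius)
    if radius <= 0:
        raise ValueError("radius must be positive")
    return 48 * mass ** 2 / radius ** 6


def horizon_kretschmann(mass):
    """K evaluated at the horizon r = 2M, which simplifies to 3 / (4 M^4)."""
    return kretschmann(mass, 2 * Fraction(mass))


# The Ricci tensor is zero for every mass; the tidal curvature is not.
for mass in (Fraction(2), Fraction(1), Fraction(1, 2)):
    print(f"M = {mass}: K(horizon) = {horizon_kretschmann(mass)}")
```

Halving the mass multiplies the horizon curvature by sixteen, while a Ricci-only residual stays at zero throughout. That is the quantity a one-loop gate cannot see.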

This is a specific blind spot, not a broken engine.

On the same ten-case battery the solver correctly fails all four de Sitter backgrounds (T06–T10) with identical residuals, reproducing the Maldacena–Núñez no-go theorem computationally.

It also recovers the Wess–Zumino–Witten exact-CFT cancellation. The engine is correct wherever one-loop Ricci geometry is sufficient; T09 marks precisely where that sufficiency ends.

We have not fixed this. The correction requires integrating Gauss-Bonnet terms to generate non-zero residuals for Ricci-flat backgrounds with high tidal forces. The full physics analysis remains open for inspection in the public foundational ledger report: eqa/archive/reports/TOE-TEST-0001.md.

Failure 2: An Honest Rejection We Cannot Yet Confirm (BAV Lane · EXP-005)

The hazard in computational drug discovery is not a model that declares uncertainty — it is a confident output trusted as evidence before its reasoning is inspected. EXP-005 is the inverse case: a fast, confident rejection that we cannot yet independently confirm is correct.

Our internal candidate-generation engine (RExSyn) screened three lipid carriers for topical Upadacitinib, a selective JAK1 inhibitor: a solid lipid nanoparticle (SLN), a nanostructured lipid carrier (NLC), and a liposomal gel.

Each was scored by an internal honesty gate:

SR9 (Scientific Resonance) is an advisory heuristic for cross-domain consistency — does the reasoning hold up across chemistry, genomics, and proteomics? Higher is better; our bar is ≥ 0.80. It is not an externally validated physical quantity.

The scores came back far below the bar:

SLN — SR9 ≈ 0.28

NLC — SR9 ≈ 0.23

Liposomal gel — SR9 ≈ 0.26

Every candidate was rejected in under two hours, against what we estimate would have been months of bench work. The value was in what was not built.

But because SR9 is advisory, we do not know whether a score of about 0.25 reflects real formulation incompatibility or an over-conservative false reject.

A fast negative is only useful if it is a correct negative — and that is the one claim we cannot yet stand behind on our own.

Raw per-carrier metrics: bav/exp-005/manifest.json

Failure 2b: When the Models Themselves Disagree

The complement to a confident-but-wrong prediction is a set of predictions that quietly disagree. EXP-031 stress-tested a 52-amino-acid target outside the models' comfortable distribution across four predictors — AlphaFold2, AlphaFold3, Chai-1, and Boltz-2.

They did not converge. pTM stayed low and the inter-model consensus drifted across arms.

The honest outcome here is not a number — it is a refusal: the pipeline returned Unverified (Drift Detected) and an observer-only (KEEP_OBSERVER) decision for every arm, treating the disagreement itself as the safety signal rather than averaging four models into false confidence.

The underlying predictors "failed"; the ledger's job was to refuse to launder that into a result. The divergence is re-runnable (input FASTA, pinned seed 20260208, model versions) from bav/exp-031/reproduce/.

Failure 3: The Multiplicative Reliability Fallacy

The BAV pipeline models the cumulative end-to-end reliability of biomolecular prediction runs using a multiplicative stage-wise formula: the product of the per-stage reliability factors.

These factors are not independent. Downstream stages depend heavily on structural consensus accuracy, which in turn depends on sequence capture. Treating these variables as independent scalars is mathematically naive and can systematically misestimate failure risk.

The correct formulation must use chained conditional probabilities, with each stage's reliability conditioned on the outcomes of the stages before it.

We have not implemented this yet. Every reliability figure currently published in the BAV lane carries this known mathematical constraint.
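A two-stage toy case shows how far the two formulations can diverge. The joint distribution below is invented purely for illustration and does not come from any BAV run; the stage labels follow the text (sequence capture, then structural consensus). Exact fractions keep the arithmetic checkable by hand.

```python
from fractions import Fraction

# Invented joint distribution over (sequence_capture_ok, consensus_ok).
# Consensus succeeds far more often when sequence capture succeeded.
JOINT = {
    (1, 1): Fraction(72, 100),
    (1, 0): Fraction(8, 100),
    (0, 1): Fraction(2, 100),
    (0, 0): Fraction(18, 100),
}


def marginal(joint, stage):
    """Probability that one stage succeeds, ignoring the other."""
    return sum(p for outcome, p in joint.items() if outcome[stage] == 1)


def naive_reliability(joint):
    """Product of marginals: valid only if the stages are independent."""
    return marginal(joint, 0) * marginal(joint, 1)


def chained_reliability(joint):
    """Chain rule: P(capture) * P(consensus | capture)."""
    p_capture = marginal(joint, 0)
    p_consensus_given_capture = joint[(1, 1)] / p_capture
    return p_capture * p_consensus_given_capture


naive = naive_reliability(JOINT)
chained = chained_reliability(JOINT)
print("naive product:", naive, "chained:", chained)
```

With these numbers the independent product gives 74/125 (0.592) while the chained value is 18/25 (0.72), which is simply the probability that both stages succeed. The size and sign of the gap depend on how the stages are coupled, so the direction seen here should not be generalized.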

2. From Weekly Experiments to Public Custody Paths

The value of this ledger is not only that it exposes past work. It changes what we can do next. Before this infrastructure existed, every experiment ended as a local folder, a private note, or a claim that required too much explanation. Now a weekly experiment can become a public record: a bounded run, a score, a report, an artifact path, and a failure note that others can inspect.

That is different from a formal paper, and it serves a different stage of the research process.

A paper argues. A ledger operates.

It gives the work a cadence, a memory, and a surface that others can challenge without waiting for us to package every result as a finished manuscript.

We do not see this as a replacement for peer review. We see it as the layer before peer review: the place where claims are made inspectable, where failures remain visible, and where weekly experiments can accumulate into something more durable than isolated announcements.

Operationally, that surface is built around custody paths. We do not claim to have "solved" traceability, nor do we promise that every result is bit-exact reproducible.

Instead, we expose the custody paths of every run in a public repository so that any researcher can trace how a number was generated and challenge its validity.

Conditional DERIVED Status: Under our credibility playbook, a research score is classified as DERIVED strictly while the linked code is runnable by others and yields the same result. The mere existence of a public repository is not sufficient; a third party must be able to execute it. Otherwise, it reverts to ADVISORY-HEURISTIC.

Stochastic Non-Determinism: Fold outputs vary with seed, GPU architecture, MSA depth, and compiler version. We handle this with a re-run scaffold (bav/exp-031/reproduce/): we anchor the input FASTA (taking a SHA-256 of the input, never the stochastic 3D output), pin the seed and model versions, and record our pLDDT/PAE/pTM as a labelled reference run for regime-level comparison — same confidence regime, not a bit-exact match. (A computed statistical tolerance band from a multi-seed bootstrap is a documented next step, not yet implemented — so we do not claim one.)
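The anchoring step can be sketched with the standard library alone. The idea, as described above, is to hash the deterministic input rather than the stochastic output and to record it next to the pinned seed and model versions. The whitespace normalization, the manifest field names, the placeholder sequence, and the version strings below are all assumptions; only the seed value comes from the text.

```python
import hashlib
import json


def fasta_digest(fasta_text):
    """SHA-256 of the input FASTA after normalizing line endings and padding."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    normalized = "\n".join(line for line in lines if line)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_manifest(fasta_text, seed, model_versions):
    """Bundle everything a third party needs to attempt the same run."""
    return {
        "input_sha256": fasta_digest(fasta_text),
        "seed": seed,
        "model_versions": dict(sorted(model_versions.items())),
        "comparison": "regime-level, not bit-exact",
    }


# Placeholder sequence and version labels: NOT the EXP-031 target or pins.
EXAMPLE_FASTA = ">toy_target\nMKTAYIAKQR\nQISFVKSHFS\n"
EXAMPLE_VERSIONS = {"predictor_a": "unpinned-example", "predictor_b": "unpinned-example"}

manifest = build_manifest(EXAMPLE_FASTA, seed=20260208, model_versions=EXAMPLE_VERSIONS)
print(json.dumps(manifest, indent=2))
```

Hashing the input gives reviewers a stable identity for the run even when two honest re-runs produce different coordinates. Whether the normalization step belongs there is a design choice: it makes the digest robust to line-ending differences at the cost of hashing something other than the raw bytes.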

The repository layout represents the complete audit trail of these custody paths:

The EQA lane runs Python harnesses with mpmath at 200-bit precision.

The BAV lane publishes bounded run artifacts, structural metrics, and advisory scoring outputs for external inspection.

The BSC scanner executes locally with no network access, parsing syntax trees and dependency chains without running any external code.

3. The Self-Censoring Repository: MICA + a CI Sanitizer Gate

Two layers keep the ledger honest before anything is published, and they are worth separating cleanly.

MICA (Memory Invocation & Context Archive for AI) is the governance memory — mica.yaml plus a machine-readable archive of Design Invariants (verification-ledger.mica.archive.json) and an operating playbook + credibility architecture. It encodes what the project must always obey: internal metrics carry no external authority; no fabricated data; understatement on the public surface. MICA is the constitution, not the police.

The sanitizer is the enforcement — sanitizer/sanitize_ledger.py, a deterministic, dependency-light tool that MICA governs and that runs as a CI gate (.github/workflows/opsec-sanitize.yml) on every push. It fails the build if it finds:

Workspace / PII leaks — Windows, UNC and POSIX absolute paths containing workspace markers are collapsed to [workspace]/…; locale-specific (Hangul) strings and IP / email / secret patterns are flagged. It is marker-gated, so public URLs and relative paths are never touched.

Our own marketing language — a promotional_language detector flags superlatives ("revolutionary," "state-of-the-art," "authoritative") on the public-facing surface, after stripping HTML/CSS so it never false-fires on position: absolute.

Fabrication tags — the synthetic_marker detector (added after Failure 0) blocks any [synthetic] label from reaching publication.

Findings flow through typed Finding / FileResult structures and functional _FIX / _DETECT registers; every run appends to scan_history.jsonl, from which calibration.json compiles per-rule frequencies and false-positive candidates that a human can promote to an allowlist.
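As an illustration of the promotional-language rule, the sketch below strips style blocks, inline style attributes, and tags before matching, so that CSS such as position: absolute cannot trigger a finding. It is not the code in sanitizer/sanitize_ledger.py. The `Finding` fields, the regular expressions, and the inclusion of "absolute" in the term list are assumptions (the last inferred from the false-positive the text mentions).

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    match: str
    offset: int  # position within the stripped text, not the raw HTML


# Three terms come from the text; "absolute" is an assumed addition.
PROMO_TERMS = ("revolutionary", "state-of-the-art", "authoritative", "absolute")

STYLE_BLOCK = re.compile(r"<style\b[^>]*>.*?</style>", re.IGNORECASE | re.DOTALL)
STYLE_ATTR = re.compile(r"""\sstyle\s*=\s*("[^"]*"|'[^']*')""", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
PROMO = re.compile(
    r"(?<![\w-])(" + "|".join(re.escape(t) for t in PROMO_TERMS) + r")(?![\w-])",
    re.IGNORECASE,
)


def visible_text(html):
    """Drop CSS and markup so only reader-facing words remain."""
    text = STYLE_BLOCK.sub(" ", html)
    text = STYLE_ATTR.sub(" ", text)
    return TAG.sub(" ", text)


def detect_promotional(html):
    """Return one Finding per promotional term in the visible text."""
    text = visible_text(html)
    return [
        Finding("promotional_language", m.group(1).lower(), m.start())
        for m in PROMO.finditer(text)
    ]


PAGE = (
    "<style>.hero { position: absolute; }</style>"
    '<p style="position: absolute">A revolutionary, state-of-the-art ledger.</p>'
)
findings = detect_promotional(PAGE)
print([f.match for f in findings])
```

Stripping before matching keeps the rule simple: the detector never needs to know about CSS, it only ever sees prose.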

The net effect is a repository that censors its own hype, its own leaks, and its own fabrications before they reach the public — and keeps an auditable record of having done so.

4. Dashboard Heuristics: Labeled Explicitly As What They Are

Our dashboard plots EQA metrics against an exponential decay curve.

Status: ADVISORY-HEURISTIC
Crucial Methodological Caveat: This exponential decay model is not a physical law derived from superstring worldsheet geometry. It is strictly an empirical, visual fitting heuristic used for dashboard charting. Representing this curve as an absolute mathematical law would violate MICA's database integrity rules by letting unvalidated content back into the ledger.
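For readers who want to see what an empirical fit of this kind involves, here is a minimal sketch. It assumes the generic form y = A·exp(−k·x) and fits it by ordinary least squares on log(y); both the functional form and the fitting method are assumptions, since the dashboard's actual expression and procedure are not reproduced here. The returned record carries the advisory label so the fit cannot be mistaken for a derived result.

```python
import math


def fit_exponential_decay(xs, ys):
    """Fit y = A * exp(-k * x) by linear regression on log(y)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    if any(y <= 0 for y in ys):
        raise ValueError("log-linear fitting requires positive y values")
    logs = [math.log(y) for y in ys]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_log = sum(logs) / count
    spread = sum((x - mean_x) ** 2 for x in xs)
    if spread == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mean_x) * (v - mean_log) for x, v in zip(xs, logs)) / spread
    return {
        "A": math.exp(mean_log - slope * mean_x),
        "k": -slope,
        "status": "ADVISORY-HEURISTIC",  # a chart aid, not a physical law
    }


# Invented points lying exactly on 2 * exp(-0.5 * x), used only as a check.
sample_x = [0.0, 1.0, 2.0, 3.0]
sample_y = [2.0 * math.exp(-0.5 * x) for x in sample_x]
fit = fit_exponential_decay(sample_x, sample_y)
print(fit)
```

A good fit here says only that the points are well described by a curve of this shape over the plotted range; it carries no claim about why.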

5. The Controlled Public Audit Brief (What We Need You to Test)

The ledger is not a finished product. It is a controlled public audit brief addressed to the scientific community in a language that any researcher can verify. We need domain experts to audit highly isolated parts of the system:

To Formulation Scientists & Medicinal Chemists: Audit BAV EXP-005. Do the advisory resonance scores (SR9 ≈ 0.23 to 0.28) reflect genuine incompatibility of the SLN / NLC / liposomal carriers with topical Upadacitinib, or an over-conservative false reject by the gate? The raw per-carrier metrics are in bav/exp-005/manifest.json.

To Structural Biologists: Audit BAV EXP-031. Is the multi-model non-convergence on the 52-aa target (low pTM, inter-model consensus drift) genuine out-of-distribution behavior that justifies abstention, or an artifact of our harness? Re-fold from the scaffold: bav/exp-031/reproduce/.

To String Theorists: Audit EQA TOE-TEST-0001 (T09). Is our one-loop worldsheet β-function solver physically correct in the Ricci-flat vacuum regimes where we claim results? The physics analysis is detailed in eqa/archive/reports/TOE-TEST-0001.md.

To Regulatory and Compliance Experts: Audit BSC yorkeccak-bio and BioClaw. Is our zero-execution Article 12 compliance mapping legally defensible or technically approximate? The interface includes a steerable sandbox that allows policy-weight adjustments and recalculation of compliance ratings. We need reviewers to test whether that flexibility improves auditability or introduces arbitrary scoring behavior. Audit the logic and findings in stem-bio-ai/bioclaw/2026-5-21/Runchuan-BU_BioClaw_experiment_results.json.

All code, schemas, and run payloads: Zenodo DOI 10.5281/zenodo.20483364

GitHub Repository: Flamehaven Verification Ledger

Citation ORCID: 0009-0009-2641-4280

Epilogue: The Invitation

We began by trying to reach the unreachable — spacetime geometries, novel small-molecule bindings, a Theory of Everything. We did not get there, and this ledger is not that. What we found instead was smaller, and we think more useful: a way to make a claim show the path by which it came to exist.

That matters more now than when we started. As AI fills every field with fluent, confident, publication-shaped output, the quiet casualty is the experiment itself — not the claim about it, but the runnable artifact, the failed run nobody deleted, the number you can trace back to the code that produced it.

We are an AI-native team; we felt that pull from the inside, and for a while we produced exactly what people are growing tired of. Then our own ledger caught us doing it (Failure 0), and we had to decide whether the verification layer was real or just another story we told about ourselves.

So we are not announcing a solution, and we are not claiming to have solved drug discovery or the Theory of Everything. We are publishing a small, inspectable surface — failures included — and handing it to the people best equipped to break it.

If it proves useful against the confident noise filling everyone's feeds, that will be your verdict, not our claim.

We wrote for two years before anyone was reading.
Some of you are reading now.
We built this to learn from you.