
Can AI Review Physics? Yes — That Is Why We Built SPAR
SPAR is a deterministic framework for claim-aware review: checking whether an output deserves the claim attached to it.

Most review systems answer a familiar question:
Did the system still produce the expected output?
SPAR is built for a narrower and more dangerous one:
Does the output still deserve the claim attached to it?
That is the split. Not reliability alone, but admissibility.
In practical terms, admissibility means claim-worthiness: whether a result justifies the interpretation, governance status, or scientific statement built on top of it.
A system can be reliable and inadmissible at the same time.
A physics engine can compute beta_G_norm, return zero, pass regression, and stay green across the whole pipeline. The report can still say the background is admissible. But if the function producing beta_G_norm is a stub that always returns zero, the output is stable while the claim attached to it is false. That is not hypothetical. It is one concrete class of review failure SPAR was designed to surface.
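The failure class is easy to reproduce in miniature. In this sketch, `beta_G_norm` stands in for the real engine function, and the `IMPLEMENTATION_STATE` table is an illustrative assumption, not the engine's actual API:

```python
# A stub implementation: stable, deterministic, and wrong to claim from.
def beta_G_norm(background):
    """Placeholder: always reports a vanishing normalized beta function."""
    return 0.0  # stub value, not a computed result

# An ordinary regression check: passes for every input, stays green forever.
def regression_check():
    return all(beta_G_norm(bg) == 0.0 for bg in ["flat", "curved", "lattice"])

# A claim-aware check: "the background is admissible" is only justified
# if the producing path is a genuine implementation, not a stub.
IMPLEMENTATION_STATE = {"beta_G_norm": "stub"}

def claim_admissible(fn_name):
    return IMPLEMENTATION_STATE[fn_name] == "genuine"

print(regression_check())               # True: regression stays green
print(claim_admissible("beta_G_norm"))  # False: the claim is inadmissible
```

The two checks disagree because they review different objects: one reviews the output, the other reviews the claim.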
What SPAR Is

SPAR (Sovereign Physics Autonomous Review) is a deterministic framework for claim-aware review.
It does not replace unit tests. It does not replace regression benchmarks. It does not replace scoring systems. It reviews a different object:
- the output
- the claim attached to that output
- the implementation state behind it
- the maturity state that should travel with it
SPAR started inside Flamehaven-TOE, an open physics simulation and AI governance engine. It has since been extracted into a standalone open-source framework: github.com/flamehaven01/SPAR-Framework.
The framework includes a generic review kernel, an explicit score and verdict policy, registry-backed review surfaces, and a first domain adapter for physics. Physics is where this review model was first stress-tested, but it is not the limit of the framework.
The core idea is simpler than the name: an output can pass while the claim attached to it drifts.
Why Ordinary Review Is Not Enough

Ordinary review is usually shallow by necessity. It asks questions like:
- did the code run
- did the output shape stay valid
- did the score remain within bounds
- did regression remain green
Those are necessary checks. They are not always enough.
The failure SPAR cares about is not, in the first instance, a crash. It is not even always a wrong number. It is this:
- the code executes
- the output looks plausible
- the tests pass
- and the interpretation is still overstated or structurally false
That failure can appear in several ways: a placeholder implementation returns stable-looking values; a maturity registry stays stale after the implementation improves; a score looks smooth before its epistemic basis is strong enough to justify the interpretation attached to it; an approximation gets reported as closure.
None of these failures is spectacular. That is exactly why they are easy to miss.
A Minimal Divergence

The clearest way to see the difference is in review form.
Ordinary regression says the system still works. SPAR says the system may no longer be describing its own computation truthfully.
SPAR is not "tests, but harsher." It can produce a different review outcome even when ordinary regression remains green. In such a case, the required action is not rejection but reclassification. That is not the same as testing harder. It is reviewing a different object.
The Core Mismatch Classes

SPAR treats three mismatch classes as first-class review objects.
Anchor mismatch — the output conflicts with a declared analytical or contractual anchor.
Interpretation mismatch — the report language claims more than the implementation state justifies.
Maturity mismatch — the implementation, registry, and outward-facing claim have drifted out of sync.
Ordinary review mostly checks whether a system still passes. SPAR checks whether the result is still being described honestly.
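Treating the mismatch classes as first-class objects can be made concrete. A minimal sketch; the enum values mirror the prose above, but the types are an assumption, not SPAR's actual data model:

```python
from dataclasses import dataclass
from enum import Enum

class MismatchClass(Enum):
    ANCHOR = "anchor"                  # output conflicts with a declared anchor
    INTERPRETATION = "interpretation"  # report claims more than state justifies
    MATURITY = "maturity"              # implementation/registry/claim drift

@dataclass
class Mismatch:
    kind: MismatchClass
    detail: str

finding = Mismatch(MismatchClass.MATURITY,
                   "registry says closed, implementation is partial")
print(finding.kind.value)  # -> maturity
```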
The Three-Layer Structure

Layer A — Anchor Consistency
Layer A checks whether output agrees with a declared analytical or contractual anchor. The expected value is not "whatever the engine produced last time." It is "what the declared contract says must appear for this background, under this formulation."
Layer A tests agreement with a declared reference contract — not truth in some unconstrained universal sense. Analytical anchors depend on regime, normalization, and formulation. That distinction matters. Still, the engineering value is clear: reliability can remain intact while anchor-consistency fails. A Layer A anomaly means the output contradicts the contract the system claims to be using.
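A Layer A check can be sketched as a lookup against a declared contract. The anchor table shape, values, and tolerances here are illustrative assumptions:

```python
# Declared analytical/contractual anchors: what MUST appear for this
# background under this formulation -- not "whatever ran last time".
ANCHORS = {
    ("flat", "normalized"): {"expected": 0.0, "tol": 1e-9},
    ("curved", "normalized"): {"expected": 0.137, "tol": 1e-3},
}

def layer_a_anomaly(background, formulation, observed):
    """Return True if the output contradicts the declared contract."""
    contract = ANCHORS[(background, formulation)]
    return abs(observed - contract["expected"]) > contract["tol"]

print(layer_a_anomaly("curved", "normalized", 0.137))  # False: within contract
print(layer_a_anomaly("curved", "normalized", 0.25))   # True: contradicts it
```

Note that the anchor is keyed by regime and formulation, which is exactly why this tests agreement with a contract rather than unconstrained truth.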
Layer B — Interpretation Validity
Layer B checks whether the interpretation attached to the output stays within declared scope. This layer is deterministic — it does not rely on a free-form LLM judge. It uses explicit rule tables over structured runtime artifacts, maturity states, and report text.
Typical checks: does the report claim full closure while the path is still heuristic or partial; is a bounded approximation being described as exact; is an environment-conditional bridge being written up as universal; are overclaim phrases appearing where runtime state does not support them.
Layer B does not eliminate semantic ambiguity. What it does is narrow the problem from "solve rhetoric in general" to "enforce explicit admissibility contracts against declared model states." That makes it auditable. Not complete. Auditable.
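A rule-table check of this kind can be sketched in a few lines. The specific phrases and maturity states below are assumptions for illustration, not SPAR's actual tables:

```python
# Rule table: a report phrase is only admissible from certain maturity states.
OVERCLAIM_RULES = {
    "full closure": {"closed"},
    "exact": {"closed"},
    "universal": {"closed"},  # environment_conditional paths may not claim it
}

def layer_b_violations(report_text, maturity_state):
    """Deterministic check over report text: no free-form LLM judge,
    just declared admissibility contracts against declared model states."""
    text = report_text.lower()
    return [phrase for phrase, allowed in OVERCLAIM_RULES.items()
            if phrase in text and maturity_state not in allowed]

print(layer_b_violations("We report full closure of the bridge.", "partial"))
# -> ['full closure']
print(layer_b_violations("Bounded approximation within tolerance.", "partial"))
# -> []
```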
Layer C — Existence and Maturity Probes
Layer C asks what kind of implementation produced the result: genuine, approximate, gapped, environment-conditional, or research-only.
This is where SPAR becomes especially different from ordinary review. It does not merely score outputs. It checks the ontological status of the path that produced them. A result from a known-limited path is not the same thing as a result from a genuine path. A research probe is not production-grade closure. A dependency-bound bridge is not a universal capability. Those distinctions change what the output is allowed to claim.
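One way to make that operational is to derive the strongest admissible claim from the path's declared status. The mapping below is a sketch under assumed claim names, mirroring the states listed above:

```python
# Ontological status of the producing path, and the strongest claim it permits.
CLAIM_CEILING = {
    "genuine": "closure",
    "approximate": "bounded approximation",
    "gapped": "partial result",
    "environment_conditional": "conditional result",
    "research_only": "research probe",
}

def strongest_admissible_claim(path_state):
    """A Layer C probe: what is this output allowed to claim?"""
    return CLAIM_CEILING[path_state]

print(strongest_admissible_claim("research_only"))  # -> research probe
print(strongest_admissible_claim("genuine"))        # -> closure
```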
Why the Registry Matters

A framework like this needs more than score outputs. It needs structured state that can travel with the result.
A machine-readable registry turns caveat prose into runtime surface. That lets review results carry explicit maturity labels —
open, partial, closed, environment_conditional, research_only — rather than vague prose. Without that surface, approximation and closure collapse into the same sentence.
Scoring Policy Is Explicit
SPAR keeps score policy visible. No hidden learned weights.
These are not laws of nature. They are review policy. A hidden learned scorer may feel sophisticated. An explicit policy is easier to inspect, debate, and change.
Two or more Layer A anomalies trigger an unconditional REJECT regardless of total score. Mathematical contract failures are not averaged away by cleaner signals elsewhere.
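That rule can be stated as code rather than prose. In this sketch, only the "two or more Layer A anomalies force REJECT" rule comes from the text above; the score thresholds are invented for illustration:

```python
def verdict(total_score, layer_a_anomalies):
    """Explicit, inspectable verdict policy -- no hidden learned weights.

    Two or more Layer A anomalies are an unconditional REJECT:
    mathematical contract failures are not averaged away.
    """
    if layer_a_anomalies >= 2:
        return "REJECT"
    if total_score >= 0.8:   # illustrative threshold
        return "PASS"
    if total_score >= 0.5:   # illustrative threshold
        return "RECLASSIFY"
    return "REJECT"

print(verdict(0.95, layer_a_anomalies=2))  # -> REJECT, despite the high score
print(verdict(0.95, layer_a_anomalies=0))  # -> PASS
```

Because the policy is a plain function, disagreeing with it means editing a threshold in the open, not retraining a scorer.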
A Concrete Example: The Omega Score Transition

Flamehaven-TOE's primary governance metric, SIDRCE Omega, once relied on a large arbitrary multiplicative constant applied to the raw residual. The outputs looked stable. Nothing felt obviously broken.
SPAR still flagged it. Stability was not the right question. The stronger question was whether the formula justified the interpretation being attached to it. A free scaling constant with no physical derivation is not the same thing as a physically motivated model.
The formula was replaced with a chi-squared Gaussian construction. That change matters because it introduces a reversible relation to the underlying residual.
Given a reported score, the residual is recoverable. The score is no longer just a presentation layer. It encodes a falsifiable relation to the quantity beneath it.
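A minimal sketch of such a construction and its inversion. The exact tolerance structure of SIDRCE Omega may differ; this shows only the reversibility property the text describes:

```python
import math

def omega_score(residual, sigma):
    """Chi-squared Gaussian score: Omega = exp(-0.5 * (r / sigma)**2)."""
    return math.exp(-0.5 * (residual / sigma) ** 2)

def residual_from_score(omega, sigma):
    """Invert the score: |r| = sigma * sqrt(-2 * ln(Omega))."""
    return sigma * math.sqrt(-2.0 * math.log(omega))

r, sigma = 0.03, 0.05
omega = omega_score(r, sigma)
print(round(residual_from_score(omega, sigma), 10))  # recovers |r| = 0.03
```

An arbitrary multiplicative constant destroys this property: once the scale is free, a reported score no longer pins down the residual beneath it.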
SPAR did not respond by declaring the problem solved. It updated the classification precisely: the formula ceased to be arbitrary, but the remaining gap shifted to the tolerance scales, which are still calibration parameters. That is a narrower and more honest claim than either "arbitrary" or "fully resolved."
That is exactly the kind of distinction SPAR is built to review.
Quick Start
The review result carries more than pass/fail: a verdict, a score, and a maturity-aware review surface. That surface is what makes the output governable rather than just evaluable.
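A usage sketch in that spirit. The class and field names below are illustrative assumptions standing in for SPAR's result object, not its confirmed API:

```python
from dataclasses import dataclass, field

# Illustrative stand-in for what a SPAR review result carries.
@dataclass
class ReviewResult:
    verdict: str                                 # e.g. "PASS", "RECLASSIFY", "REJECT"
    score: float                                 # under the explicit score policy
    surface: dict = field(default_factory=dict)  # maturity-aware review surface

result = ReviewResult(
    verdict="RECLASSIFY",
    score=0.72,
    surface={"maturity": "partial", "anchors_checked": 3,
             "overclaims": ["full closure"]},
)

# The surface, not the bare score, is what downstream governance consumes.
print(result.verdict, result.surface["maturity"])
```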
Where This Framework Fits
Physics remains the first adapter and strongest early testbed. The review pattern is broader.
It fits anywhere outputs can pass while the attached claim can drift: scientific computing pipelines, PDE and simulation workflows, scientific ML surrogates, inverse and calibration models, AI code review, model governance, regulated analytics and reporting.
That does not mean every team needs the full framework. Often the first useful step is smaller.
A Lightweight Adoption Path

Level 1 — Claim Check. Add three explicit questions to an existing workflow: What is the output actually claiming? Does that claim match the implementation state? Is this result exact, approximate, partial, or heuristic? Most teams can do this immediately with no new tooling.
Level 2 — Maturity Labels. Attach state labels to results: heuristic, partial, closed, environment_conditional. A small registry. Already a meaningful step beyond ordinary review.
Level 3 — Full SPAR. Layer A anchor consistency, Layer B interpretation validity, Layer C existence and maturity probes, registry-backed snapshots, explicit score and verdict policy.
SPAR can be used as a review habit before it is adopted as a full framework.
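Levels 1 and 2 together fit in a few lines. A minimal sketch, assuming a plain dict as the registry; the entry names other than `beta_G_norm` are illustrative:

```python
# Level 2: a small maturity registry -- state labels that travel with results.
LABELS = {"heuristic", "partial", "closed", "environment_conditional"}

REGISTRY = {
    "beta_G_norm": "partial",
    "omega_score": "closed",
    "bridge_v2": "environment_conditional",
}

def label_for(result_name):
    """The Level 1 habit, made mechanical: refuse unlabeled results."""
    label = REGISTRY.get(result_name)
    if label not in LABELS:
        raise ValueError(f"{result_name}: no admissible maturity label declared")
    return label

print(label_for("beta_G_norm"))  # -> partial
```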
What SPAR Does Not Do
SPAR does not provide a universal truth engine, free-form LLM judging in the core, domain contracts inside the generic kernel, or certainty about whether a scientific claim is true in all possible senses.
SPAR is not a machine for declaring truth. Its narrower goal is to make claim drift reviewable.
Reliability Is Not Enough

Reliability asks whether a system produces stable, repeatable outputs. Admissibility asks whether those outputs deserve the meanings attached to them.
A stub that always returns zero can be reliable. A heuristic threshold can be reliable. A smoothly calibrated score can be reliable. None of those facts alone makes the resulting claim justified.
Current AI and scientific tooling is already better at measuring reliability than admissibility. That asymmetry is understandable — reliability is easier to benchmark, easier to automate, easier to ship in CI. But admissibility is where silent approximations, overstated claims, and maturity mismatches accumulate.
SPAR is one working answer to that problem. Not a universal answer. A technical one.
It turns implementation state, maturity state, analytical anchoring, and scope honesty into review objects that can travel with the result.
That is why the architecture may matter outside the domain that produced it.
Repository: github.com/flamehaven01/SPAR-Framework