Flamehaven.space
From Fail-Closed Blocking to Reproducible PASS/BLOCK Separation (EXP-032B)

A validation study showing how EXP-032B achieved reproducible PASS/BLOCK separation across A/B/C control arms by patching false-blocking causes, improving observability, and measuring replay drift under observer-shadow conditions.

Series

RExSyn Nexus-Bio, Part 8 of 10

A root-cause repair and re-measurement study (observer-shadow scope) with anti-leakage checks, replay drift comparison, and artifact-first reporting.


How To Read This

If our internal module names are unfamiliar, read the system as five roles:
  1. structure validation stack
  2. reasoning agents
  3. arbitration layer
  4. clinical governance gate
  5. audit trail
I will mention internal names only after the role is clear.
This article reports a scoped experimental result: EXP-032B (RCA/Fix + observer-shadow validation).

Table of Contents

  1. Why EXP-032B Exists
  2. The Core Experimental Question
  3. What Changed in EXP-032B
  4. Minimal Architecture
  5. What We Actually Repaired (RCA)
  6. The Main Result (Measured, Scoped)
  7. 3-Agent Disagreement: What the Data Showed
  8. Why This Result Is Inspectable
  9. Educational Code (Sanitized)
  10. Sanitized Real Artifact Example
  11. CCGE in Practice
  12. Sydney Lens in Practice
  13. What This Article Does Not Claim
  14. GitHub Release (Artifacts + Sanitized Code)
  15. Why This Matters
  16. Next: EXP-033
  17. Appendix: Internal Name Map

Why EXP-032B Exists (and How It Connects to Earlier Posts)

This work follows three earlier threads:
  1. Chaos Engineering for AI: Validating a Fail-Closed Pipeline with Fake Data and Math
      • We showed the pipeline can safely fail (BLOCK) under synthetic garbage inputs.
  2. From 97% Model Accuracy to 74% Clinical Reliability: Building RSN-NNSL-GATE-001
      • We framed the governance problem as an end-to-end reliability problem, not a single-model accuracy problem.
  3. Trinity Protocol Part 2: When Adding Chai-1 and Boltz-2 Exposed Hidden Model Disagreement
      • We showed that model disagreement is often signal, not noise.
Those experiments answered:
  • Can the system fail safely?
  • Can it detect disagreement?
They did not answer the next practical question:
Can the pipeline separate pass-worthy vs block-worthy cases reproducibly?
That is the core question of EXP-032B.
From fail-safe to functionally useful

The Core Experimental Question

Core Question

Fail-closed behavior is necessary, but not sufficient.
A pipeline that blocks everything may be safe in one sense, but not usable.
So EXP-032B tested:
  • Can we block BLOCK_EXPECTED samples?
  • Can we pass PASS_ELIGIBLE samples?
  • Can we show the result with reproducible artifacts (not just a narrative)?
We used a labeled control setup and executed arm_a / arm_b / arm_c to test cross-arm consistency.
| Arm | Primary | Validators | Status |
|-----|---------|------------|--------|
| A | AF3 | AF2 | The baseline |
| B | AF3 | Boltz-2 + Chai-1 | The kitchen sink |
| C | AF2 | Boltz-2 + Chai-1 | The control (what if AF3 is the problem?) |

What Changed in EXP-032B (High Level)

This was not a threshold-tuning exercise.
It was a root-cause repair plus re-measurement experiment.
Instead of lowering thresholds until a PASS appeared, we:
  • identified why PASS rows were being blocked
  • patched the specific cause
  • re-ran and re-measured
  • repeated until pre-defined checks were satisfied for this scope (RCA movement was explainable, invariants passed, replay-drift artifacts were generated, and PASS/BLOCK labels were stable across A/B/C on the control set)
This distinction matters because it keeps failure attribution specific and the audit trail reconstructable.

Minimal Architecture (Role First, Internal Names Second)

Figure 1: Role-first architecture diagram

1) Structure Validation Stack

Multiple structure/model outputs are treated as cross-checking hypotheses, not a single source of truth.

2) Reasoning Agents (3 independent channels)

Three independent biomedical reasoning agents run in parallel.
Internal names:
  • IRF
  • AATS
  • HRPO-X

3) Arbitration Layer (internal: LawBinder)

This layer monitors disagreement and either synthesizes or escalates.

4) Clinical Governance Gate (internal: CCGE)

This is the formal governance module based on the RSN-NNSL-GATE-001 line of work, originally designed by Claire Hast (Founder, H3R.Tech).
It evaluates component floors, end-to-end reliability (p_e2e), and governance conditions.

5) Structural Skepticism Lens (internal: Sydney Lens)

An observer lens used to preserve expert-style skepticism around disagreement and uncertainty. It is not a ground-truth oracle.
The framing of this lens was inspired by the scientific rigor and domain skepticism of Sydney Gordon (Principal Scientist, Antibody & ADC Sciences, Immunome).
System Component Legend

What We Actually Repaired (Root-Cause Sequence)

Table 1: Root-cause patch map

| Layer | Failure symptom | Patch action | Observed effect in RCA loop | Remaining risk |
|-------|-----------------|--------------|------------------------------|----------------|
| Evidence pairing | PASS rows evaluated with misaligned evidence | Provenance checks + tighter pairing rules | Artificial false blocking removed | Broader replay coverage needed |
| Candidate ranking | Path collapsed to ranked=0 | Span granularity + query-anchored candidates | Non-zero ranked outputs restored | L3 strict grounding unresolved |
| Bio-domain signal (NNSL) | Protein-sequence inputs caused signal collapse | Bio-domain patch + YAML calibration | Stable PASS/BLOCK separation in B-track | Production calibration pending |
| Arbitration observability | Disagreement routing was opaque | Richer snapshots + escalation taxonomy | Soft-discord vs standard escalation now inspectable | LawBinder production alignment open |
We found and patched multiple real causes of false blocking:

A. Upstream evidence/provenance mismatches

PASS evaluations could be assessed with poorly aligned evidence artifacts.
What changed:
  • provenance checks
  • real-evidence pair prechecks
  • tighter sample/evidence pairing rules
Measured impact in the RCA loop:
  • removed a major source of artificial false blocking in PASS rows
  • made downstream governance failures interpretable as real component/gate issues instead of pairing noise

B. Missing-link candidate generation/ranking bottlenecks

An upstream inference path was collapsing to zero candidates under realistic settings.
What changed:
  • evidence span granularity improvements
  • query-anchored candidate text
  • formatting-noise reduction in candidate construction
Measured impact in the RCA loop:
  • moved upstream candidate/rank path from ranked=0 collapse to non-zero ranked outputs in probe runs
  • enabled real evidence injection into downstream observer-shadow validation

C. Bio-domain signal path suitability (NNSL path)

A bio-domain signal path was effectively acting like a toy mapping and behaved poorly for protein-sequence inputs.
What changed:
  • bio-domain path patch for protein-sequence usage
  • corrected signal propagation
  • YAML-based calibration in place of ad-hoc overrides
Measured impact in the RCA loop:
  • eliminated pathological signal collapse behavior in the bio-domain path
  • restored reproducible PASS/BLOCK separation without relying on ad-hoc CLI overrides

D. Governance/arbitration observability

Some signals were over-compressed or difficult to inspect downstream.
What changed:
  • richer bridge signal snapshots
  • escalation taxonomy (soft-discord vs harder conditions)
  • explicit observer-shadow observability
Measured impact in the RCA loop:
  • made disagreement routing inspectable (soft-discord vs standard escalation)
  • enabled non-binding shadow validation with leakage checks and replay drift comparisons
  • made escalation categories operationally meaningful in observer mode (bounded validation candidate vs hold-for-review)

Causal Bridge: How These Repairs Relate to the Final PASS/BLOCK Result

The final control-set PASS/BLOCK separation should not be read as the effect of any single patch.
In this experiment, the repairs played different roles:
  • A (evidence/provenance pairing) removed artificial blocking noise and made downstream failures attributable
  • B (missing-link candidate/rank path) restored usable upstream evidence flow for observer-shadow evaluation
  • C (bio-domain signal path) removed unstable signal behavior and eliminated dependence on ad-hoc overrides
  • D (governance/arbitration observability) made routing, leakage, and drift observable enough to trust the measured result
So the final metric outcome (balanced_accuracy = 1.0 on this control set) is best read as a stack-level repaired behavior, not a single-component win.

The Main Result (Measured, Scoped)

Reproducible PASS/BLOCK Separation on the Labeled Control Set (A/B/C)

Routing Pathways

Under EXP-032B observer-shadow conditions, we reproduced PASS/BLOCK separation across arm_a, arm_b, and arm_c.
Control-set size (important context):
  • n=2 labeled samples (1 PASS_ELIGIBLE, 1 BLOCK_EXPECTED)
  • 6 arm-level observations total (A/B/C for each sample)
  • this is a control-set validation result, not a generalization estimate
  • the purpose of this control set is behavioral reproducibility across pipeline versions (and across arms), not statistical generalization
  • for that scope (binary control behavior + cross-arm stability), n=2 is sufficient to test whether a version preserves or breaks the intended PASS/BLOCK routing behavior
Cross-arm consistency (A/B/C):
  • PASS sample remained PASS in all three arms
  • BLOCK sample remained BLOCK in all three arms
  • no arm-specific flip was observed in this measured set
  • arm-level benchmark metrics matched the sample-level classification outcomes on this control set (balanced_accuracy = 1.0 across the 6 arm-level rows)
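The cross-arm consistency check described above can be sketched in a few lines. This is a minimal educational sketch, not the production harness; the record layout (`sample_id`, `arm`, `verdict`) is a hypothetical simplification of the artifact schema.

```python
# Minimal sketch of cross-arm stability checking (illustrative record
# layout; not the production schema). A "flip" is a sample whose
# verdict differs between arms.

def check_cross_arm_stability(rows):
    """rows: dicts with 'sample_id', 'arm', and 'verdict' ('PASS'/'BLOCK')."""
    verdicts_by_sample = {}
    for row in rows:
        verdicts_by_sample.setdefault(row["sample_id"], set()).add(row["verdict"])
    # Any sample with more than one distinct verdict flipped across arms.
    return [sid for sid, v in verdicts_by_sample.items() if len(v) > 1]

rows = (
    [{"sample_id": "s_pass", "arm": a, "verdict": "PASS"} for a in ("arm_a", "arm_b", "arm_c")]
    + [{"sample_id": "s_block", "arm": a, "verdict": "BLOCK"} for a in ("arm_a", "arm_b", "arm_c")]
)
assert check_cross_arm_stability(rows) == []  # no arm-specific flips on this set
```

On the measured EXP-032B control set, the equivalent check over the six arm-level rows returned no flips, which is what "no arm-specific flip was observed" means operationally.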
Measured results (scoped setup):
  • dangerous_pass_rate = 0.0
  • false_reject_rate = 0.0
  • balanced_accuracy = 1.0
Labeling / evaluation context (important for interpreting the perfect score):
  • labels were pre-registered before reruns (PASS_ELIGIBLE, BLOCK_EXPECTED)
  • control-set manifest was used as the evaluation source of truth (artifact-first workflow)
  • labels were recorded as expert control labels in the manifest (expert_structural_label, rationale, confidence metadata)
  • this is a control-set reproducibility result, not a train/test generalization claim
This is the central result of EXP-032B.
It is a scoped validation result under observer-shadow conditions.

What We Learned About 3-Agent Disagreement (Important Correction)

A simple reading might say:
  • AATS and HRPO-X both look relatively high, while IRF is the stricter (lower-scoring) signal
  • therefore IRF appears to be the main dissenter
Our measurements did not support that simplification.
What we observed on the measured set:
  • LawBinder escalated all rows as discord-only (soft-discord)
  • HRPO-X was not the outlier
  • HRPO-X was often close to the AATS/IRF score geometry mean
  • yet it still received top fallback weight under conflict handling
  • in score-gap terms, diff_aats_irf was the largest gap, while hrpo_vs_aats_irf_mean remained small on this control set
  • pairwise distances also show HRPO-X sits between AATS and IRF (not as a clean two-model bloc with either side)
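Read plainly, the gap quantities above can be reconstructed as follows. This is a hedged reading of the metric names (`diff_aats_irf`, `hrpo_vs_aats_irf_mean`); the exact production definitions are internal.

```python
# Illustrative reconstruction of the score-gap quantities named in the
# text. The production definitions are internal; this is the plain
# reading of the metric names.

def score_gaps(aats: float, irf: float, hrpo: float):
    diff_aats_irf = abs(aats - irf)                       # AATS vs IRF gap
    hrpo_vs_aats_irf_mean = abs(hrpo - (aats + irf) / 2)  # HRPO-X vs the AATS/IRF mean
    pairwise = {                                          # does HRPO-X sit between the two?
        "aats_irf": abs(aats - irf),
        "hrpo_aats": abs(hrpo - aats),
        "hrpo_irf": abs(hrpo - irf),
    }
    return diff_aats_irf, hrpo_vs_aats_irf_mean, pairwise
```

With hypothetical scores such as aats=0.9, irf=0.5, hrpo=0.7, the AATS/IRF gap is the largest pairwise distance while HRPO-X sits at their mean, matching the "between, not outlier" geometry described above.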
That shifted the interpretation:
  • the main issue is not "HRPO-X is rogue"
  • the issue is how disagreement (discord) is computed and consumed

What the Data Showed About HRPO-X (Observer Mode)

Based on the measured score geometry and arbitration outputs in this control set, HRPO-X is better modeled as a structured critic/adversarial signal in observer mode than as a simple outlier vote.
This lets disagreement remain visible without forcing every soft-discord case into the same interpretation.
We added a non-binding observer shadow layer:
  • SHADOW_SOFT_ESCALATE_BOUNDED
  • SHADOW_STANDARD_ESCALATE
In the measured EXP-032B set:
  • PASS rows mapped to bounded soft escalation (observer-only hint)
  • BLOCK rows mapped to standard escalation
  • no false bounded-escalation on block rows in the measured set
This is an observer interpretation layer, not a production policy switch.
Figure 2: Critic-channel shadow routing

Why This Result Is Inspectable (Not Just a Metric)

Replay Drift & Artifacts

EXP-032B was designed to make the result explainable, not only reducible to headline metrics.

1) Non-binding invariant checks

We verify that shadow outputs do not overwrite operational verdict fields.

2) Dual-record disagreement metrics

We record two discord paths side by side:
  • normalized path
  • raw-text-stable comparison path
This makes disagreement metric drift observable.
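As a sketch of the dual-record idea: the same disagreement is computed two ways and stored side by side, so divergence between the two paths is itself observable. Both computations below are illustrative stand-ins, assuming a normalized-score path and a raw-text divergence path; they are not the internal metric definitions.

```python
# Illustrative dual-record discord: the same disagreement computed via
# two paths (normalized scores vs raw-text-stable comparison) and
# recorded side by side, so metric drift between the paths stays visible.

def dual_record_discord(scores, raw_texts):
    normalized = max(scores) - min(scores)        # normalized-score path (assumed definition)
    raw_stable = float(len(set(raw_texts)) > 1)   # raw-text-stable path: any textual divergence?
    return {
        "discord_normalized": normalized,
        "discord_raw_text_stable": raw_stable,
    }
```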

3) Replay drift comparisons

We compare runs across versions to detect behavioral and metric drift after patches.

4) Legacy carry-over contract

We preserved key evidence/reporting fields from earlier chaos experiments and checked them explicitly.
This ensures improved results do not come at the cost of reduced transparency.

Educational Code (Sanitized, IP-Safe)

Validation Logic Snippets

Below are simplified educational snippets that reflect the validation patterns used in EXP-032B.
These are not production implementations. They are included to make the logic auditable and easier to review.

1) Labeled PASS/BLOCK Benchmark (Control-Set Evaluation)
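A minimal version of this benchmark might look like the following. The row schema is illustrative, but the metric definitions match the names used in this article (dangerous_pass_rate, false_reject_rate, balanced_accuracy).

```python
# Educational sketch of the control-set benchmark (illustrative row
# schema, not the production implementation).

def benchmark(rows):
    """rows: dicts with 'label' (PASS_ELIGIBLE/BLOCK_EXPECTED) and 'verdict' (PASS/BLOCK)."""
    tp = sum(1 for r in rows if r["label"] == "PASS_ELIGIBLE" and r["verdict"] == "PASS")
    fn = sum(1 for r in rows if r["label"] == "PASS_ELIGIBLE" and r["verdict"] == "BLOCK")
    tn = sum(1 for r in rows if r["label"] == "BLOCK_EXPECTED" and r["verdict"] == "BLOCK")
    fp = sum(1 for r in rows if r["label"] == "BLOCK_EXPECTED" and r["verdict"] == "PASS")
    # dangerous_pass_rate: block-worthy rows that passed (the unsafe direction)
    dangerous_pass_rate = fp / max(fp + tn, 1)
    # false_reject_rate: pass-worthy rows that were blocked (the unusable direction)
    false_reject_rate = fn / max(fn + tp, 1)
    # balanced_accuracy: mean of per-class recall
    balanced_accuracy = 0.5 * (tp / max(tp + fn, 1) + tn / max(tn + fp, 1))
    return dangerous_pass_rate, false_reject_rate, balanced_accuracy
```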

2) Non-Binding Invariant Check (Shadow Must Not Override Operational Verdicts)
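The invariant is simple to state: after the shadow layer runs, operational verdict fields must be byte-identical. The helper name check_non_binding_invariant() and the governance_status field follow the payload mapping given later in this section; the payload shape itself is a simplification.

```python
# Sketch of the non-binding invariant: the observer shadow layer may add
# its own fields, but must never mutate the operational verdict.
# Payload shape is illustrative.

def check_non_binding_invariant(payload_before, payload_after):
    """True iff operational governance fields are untouched by the shadow layer."""
    return payload_before["governance_status"] == payload_after["governance_status"]
```

Run before and after the shadow pass, this catches leakage where an observer-only hint silently rewrites a PASS/BLOCK decision.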

3) Critic-Channel Shadow Rule (Observer-Only Routing Hint)
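The routing hint reduces to a small rule: soft-discord maps to the bounded hint, anything harder to standard escalation. The two route names are the ones used in this article; the thresholding on escalation kind is a placeholder for the internal rule.

```python
# Observer-only routing hint (non-binding). Maps LawBinder's escalation
# kind to a shadow route; the mapping condition is a simplified
# placeholder for the internal rule.

SHADOW_SOFT_ESCALATE_BOUNDED = "SHADOW_SOFT_ESCALATE_BOUNDED"
SHADOW_STANDARD_ESCALATE = "SHADOW_STANDARD_ESCALATE"

def shadow_route(escalation_kind: str) -> str:
    if escalation_kind == "soft-discord":
        return SHADOW_SOFT_ESCALATE_BOUNDED   # bounded validation candidate
    return SHADOW_STANDARD_ESCALATE           # hold-for-review
```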

4) Replay Drift Compare (Metric Drift Without Hiding Behavior)
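A minimal drift compare walks two disagreement snapshots and reports per-field deltas above a tolerance, without suppressing anything. The function name compare_disagreement_snapshots() follows the payload mapping given later in this section; the flat-dict snapshot shape is an assumption.

```python
# Sketch of replay drift comparison between two runs' disagreement
# snapshots (flat dicts of metric name -> float; shape is illustrative).

def compare_disagreement_snapshots(run_old, run_new, tol=1e-6):
    """Return {metric: delta} for every shared float metric that moved more than tol."""
    drift = {}
    for key, old_value in run_old.items():
        if key in run_new and isinstance(old_value, float):
            delta = run_new[key] - old_value
            if abs(delta) > tol:
                drift[key] = delta
    return drift
```

The point is observability, not judgment: drift is reported, and deciding whether it is metric drift or behavior drift is left to the reviewer.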

These examples are intentionally simplified, but they show the central idea of EXP-032B:
  • separate operational verdicts from observer shadow logic
  • make drift measurable
  • keep the validation logic inspectable
The next section shows a sanitized payload example so readers can map these educational snippets to the actual field structure used in the experiment artifacts.
In particular:
  • check_non_binding_invariant() maps to governance_status.* and critic_channel_shadow_assessment.*
  • compare_disagreement_snapshots() maps to disagreement fields exposed under lawbinder_signal_snapshot.*
Table 2: Educational code vs production role mapping

| Educational Snippet | Production Role (Conceptual) | What It Validates | What It Does NOT Claim |
|---------------------|------------------------------|-------------------|------------------------|
| Labeled PASS/BLOCK benchmark | Control-set verdict evaluation and confusion-matrix style scoring | PASS/BLOCK metric definitions (dangerous_pass_rate, false_reject_rate, balanced_accuracy) | Population-level performance, calibration quality, or production generalization |
| Non-binding invariant check | Shadow-layer leakage guard | Shadow outputs do not overwrite operational verdict fields | End-to-end correctness of operational policy or clinical safety |
| Critic-channel shadow rule | Observer-only disagreement routing hint | How soft-discord can be translated into bounded validation vs standard escalation (non-binding) | Final/production arbitration policy, regulatory acceptance, or enforce-mode behavior |
| Replay drift compare | Regression observability for disagreement metrics | Metric drift vs behavior drift tracking across runs/patches | Root-cause attribution by itself or canonical discord metric selection |

Sanitized Real Artifact Example (Output, Not Pseudocode)

To reduce ambiguity around internal names, here is a sanitized example of the kind of payload fields we actually inspect in EXP-032B (values shown are representative of the measured control-set runs):
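Because the payload itself cannot be reproduced verbatim here, the block below is an illustrative field-structure sketch assembled from the field names referenced in this article (governance_status.*, critic_channel_shadow_assessment.*, lawbinder_signal_snapshot.*, expert_structural_label). All values are placeholders, not measured artifacts.

```json
{
  "sample_id": "s_pass",
  "expert_structural_label": "PASS_ELIGIBLE",
  "governance_status": {
    "verdict": "PASS",
    "blockers": []
  },
  "critic_channel_shadow_assessment": {
    "shadow_route": "SHADOW_SOFT_ESCALATE_BOUNDED",
    "binding": false
  },
  "lawbinder_signal_snapshot": {
    "escalation_kind": "soft-discord",
    "discord_normalized": 0.12,
    "discord_raw_text_stable": 0.12
  }
}
```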
What this block is intended to show:
  • the internal component names are actual payload fields, not presentation labels
  • operational verdicts and observer shadow hints are explicitly separated
  • disagreement drift observability (dual-record) is recorded alongside the decision output

Where CCGE Fit in This Experiment (Practical Use Case)

In a previous post, I described the governance gate conceptually (RSN-NNSL-GATE-001).
In EXP-032B, the formal module implementation (CCGE, CareChainGovernanceEngine) was used in a real observer-shadow workflow:
  • component floors
  • p_e2e structure
  • blocker tracing during RCA iterations
  • pass/block explanation support
This was useful because it separated:
  • reasoning disagreement
  • from governance-level reliability failures
That separation made the patches more precise.

Sydney Lens in Practice (Why It Matters Here)

As introduced in the architecture section, Sydney Lens is the observer lens we used to keep expert-style skepticism visible while repairing and re-measuring PASS/BLOCK behavior.
In this experiment, its practical role was simple:
  • do not treat disagreement as noise just because one score looks high
  • preserve bounded-validation routing context while avoiding premature confidence

What This Article Does Not Claim

This article reports completion of EXP-032B (RCA/Fix + observer-shadow validation), not final production closure of EXP-032.
Still unresolved / deferred:
  • final frozen-track closure (EXP-032A)
  • strict L3 grounding requirements
  • production arbitration alignment (LawBinder still escalates in these rows)
  • canonical disagreement metric selection (we are in a dual-record observation period)

GitHub Release (Artifacts + Sanitized Code)

Within a couple of days of publication, we will publish:
  • the measured JSON artifacts (selected and organized)
  • a sanitized, IP-safe educational code subset
  • reproducibility-oriented reporting/check scripts
Goal of the release:
  • reproducibility review
  • methodology inspection
  • decision-trace transparency
The release will preserve reproducibility and decision traces while excluding IP-sensitive implementation details and environment-specific secrets.
Publication discipline:
  • If the GitHub release slips past the target window, we will update this post with a dated status note rather than leaving the timeline ambiguous.

Why This Matters

The most important result of EXP-032B is not only the PASS/BLOCK split.
It is that we can now show, with artifacts:
  • what changed
  • why it changed
  • what remained unchanged
  • what is still unresolved
That is a stronger foundation than a single headline metric.
As in the earlier Trinity work, model disagreement remained signal, not noise; the difference in EXP-032B is that we could route and audit that signal without collapsing the entire result into a single opaque escalation story.

Next: EXP-033


EXP-033 starts from this locked carryover baseline and focuses on arbitration alignment:
  • soft vs hard escalation separation
  • critic-channel routing rules
  • disagreement metric hardening/comparison
  • maintaining the same anti-leakage and replay-drift discipline
We are treating EXP-032B as a validation milestone, not a finish line.
Figure 3: EXP-033 plan ladder

Appendix: Internal Name Map (Quick Reference)

  • arbitration layer -> LawBinder
  • clinical governance gate -> CCGE
  • structural skepticism lens -> Sydney Lens
  • internal reliability/drift-adjacent signals -> SR9DI2
  • observer shadow routing -> critic_channel_shadow_assessment
If you build scientific AI systems, I would be interested in your view on this:
How do you handle disagreement in multi-agent scientific reasoning without collapsing into either blind averaging or perpetual escalation?
