When Adding Chai-1 and Boltz-2 Exposed Hidden Model Disagreement (Trinity Protocol Part 2)


See how adding Chai-1 and Boltz-2 to an AlphaFold workflow exposed hidden model disagreement, increased drift, and revealed why failed convergence can be the most valuable signal in computational biology.

Series: RExSyn Nexus-Bio, Part 5 of 10
Trinity Protocol #2
In Part 1, I introduced our AF3/AF2 orchestration—a system designed to treat AI protein predictions not as gospel, but as hypotheses that need validation.
Two days ago, Sydney Gordon (Principal Scientist at Immunome) asked the question that sparked this entire sequel:
"Why not use Chai-1 and Boltz-2 as cross-validation?"
It was brilliant. We did exactly that.
The result? Instead of getting clearer consensus, we got structured chaos.
And that chaos turned out to be the most valuable data we've gathered all month.

A Hypothesis of Consensus

1) The Setup: Three Ways to Be Wrong

We wanted to see what happens when we throw Out-of-Distribution (OOD) targets at a multi-model ensemble.
The question: Who would agree with whom when nobody really knows the answer?
We set up three "arms":
| Arm | Primary | Validators | Status |
| --- | --- | --- | --- |
| A | AF3 | AF2 | The baseline |
| B | AF3 | Boltz-2 + Chai-1 | The kitchen sink |
| C | AF2 | Boltz-2 + Chai-1 | The control (what if AF3 is the problem?) |
Target: 52-residue protein-ligand complex (aspirin binding site)
The hook: We deliberately chose a low-homology target where we suspected models might hallucinate confidence.
All three arms shared:
  • Same sequence family
  • Same random seed (20260208)
  • Same governance (LawBinder in observer mode)
  • Same drift gate: ≤ 0.05
Observer mode means: "Watch what happens, but don't block anything. Collect data."

2) The Result: Disagreement is a Feature, Not a Bug

Usually in software, "failed convergence" sounds like a bug.
In biology, it's a signal.

Final Metrics:

| Arm | Final Drift |
| --- | --- |
| A (AF3 + AF2) | 0.10 |
| B (AF3 + Boltz-2 + Chai-1) | 0.17 |
| C (AF2 + Boltz-2 + Chai-1) | 0.25 |

The pattern: Adding more validators increased disagreement.
This was counterintuitive. We expected:
"More models → more consensus → lower drift"
What we got:
"More models → hidden disagreements exposed → higher drift"

The Moment of Realization

Inside Arm B
When I first saw Arm B's drift jump from 0.10 to 0.17, I thought:
"Did I break something?"
Then I looked at the validator scores:
Boltz was 2x more confident than everyone else on the same structure.
That's not noise. That's biology telling you something.
In a traditional pipeline, this gap would simply have been averaged away.
But that 0.74 vs 0.36 gap? That's a 2-month wet-lab delay if you choose wrong.

3) The Surprise: AlphaGenome Couldn't Fix It

Here's where it got really interesting.
AlphaGenome (our redesign engine) was watching this trainwreck in real-time. It said:
"I see low-confidence regions. Let me propose stabilizing mutations."

What AG Did:

Cycle 1 → 2:
  • Detected: residues 2, 4, 17 (pLDDT < 60)
  • Proposed: S2A, S4A, P17A (hydrophobic core stabilization)
  • Applied: sequence updated for next cycle
Cycle 2 → 3:
  • Detected: residues 18, 20, 21 (still problematic)
  • Proposed: P18A, P20A, V21L (further optimization)
  • Applied: sequence updated
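AG's detect-and-propose loop can be sketched roughly like this. This is a simplified stand-in, not the AlphaGenome implementation: the function name, the alanine-only substitution scheme, and the pLDDT values are illustrative; only the pLDDT < 60 cutoff and the Cycle 1 residue pattern come from the experiment above.

```python
def propose_alanine_scan(sequence: str, plddt: list[float],
                         cutoff: float = 60.0) -> list[str]:
    """Flag residues below the pLDDT cutoff and propose alanine
    substitutions (simplified stand-in for AG's redesign step)."""
    proposals = []
    for i, (aa, conf) in enumerate(zip(sequence, plddt), start=1):
        if conf < cutoff and aa != "A":          # skip residues already alanine
            proposals.append(f"{aa}{i}A")
    return proposals

# Made-up per-residue pLDDT values that reproduce the Cycle 1 pattern:
seq = "MSTS" + "A" * 12 + "P"
conf = [90, 50, 90, 55] + [90] * 12 + [58]
propose_alanine_scan(seq, conf)  # → ["S2A", "S4A", "P17A"]
```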
The twist?
Drift kept increasing despite AG's "fixes".
This told us something critical:
The target isn't just tricky—it's genuinely outside the training distribution of current AI models.
Sequence tweaks alone can't bridge fundamental model disagreement.
This is biology speaking, not code failing.
And catching this before wet-lab synthesis? That's months of saved time and reagent costs.

4) Why 0.07 is Actually Huge (Math + Biology)

In Arm A, AF3 (0.40) and AF2 (0.33) differed by 0.07.
You might think: "0.07? That's basically noise, right?"
Wrong.

The Context Math:

In high-confidence regime (>0.7):
  • 0.07 delta = ~10% relative gap
  • Models agree on the fold, just differ on details
In low-confidence regime (0.3–0.4):
  • 0.07 delta = 21% relative gap (0.07 / 0.33)
  • Models disagree on fundamental topology
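The regime arithmetic is trivial but worth making explicit. The low-regime numbers (0.40, 0.33) are Arm A's scores from above; the high-regime pair (0.77, 0.70) is an illustrative example, not a measured result:

```python
def relative_gap(a: float, b: float) -> float:
    """Relative disagreement between two confidence scores,
    measured against the smaller (more pessimistic) score."""
    return abs(a - b) / min(a, b)

relative_gap(0.40, 0.33)  # low-confidence regime: 0.07 delta → ~21% gap
relative_gap(0.77, 0.70)  # high-confidence regime: same 0.07 → ~10% gap
```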

What This Means in Biology:

When pTM is in the 0.3–0.4 range, you're typically seeing:
  • Loop placement uncertainty
  • Hinge region flexibility
  • Possible intrinsic disorder
  • Or: the model has no idea
A 21% disagreement here often maps to:
  • AF3 thinks: "This loop folds left"
  • AF2 thinks: "This loop folds right"
You shouldn't average those. You should stop.
Because if you average them and proceed to synthesis:
  • Best case: you waste time on a structure that needs refinement
  • Worst case: you validate a hallucination and build a pipeline around it

5) For Engineers: The Logic Under the Hood

For the AI researchers and devs reading this, here's the actual production logic we used.

5.1 The "Effective Drift" Calculation

We don't just subtract scores. We add a penalty for validator disagreement.
If Boltz and Chai are fighting with each other, the system becomes extra cautious.
Why this matters: Traditional pipelines ignore validator variance. We penalize it.
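Since the original snippet didn't survive the export, here is a minimal sketch of the idea: distance to validator consensus, plus a penalty that grows with validator variance. The function name, the penalty weight, and the primary score are assumptions; only the penalize-disagreement concept and the Boltz/Chai values are from the experiment:

```python
import statistics

def effective_drift(primary: float, validators: list[float],
                    penalty_weight: float = 0.5) -> float:
    """Distance from primary to validator consensus, plus a penalty
    that grows when the validators disagree among themselves."""
    consensus = statistics.mean(validators)
    disagreement = statistics.pstdev(validators)  # zero when validators agree
    return abs(primary - consensus) + penalty_weight * disagreement

# Arm B-style inputs: Boltz 0.74 vs Chai 0.36 (primary score illustrative).
# The large validator spread inflates the drift even though the
# consensus (0.55) sits close to the primary score.
effective_drift(0.40, [0.74, 0.36])
```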

5.2 Detecting Directional Divergence

Random noise goes up and down.
Structured divergence moves in one direction.
This is how we distinguished real divergence from measurement jitter in Arm A:
If drift oscillates (0.07 → 0.04 → 0.09), it's probably implementation noise.
If drift increases monotonically (0.07 → 0.09 → 0.10), the models fundamentally disagree.
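The oscillation-vs-monotone test is a few lines (a sketch; the real pipeline's function name and any windowing are not shown in the post):

```python
def is_directional(drift_series: list[float]) -> bool:
    """Structured divergence moves one way across cycles;
    measurement jitter oscillates."""
    deltas = [b - a for a, b in zip(drift_series, drift_series[1:])]
    return all(d > 0 for d in deltas) or all(d < 0 for d in deltas)

is_directional([0.07, 0.09, 0.10])  # True  → models fundamentally disagree
is_directional([0.07, 0.04, 0.09])  # False → probably implementation noise
```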

5.3 The Boltz CIF Fallback (Production Reality)

In production, things break. Boltz didn't always give us a clean confidence_standard.json.
We could've:
  1. Dropped Boltz as a validator → lose diversity
  2. Crashed the pipeline → lose the experiment
Instead, we built a traceable fallback:
Result: Chain A pLDDT = 76.01 → PTM proxy = 0.76
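The fallback code itself didn't survive the export; here is a minimal sketch of the idea, assuming Boltz writes per-atom pLDDT into the mmCIF `B_iso_or_equiv` (B-factor) column, as AlphaFold-family tools conventionally do. The column handling is simplified, not the production parser:

```python
import statistics

def ptm_proxy_from_cif(cif_text: str, chain: str = "A") -> float:
    """Fallback when confidence JSON is missing: average the per-atom
    pLDDT values stored in the mmCIF B-factor column, scale to [0, 1]."""
    cols: list[str] = []
    plddts: list[float] = []
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            cols.append(line.split(".", 1)[1])        # collect loop column names
        elif cols and line.startswith(("ATOM", "HETATM")):
            row = dict(zip(cols, line.split()))
            if row.get("auth_asym_id") == chain:
                plddts.append(float(row["B_iso_or_equiv"]))
    mean_plddt = statistics.mean(plddts)              # e.g. 76.01 for chain A
    return round(mean_plddt / 100.0, 2)               # pLDDT 76.01 → proxy 0.76
```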
Crucially, we stored full provenance for every fallback decision.

Why this matters: Anyone can reproduce our decision tree.

6) The Laptop That Ran This Experiment

One of the most exciting parts? The infrastructure.
We didn't use a GPU cluster. We didn't rent a cloud farm.
Local orchestration: ASUS TUF FX505GD
  • CPU: Intel i7-8750H
  • GPU: NVIDIA GTX 1050 (4GB) ← not even used for orchestration
  • RAM: 16GB DDR4
Model execution: Free/low-cost cloud services
  • AF3: AlphaFold Server (free web API)
  • AF2: Google Colab (free tier, ColabFold)
  • Boltz: Google Colab (free tier, official notebook)
  • Chai: Chai Discovery API (free)
Total cost: ~$12 in cloud credits

The Insight:

Your laptop doesn't run the models.
Your laptop coordinates them.
The expensive inference happens elsewhere. Your machine handles:
  • Pipeline logic
  • Governance (LawBinder)
  • Redesign proposals (AlphaGenome)
  • Artifact management
This removes the infrastructure barrier for any biotech team.

7) What's Next: Adaptive Gates (EXP-032)

EXP-031 taught us: one threshold doesn't fit all confidence regimes.

The Problem:

Using a single gate (0.05) treats the high-confidence and low-confidence regimes the same. That's wrong.

The Solution: Adaptive Thresholds

EXP-032 will test:
  • 25 in-distribution samples (expect convergence)
  • 25 OOD stress samples (expect divergence)
  • Adaptive gates based on primary model confidence
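One plausible shape for such a gate is to scale the allowed absolute drift with the primary model's confidence, so the relative bar stays constant across regimes. This is a hypothesis for EXP-032, not its published design; the 10% default echoes the relative-gap math from section 4:

```python
def adaptive_gate(primary_confidence: float,
                  relative_tolerance: float = 0.10) -> float:
    """Allowed absolute drift scales with confidence, so the same
    ~10% relative disagreement is tolerated in every regime."""
    return relative_tolerance * primary_confidence

adaptive_gate(0.70)  # high-confidence regime → gate 0.07
adaptive_gate(0.33)  # low-confidence regime  → gate ~0.033
```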

8) The Rexsyn Value Proposition: Managing Biological Risk

Most pipelines ask:
"What's the highest confidence score we can justify?"
We ask:
"Do models agree? If not, what is that disagreement telling us?"
When Boltz says 0.74 and Chai says 0.36 on the same structure:
Standard approach: average → 0.50 → "acceptable" → proceed to synthesis
Rexsyn approach: 2x gap → "investigate" → flag for human review
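That branch can be stated as a decision rule. The 2x ratio gate comes from the post; the function name and structure are illustrative, not the Rexsyn API:

```python
def triage(validator_scores: dict[str, float], ratio_gate: float = 2.0) -> str:
    """Flag for human review when validators span a >= ratio_gate
    multiplicative gap, instead of silently averaging the scores."""
    hi, lo = max(validator_scores.values()), min(validator_scores.values())
    if lo > 0 and hi / lo >= ratio_gate:
        return "investigate"
    return "proceed"

triage({"boltz2": 0.74, "chai1": 0.36})  # → "investigate" (0.74 / 0.36 ≈ 2.06)
```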

Why This Matters for Biotech Teams:

Rexsyn doesn't just run models. It manages biological risk.
By detecting disagreement early, we save teams from:
  • Months of failed wet-lab validation (synthesize → test → fail → redesign)
  • Wasted reagent costs ($500–$5,000 per protein construct)
  • Opportunity cost (chasing wrong leads instead of promising candidates)

The Economics:

The difference: We fail fast in silico, not slow in the lab.

The Trust Factor:

In drug discovery, confidence theater is expensive.
Reporting "0.85 confidence" doesn't help if:
  • Model A says 0.95
  • Model B says 0.75
  • They disagree on which loop folds where
We built Rexsyn to say "I don't know" when models disagree.
Because in biology, admitting uncertainty is more valuable than hallucinating confidence.

9) Reproducibility Checklist

Required Artifacts:

Key Settings:

Compute Provenance:


10) Closing Thoughts

EXP-031 didn't prove our system works perfectly.
It proved something more important:
The pipeline can detect when models disagree in biologically meaningful ways.
In the OOD regime, disagreement IS the data.
And catching that disagreement before wet-lab validation?
That's the difference between confidence theater and verification engineering.
🙏Big thanks to Sydney Gordon for the question that sparked this work.

If you're building multi-model validation systems, I'd love to hear:
When do you average model scores vs. investigate their disagreement?
Drop a comment below.

This work is part of RExSyn (AI-driven drug discovery platform). Full code will be open-sourced upon publication.

Reaudit Correction (2026-02-16)

Before starting EXP-032, we reran EXP-031 end-to-end and rechecked the full input-output chain.
What did not change:
Drift ordering was reproduced: A < B < C (0.10, 0.17, 0.25)
Final governance decision remained KEEP_OBSERVER
What was corrected:
In the original Stage-2 base_pass_ood mode, n_ood_positive was 0.
So the block_recall = 0.0 reported in that mode should not be read as validated OOD-blocking performance.
It is a denominator/computability issue under that mode.
Scope clarification:
EXP-031 should be read as an observer-phase stress test for disagreement detection.
It should not be read as a production gate-promotion claim.
This correction updates interpretation discipline, not the core drift outcomes.
 
