When Adding Chai-1 and Boltz-2 Exposed Hidden Model Disagreement (Trinity Protocol Part 2)


See how adding Chai-1 and Boltz-2 to an AlphaFold workflow exposed hidden model disagreement, increased drift, and revealed why failed convergence can be the most valuable signal in computational biology.

Series: RExSyn Nexus-Bio, Part 5 of 10
Trinity Protocol #2
In Part 1, I introduced our AF3/AF2 orchestration—a system designed to treat AI protein predictions not as gospel, but as hypotheses that need validation.
Two days ago, Sydney Gordon (Principal Scientist at Immunome) asked the question that sparked this entire sequel:
"Why not use Chai-1 and Boltz-2 as cross-validation?"
It was brilliant. We did exactly that.
The result? Instead of getting clearer consensus, we got structured chaos.
And that chaos turned out to be the most valuable data we've gathered all month.

A Hypothesis of Consensus

1) The Setup: Three Ways to Be Wrong

We wanted to see what happens when we throw Out-of-Distribution (OOD) targets at a multi-model ensemble.
The question: Who would agree with whom when nobody really knows the answer?
We set up three "arms":
| Arm | Primary | Validators | Status |
| --- | --- | --- | --- |
| A | AF3 | AF2 | The baseline |
| B | AF3 | Boltz-2 + Chai-1 | The kitchen sink |
| C | AF2 | Boltz-2 + Chai-1 | The control (what if AF3 is the problem?) |
Target: 52-residue protein-ligand complex (aspirin binding site)
The hook: We deliberately chose a low-homology target where we suspected models might hallucinate confidence.
All three arms shared:
  • Same sequence family
  • Same random seed (20260208)
  • Same governance (LawBinder in observer mode)
  • Same drift gate: ≤ 0.05
Observer mode means: "Watch what happens, but don't block anything. Collect data."

2) The Result: Disagreement is a Feature, Not a Bug

Usually in software, "failed convergence" sounds like a bug.
In biology, it's a signal.

Final Metrics:

| Arm | Final Drift |
| --- | --- |
| A (AF3 + AF2) | 0.10 |
| B (AF3 + Boltz-2 + Chai-1) | 0.17 |
| C (AF2 + Boltz-2 + Chai-1) | 0.25 |

The pattern: Adding more validators increased disagreement.
This was counterintuitive. We expected:
"More models → more consensus → lower drift"
What we got:
"More models → hidden disagreements exposed → higher drift"

The Moment of Realization

Inside Arm B
When I first saw Arm B's drift jump from 0.10 to 0.17, I thought:
"Did I break something?"
Then I looked at the validator scores:
Boltz was 2x more confident than everyone else on the same structure.
That's not noise. That's biology telling you something.
In a traditional pipeline, this gap would simply have been averaged away.
But that 0.74 vs 0.36 gap? That's a 2-month wet-lab delay if you choose wrong.

3) The Surprise: AlphaGenome Couldn't Fix It

Here's where it got really interesting.
AlphaGenome (our redesign engine) was watching this trainwreck in real-time. It said:
"I see low-confidence regions. Let me propose stabilizing mutations."

What AG Did:

Cycle 1 → 2:
  • Detected: residues 2, 4, 17 (pLDDT < 60)
  • Proposed: S2A, S4A, P17A (hydrophobic core stabilization)
  • Applied: sequence updated for next cycle
Cycle 2 → 3:
  • Detected: residues 18, 20, 21 (still problematic)
  • Proposed: P18A, P20A, V21L (further optimization)
  • Applied: sequence updated
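AG's detect-and-propose loop can be sketched roughly like this. This is a simplified stand-in, not the AlphaGenome implementation: the function name, the alanine-only substitution scheme, and the pLDDT values are illustrative; only the pLDDT < 60 cutoff and the Cycle 1 residue pattern come from the experiment above.

```python
def propose_alanine_scan(sequence: str, plddt: list[float],
                         cutoff: float = 60.0) -> list[str]:
    """Flag residues below the pLDDT cutoff and propose alanine
    substitutions (simplified stand-in for AG's redesign step)."""
    proposals = []
    for i, (aa, conf) in enumerate(zip(sequence, plddt), start=1):
        if conf < cutoff and aa != "A":          # skip residues already alanine
            proposals.append(f"{aa}{i}A")
    return proposals

# Made-up per-residue pLDDT values that reproduce the Cycle 1 pattern:
seq = "MSTS" + "A" * 12 + "P"
conf = [90, 50, 90, 55] + [90] * 12 + [58]
propose_alanine_scan(seq, conf)  # → ["S2A", "S4A", "P17A"]
```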
The twist?
Drift kept increasing despite AG's "fixes".
This told us something critical:
The target isn't just tricky—it's genuinely outside the training distribution of current AI models.
Sequence tweaks alone can't bridge fundamental model disagreement.
This is biology speaking, not code failing.
And catching this before wet-lab synthesis? That's months of saved time and reagent costs.

4) Why 0.07 is Actually Huge (Math + Biology)

In Arm A, AF3 (0.40) and AF2 (0.33) differed by 0.07.
You might think: "0.07? That's basically noise, right?"
Wrong.

The Context Math:

In high-confidence regime (>0.7):
  • 0.07 delta = ~10% relative gap
  • Models agree on the fold, just differ on details
In low-confidence regime (0.3–0.4):
  • 0.07 delta = 21% relative gap (0.07 / 0.33)
  • Models disagree on fundamental topology
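The regime arithmetic is trivial but worth making explicit. The low-regime numbers (0.40, 0.33) are Arm A's scores from above; the high-regime pair (0.77, 0.70) is an illustrative example, not a measured result:

```python
def relative_gap(a: float, b: float) -> float:
    """Relative disagreement between two confidence scores,
    measured against the smaller (more pessimistic) score."""
    return abs(a - b) / min(a, b)

relative_gap(0.40, 0.33)  # low-confidence regime: 0.07 delta → ~21% gap
relative_gap(0.77, 0.70)  # high-confidence regime: same 0.07 → ~10% gap
```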

What This Means in Biology:

When pTM is in the 0.3–0.4 range, you're typically seeing:
  • Loop placement uncertainty
  • Hinge region flexibility
  • Possible intrinsic disorder
  • Or: the model has no idea
A 21% disagreement here often maps to:
  • AF3 thinks: "This loop folds left"
  • AF2 thinks: "This loop folds right"
You shouldn't average those. You should stop.
Because if you average them and proceed to synthesis:
  • Best case: you waste time on a structure that needs refinement
  • Worst case: you validate a hallucination and build a pipeline around it

5) For Engineers: The Logic Under the Hood

For the AI researchers and devs reading this, here's the actual production logic we used.

5.1 The "Effective Drift" Calculation

We don't just subtract scores. We add a penalty for validator disagreement.
If Boltz and Chai are fighting with each other, the system becomes extra cautious.
Why this matters: Traditional pipelines ignore validator variance. We penalize it.
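Since the original snippet didn't survive the export, here is a minimal sketch of the idea: distance to validator consensus, plus a penalty that grows with validator variance. The function name, the penalty weight, and the primary score are assumptions; only the penalize-disagreement concept and the Boltz/Chai values are from the experiment:

```python
import statistics

def effective_drift(primary: float, validators: list[float],
                    penalty_weight: float = 0.5) -> float:
    """Distance from primary to validator consensus, plus a penalty
    that grows when the validators disagree among themselves."""
    consensus = statistics.mean(validators)
    disagreement = statistics.pstdev(validators)  # zero when validators agree
    return abs(primary - consensus) + penalty_weight * disagreement

# Arm B-style inputs: Boltz 0.74 vs Chai 0.36 (primary score illustrative).
# The large validator spread inflates the drift even though the
# consensus (0.55) sits close to the primary score.
effective_drift(0.40, [0.74, 0.36])
```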

5.2 Detecting Directional Divergence

Random noise goes up and down.
Structured divergence moves in one direction.
This is how we distinguished real divergence from measurement jitter in Arm A:
If drift oscillates (0.07 → 0.04 → 0.09), it's probably implementation noise.
If drift increases monotonically (0.07 → 0.09 → 0.10), the models fundamentally disagree.
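The oscillation-vs-monotone test is a few lines (a sketch; the real pipeline's function name and any windowing are not shown in the post):

```python
def is_directional(drift_series: list[float]) -> bool:
    """Structured divergence moves one way across cycles;
    measurement jitter oscillates."""
    deltas = [b - a for a, b in zip(drift_series, drift_series[1:])]
    return all(d > 0 for d in deltas) or all(d < 0 for d in deltas)

is_directional([0.07, 0.09, 0.10])  # True  → models fundamentally disagree
is_directional([0.07, 0.04, 0.09])  # False → probably implementation noise
```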

5.3 The Boltz CIF Fallback (Production Reality)

In production, things break. Boltz didn't always give us a clean confidence_standard.json.
We could've:
  1. Dropped Boltz as a validator → lose diversity
  2. Crashed the pipeline → lose the experiment
Instead, we built a traceable fallback:
Result: Chain A pLDDT = 76.01 → PTM proxy = 0.76
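The fallback code itself didn't survive the export; here is a minimal sketch of the idea, assuming Boltz writes per-atom pLDDT into the mmCIF `B_iso_or_equiv` (B-factor) column, as AlphaFold-family tools conventionally do. The column handling is simplified, not the production parser:

```python
import statistics

def ptm_proxy_from_cif(cif_text: str, chain: str = "A") -> float:
    """Fallback when confidence JSON is missing: average the per-atom
    pLDDT values stored in the mmCIF B-factor column, scale to [0, 1]."""
    cols: list[str] = []
    plddts: list[float] = []
    for raw in cif_text.splitlines():
        line = raw.strip()
        if line.startswith("_atom_site."):
            cols.append(line.split(".", 1)[1])        # collect loop column names
        elif cols and line.startswith(("ATOM", "HETATM")):
            row = dict(zip(cols, line.split()))
            if row.get("auth_asym_id") == chain:
                plddts.append(float(row["B_iso_or_equiv"]))
    mean_plddt = statistics.mean(plddts)              # e.g. 76.01 for chain A
    return round(mean_plddt / 100.0, 2)               # pLDDT 76.01 → proxy 0.76
```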
Crucially, we stored full provenance for every fallback decision.

Why this matters: Anyone can reproduce our decision tree.

6) The Laptop That Ran This Experiment

One of the most exciting parts? The infrastructure.
We didn't use a GPU cluster. We didn't rent a cloud farm.
Local orchestration: ASUS TUF FX505GD
  • CPU: Intel i7-8750H
  • GPU: NVIDIA GTX 1050 (4GB) ← not even used for orchestration
  • RAM: 16GB DDR4
Model execution: Free/low-cost cloud services
  • AF3: AlphaFold Server (free web API)
  • AF2: Google Colab (free tier, ColabFold)
  • Boltz: Google Colab (free tier, official notebook)
  • Chai: Chai Discovery API (free)
Total cost: ~$12 in cloud credits

The Insight:

Your laptop doesn't run the models.
Your laptop coordinates them.
The expensive inference happens elsewhere. Your machine handles:
  • Pipeline logic
  • Governance (LawBinder)
  • Redesign proposals (AlphaGenome)
  • Artifact management
This removes the infrastructure barrier for any biotech team.

7) What's Next: Adaptive Gates (EXP-032)

EXP-031 taught us: one threshold doesn't fit all confidence regimes.

The Problem:

Using a single gate (0.05) treats the high-confidence and low-confidence regimes the same. That's wrong.

The Solution: Adaptive Thresholds

EXP-032 will test:
  • 25 in-distribution samples (expect convergence)
  • 25 OOD stress samples (expect divergence)
  • Adaptive gates based on primary model confidence
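One plausible shape for such a gate is to scale the allowed absolute drift with the primary model's confidence, so the relative bar stays constant across regimes. This is a hypothesis for EXP-032, not its published design; the 10% default echoes the relative-gap math from section 4:

```python
def adaptive_gate(primary_confidence: float,
                  relative_tolerance: float = 0.10) -> float:
    """Allowed absolute drift scales with confidence, so the same
    ~10% relative disagreement is tolerated in every regime."""
    return relative_tolerance * primary_confidence

adaptive_gate(0.70)  # high-confidence regime → gate 0.07
adaptive_gate(0.33)  # low-confidence regime  → gate ~0.033
```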

8) The Rexsyn Value Proposition: Managing Biological Risk

Most pipelines ask:
"What's the highest confidence score we can justify?"
We ask:
"Do models agree? If not, what is that disagreement telling us?"
When Boltz says 0.74 and Chai says 0.36 on the same structure:
Standard approach: average → 0.50 → "acceptable" → proceed to synthesis
Rexsyn approach: 2x gap → "investigate" → flag for human review
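That branch can be stated as a decision rule. The 2x ratio gate comes from the post; the function name and structure are illustrative, not the Rexsyn API:

```python
def triage(validator_scores: dict[str, float], ratio_gate: float = 2.0) -> str:
    """Flag for human review when validators span a >= ratio_gate
    multiplicative gap, instead of silently averaging the scores."""
    hi, lo = max(validator_scores.values()), min(validator_scores.values())
    if lo > 0 and hi / lo >= ratio_gate:
        return "investigate"
    return "proceed"

triage({"boltz2": 0.74, "chai1": 0.36})  # → "investigate" (0.74 / 0.36 ≈ 2.06)
```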

Why This Matters for Biotech Teams:

Rexsyn doesn't just run models. It manages biological risk.
By detecting disagreement early, we save teams from:
  • Months of failed wet-lab validation (synthesize → test → fail → redesign)
  • Wasted reagent costs ($500–$5,000 per protein construct)
  • Opportunity cost (chasing wrong leads instead of promising candidates)

The Economics:

The difference: We fail fast in silico, not slow in the lab.

The Trust Factor:

In drug discovery, confidence theater is expensive.
Reporting "0.85 confidence" doesn't help if:
  • Model A says 0.95
  • Model B says 0.75
  • They disagree on which loop folds where
We built Rexsyn to say "I don't know" when models disagree.
Because in biology, admitting uncertainty is more valuable than hallucinating confidence.

9) Reproducibility Checklist

Required Artifacts:

Key Settings:

Compute Provenance:


10) Closing Thoughts

EXP-031 didn't prove our system works perfectly.
It proved something more important:
The pipeline can detect when models disagree in biologically meaningful ways.
In the OOD regime, disagreement IS the data.
And catching that disagreement before wet-lab validation?
That's the difference between confidence theater and verification engineering.
🙏Big thanks to Sydney Gordon for the question that sparked this work.

If you're building multi-model validation systems, I'd love to hear:
When do you average model scores vs. investigate their disagreement?
Drop a comment below.

This work is part of RExSyn (AI-driven drug discovery platform). Full code will be open-sourced upon publication.

Reaudit Correction (2026-02-16)

Before starting EXP-032, we reran EXP-031 end-to-end and rechecked the full input-output chain.
What did not change:
Drift ordering was reproduced: A < B < C (0.10, 0.17, 0.25)
Final governance decision remained KEEP_OBSERVER
What was corrected:
In the original Stage-2 base_pass_ood mode, n_ood_positive was 0.
So the block_recall = 0.0 reported in that mode should not be read as validated OOD-blocking performance.
It is a denominator/computability issue under that mode.
Scope clarification:
EXP-031 should be read as an observer-phase stress test for disagreement detection.
It should not be read as a production gate-promotion claim.
This correction updates interpretation discipline, not the core drift outcomes.
 
