How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits

STEM-AI v1.1.2 binds a bio/medical AI repository audit to a machine-checkable memory contract, then demonstrates it on a real open-source bioinformatics repository.

Series

STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts (Part 4 of 4)
In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.
The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.
The more useful lesson was this:
Text-only review is too weak for bio/medical AI. You have to inspect the code path.
That worked.
But it exposed the next problem.
If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?
LLMs drift. One session can enforce a clinical boundary strictly. Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.
STEM-AI v1.1.2 is my answer to that problem.
It does not try to make the LLM deterministic by writing a longer prompt.
It binds the audit to a memory contract.

What v1.1.2 adds

standard audit vs Bio/Medical AI audit
The idea is simple:
before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.
The v1.1.2 layer includes:
  • memory/mica.yaml -- composition contract
  • memory/stem-ai.mica.v1.1.2.json -- machine-checkable memory archive
  • memory/stem-ai-playbook.v1.1.2.md -- session playbook and drift guard
  • memory/stem-ai-lessons.v1.1.2.md -- historical failure-mode archive
  • spec/STEM-AI_v1.1.2_CORE.md -- canonical audit spec
The contract pins 18 invariants.
Examples:
  • Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.
  • Stage weights are fixed.
  • Tier boundaries are fixed.
  • T0_HARD_FLOOR cannot be bypassed.
  • Stage 2 may use external evidence or Stage 2R repo-local consistency in LOCAL_ANALYSIS mode.
  • Governance overlay cannot raise the formal base tier.
  • C1-C4 code-integrity checks only run in LOCAL_ANALYSIS mode.
  • Mandatory clinical-use disclaimers cannot be omitted.
This is not a claim that the LLM becomes perfectly deterministic.
It is a narrower claim:
The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.
That is the useful layer.

What "loading the contract" means

Forcing the auditor to operate inside a machine-checkable memory contract
MICA is not hidden model memory.
It is also not a claim that the model provider changed the LLM.
In v1.1.2, "loading the contract" means the audit session starts by reading the fixed set of repository files listed above before it is allowed to score the target.
The auditor then performs a pre-execution contract test:
  • confirm the canonical spec exists
  • confirm the memory archive exists
  • confirm the invariant count is 18
  • confirm the fixed tier boundaries are present
  • confirm the Stage 2 / Stage 2R lane rule is present
  • confirm Stage 3G cannot raise the formal tier
  • confirm C1-C4 mode gating is active
Only after that does the audit proceed.
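The pre-flight sequence above can be sketched as a small Python check. The file paths follow the v1.1.2 layout listed earlier; the JSON field names (`invariants`, `tier_boundaries`) are assumptions for illustration, not the actual archive schema.

```python
import json
from pathlib import Path

REQUIRED_FILES = [
    "spec/STEM-AI_v1.1.2_CORE.md",        # canonical audit spec
    "memory/stem-ai.mica.v1.1.2.json",    # machine-checkable memory archive
]

def preflight(repo_root: str) -> None:
    """Refuse to score unless the contract files load and reconcile."""
    root = Path(repo_root)
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            raise SystemExit(f"contract file missing: {rel} -- stop before scoring")

    archive = json.loads((root / "memory/stem-ai.mica.v1.1.2.json").read_text())

    # Invariant count must match the pinned contract.
    if len(archive.get("invariants", [])) != 18:
        raise SystemExit("invariant count != 18 -- stop before scoring")

    # Fixed tier boundaries must be present.
    if "tier_boundaries" not in archive:
        raise SystemExit("tier boundaries missing -- stop before scoring")
```

The important property is the failure mode: any missing or inconsistent contract file halts the session before a single score is produced.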
This does not make the LLM mathematically deterministic.
It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.
That is the difference between "please be consistent" and "execute this versioned contract."

The audit workflow

STEM-AI v1.1.2 runs as a structured audit workflow:
STEM-AI v1.1.2 workflow
In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.
It can inspect:
  • package metadata
  • workflow files
  • test definitions
  • dependency manifests
  • source-code paths
  • deprecated or dead-code paths
  • exception handling
  • credential patterns
  • provenance and hash-checking logic
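One of the inspections above, the credential-pattern check, can be sketched with a few regular expressions. The patterns here are illustrative assumptions; the real C1-C4 checks are defined by the contract, not by this snippet.

```python
import re

# Illustrative secret-like patterns -- not the contract's actual rule set.
CREDENTIAL_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access-key-id shape
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{8,}"),
]

def scan_for_credentials(text: str, path: str = "<mem>") -> list[str]:
    """Return human-readable findings for credential-like strings."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for pat in CREDENTIAL_PATTERNS:
            if pat.search(line):
                findings.append(f"{path}:{lineno}: matches {pat.pattern!r}")
    return findings
```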
The output is intentionally split into two files: a human-readable report and a machine-readable JSON result.
Separating subjective reasoning from verifiable mathematics
That split matters.
The report explains the reasoning.
The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.

A real target audit, not a synthetic example

For this v1.1.2 demonstration, I used a real public bioinformatics repository.
The target is not the protagonist of this post.
It is only the specimen used to show the audit workflow against a real bioinformatics codebase.
The local audit produced a bounded result against a pinned snapshot of the target repository.
This is the important part: the audit did not ask, "Does this README sound trustworthy?"
It asked:
  • Do README claims match actual package metadata and entry points?
  • Are there real CI and domain-specific tests?
  • Are dependencies reproducible enough?
  • Are there credential leaks?
  • Are there deprecated patient-adjacent paths?
  • Do clinical-adjacent output paths fail closed?
  • Does the repository include governance evidence, or only governance absence?
That is where STEM-AI is useful.

The score object

The machine-readable result records the score as an explicit object.
External Stage 2 is explicitly represented as null for this local-only audit.
That does not mean cross-platform consistency is unimportant.
It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.
Stage 2R asks whether the repository's own surfaces agree with each other:
  • README vs package metadata and CLI entry points
  • README vs docs, tutorials, and troubleshooting
  • README test claims vs CI workflow and test definitions
  • clinical-adjacent outputs vs local intended-use boundaries
The contract defines the fixed-weight calculation that combines the stage scores.
The final tier is therefore T2 Caution.
Not because the prose sounded balanced.
Because the contract math forces that result.
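The fixed-weight step can be sketched as follows. The weights and stage names here are placeholders for illustration; the real values are pinned in the contract files, not invented per session.

```python
# Placeholder weights -- the real values are pinned by the v1.1.2 contract.
STAGE_WEIGHTS = {"stage1": 0.3, "stage2": 0.3, "stage3": 0.4}

def weighted_score(stage_scores: dict) -> float:
    """Fixed-weight combination. In LOCAL_ANALYSIS mode the external
    stage2 score is None and Stage 2R substitutes on the same lane."""
    total = 0.0
    for stage, weight in STAGE_WEIGHTS.items():
        score = stage_scores[stage]
        if score is None and stage == "stage2":
            score = stage_scores["stage2r"]  # repo-local consistency lane
        total += weight * score
    return round(total, 3)
```

Because the weights are constants in a versioned file, two reviewers recomputing the score from the same JSON must arrive at the same number.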

Why the T0 hard floor did not trigger

T0_HARD_FLOOR is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.
In simplified form: if the repository makes CA-DIRECT clinical claims without the mandated safeguards, the tier is forced to T0 regardless of the weighted score.
Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.
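That floor rule can be sketched like this. The claim and safeguard labels are illustrative, not the contract's exact flag names.

```python
def apply_t0_hard_floor(tier: str, claims: set, safeguards: set) -> str:
    """T0_HARD_FLOOR sketch: CA-DIRECT clinical claims without the mandated
    safeguards force tier T0, no matter how good the weighted score was."""
    CA_DIRECT = {"diagnosis", "treatment_recommendation", "triage",
                 "risk_scoring", "clinical_decision_support"}
    REQUIRED_SAFEGUARDS = {"clinical_use_disclaimer", "human_oversight"}
    if claims & CA_DIRECT and not REQUIRED_SAFEGUARDS <= safeguards:
        return "T0"
    return tier
```

The floor runs after the weighted calculation, which is why good prose cannot buy a dangerous repository out of rejection.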
The audited repository did not trigger that floor because STEM-AI classified it as clinical-adjacent rather than CA-DIRECT.
It produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.
So the result is not T0.
But it is also not high-trust.
The bounded result is T2 Caution.


Code-integrity findings

The same JSON records the C1-C4 LOCAL_ANALYSIS checks as explicit findings.
That is the difference between a general review and a code-path audit.
A text review can say:
The project appears technically mature.
A code-path audit can say:
Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.
That is a more useful governance object.
It is not a certificate.
It is a map of what a reviewer should trust, distrust, or inspect next.

A small Python verifier

Here is a small dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It does not need target private code or patient data; it only checks the machine-readable audit result.
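A minimal reconstruction of such a verifier looks like the sketch below. The JSON field names (`score`, `weights`, `stage_scores`, `final_score`) are assumptions about the audit schema rather than the published v1.1.2 layout, and the digest is a SHA-256 over the canonicalized score object.

```python
import hashlib
import json

def verify(audit_path: str) -> str:
    """Recompute the weighted score from the audit JSON and return the
    SHA-256 digest of the canonicalized score object.
    Field names here are assumed, not the published v1.1.2 schema."""
    with open(audit_path, "rb") as fh:
        audit = json.load(fh)

    score_obj = audit["score"]
    weights = score_obj["weights"]       # fixed by the contract
    stages = score_obj["stage_scores"]   # stage2 may be null (LOCAL_ANALYSIS)

    total = 0.0
    for stage, weight in weights.items():
        value = stages[stage]
        if value is None:                # external Stage 2 not collected
            value = stages["stage2r"]    # repo-local consistency lane
        total += weight * value

    recomputed = round(total, 3)
    recorded = score_obj["final_score"]
    if recomputed != recorded:
        raise SystemExit(f"score mismatch: {recomputed} != {recorded}")

    # Canonical serialization so the digest is stable across runs.
    canonical = json.dumps(score_obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Run against the audit JSON, it either prints a digest that matches the expected one or halts on a score mismatch; that is the whole trust claim of the verifier.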
Expected digest:

Why this matters

Bio/medical AI governance is full of language that sounds safe but is hard to verify:
  • "research use only"
  • "not medical advice"
  • "validated pipeline"
  • "clinical-grade"
  • "responsible AI"
  • "human-in-the-loop"
Those phrases are not enough.
STEM-AI asks for observable structure:
  • source-code reality
  • test reality
  • CI reality
  • dependency reality
  • clinical boundary reality
  • governance artifact reality
  • code-integrity reality
v1.1.2 adds another layer:
auditor reality.
The AI auditor itself has to load a memory contract before it scores.
That is what MICA is for.
The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.
Not hype.
Not rejection by default.
A bounded trust judgment with evidence paths.

What comes next

The follow-on lane should:
  • provision the target dependency environment
  • run selected target tests in a controlled shell
  • capture command, exit code, environment hash, and output digest
  • attach a replay manifest to experiment_results.json
  • keep runtime evidence separate from source/document/CI evidence
For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.
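The evidence-capture step in that lane can be sketched as follows. The manifest keys are an illustrative shape, not the `experiment_results.json` schema.

```python
import hashlib
import subprocess

def capture_run(cmd: list) -> dict:
    """Run one target test command in a controlled shell and record
    replayable evidence: the command, its exit code, and output digests."""
    proc = subprocess.run(cmd, capture_output=True)
    return {
        "command": cmd,
        "exit_code": proc.returncode,
        "stdout_sha256": hashlib.sha256(proc.stdout).hexdigest(),
        "stderr_sha256": hashlib.sha256(proc.stderr).hexdigest(),
    }
```

Digesting the output rather than storing it keeps the manifest small while still letting a reviewer detect a non-reproducing replay.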


Final thought

STEM-AI is not a clinical certifier.
It is also not trying to replace scientific review, regulatory review, or domain experts.
Its role is narrower: make the governance conversation start from observable evidence instead of presentation quality.
In practice, that means asking:
  • What did the repository claim?
  • What does the code actually implement?
  • Do the local surfaces agree with each other?
  • Are the tests domain-specific or merely infrastructural?
  • Are clinical-adjacent boundaries explicit?
  • Can the auditor's own scoring logic be inspected?
That is where I think STEM-AI belongs in AI governance.
Not as the final authority.
As the evidence gate before authority is invoked.
It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:
Does this repository establish enough observable trust to be considered, contained, or rejected?

Next Step

If your AI system works in demos but still feels fragile, start here.

Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.

Direct founder contact · Response within 1-2 business days
