How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits

STEM-AI v1.1.2 binds a bio/medical AI repository audit to a machine-checkable memory contract, then demonstrates it on a real open-source bioinformatics repository.

Series

STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts (Part 4 of 4)
In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.
The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.
The more useful lesson was this:
Text-only review is too weak for bio/medical AI. You have to inspect the code path.
That worked.
But it exposed the next problem.
If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?
LLMs drift. One session can enforce a clinical boundary strictly. Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.
STEM-AI v1.1.2 is my answer to that problem.
It does not try to make the LLM deterministic by writing a longer prompt.
It binds the audit to a memory contract.

What v1.1.2 adds

standard audit vs Bio/Medical AI audit
The idea is simple:
before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.
The v1.1.2 layer includes:
  • memory/mica.yaml -- composition contract
  • memory/stem-ai.mica.v1.1.2.json -- machine-checkable memory archive
  • memory/stem-ai-playbook.v1.1.2.md -- session playbook and drift guard
  • memory/stem-ai-lessons.v1.1.2.md -- historical failure-mode archive
  • spec/STEM-AI_v1.1.2_CORE.md -- canonical audit spec
The contract pins 18 invariants.
Examples:
  • Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.
  • Stage weights are fixed.
  • Tier boundaries are fixed.
  • T0_HARD_FLOOR cannot be bypassed.
  • Stage 2 may use external evidence or Stage 2R repo-local consistency in LOCAL_ANALYSIS mode.
  • Governance overlay cannot raise the formal base tier.
  • C1-C4 code-integrity checks only run in LOCAL_ANALYSIS mode.
  • Mandatory clinical-use disclaimers cannot be omitted.
This is not a claim that the LLM becomes perfectly deterministic.
It is a narrower claim:
The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.
That is the useful layer.

What "loading the contract" means

Forcing the auditor to operate inside a machine-checkable memory contract
MICA is not hidden model memory.
It is also not a claim that the model provider changed the LLM.
In v1.1.2, "loading the contract" means the audit session starts by reading the fixed set of repository files listed above before it is allowed to score the target.
The auditor then performs a pre-execution contract test:
  • confirm the canonical spec exists
  • confirm the memory archive exists
  • confirm the invariant count is 18
  • confirm the fixed tier boundaries are present
  • confirm the Stage 2 / Stage 2R lane rule is present
  • confirm Stage 3G cannot raise the formal tier
  • confirm C1-C4 mode gating is active
Only after that does the audit proceed.
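The pre-flight sequence above can be sketched as a small Python check. The file paths follow the v1.1.2 layout listed earlier; the JSON field names (`invariants`, `tier_boundaries`) are assumptions for illustration, not the actual archive schema.

```python
import json
from pathlib import Path

REQUIRED_FILES = [
    "spec/STEM-AI_v1.1.2_CORE.md",        # canonical audit spec
    "memory/stem-ai.mica.v1.1.2.json",    # machine-checkable memory archive
]

def preflight(repo_root: str) -> None:
    """Refuse to score unless the contract files load and reconcile."""
    root = Path(repo_root)
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            raise SystemExit(f"contract file missing: {rel} -- stop before scoring")

    archive = json.loads((root / "memory/stem-ai.mica.v1.1.2.json").read_text())

    # Invariant count must match the pinned contract.
    if len(archive.get("invariants", [])) != 18:
        raise SystemExit("invariant count != 18 -- stop before scoring")

    # Fixed tier boundaries must be present.
    if "tier_boundaries" not in archive:
        raise SystemExit("tier boundaries missing -- stop before scoring")
```

The important property is the failure mode: any missing or inconsistent contract file halts the session before a single score is produced.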
This does not make the LLM mathematically deterministic.
It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.
That is the difference between "please be consistent" and "execute this versioned contract."

The audit workflow

STEM-AI v1.1.2 runs as a structured audit workflow:
STEM-AI v1.1.2 workflow
In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.
It can inspect:
  • package metadata
  • workflow files
  • test definitions
  • dependency manifests
  • source-code paths
  • deprecated or dead-code paths
  • exception handling
  • credential patterns
  • provenance and hash-checking logic
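One of the inspections above, the credential-pattern check, can be sketched with a few regular expressions. The patterns here are illustrative assumptions; the real C1-C4 checks are defined by the contract, not by this snippet.

```python
import re

# Illustrative secret-like patterns -- not the contract's actual rule set.
CREDENTIAL_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access-key-id shape
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{8,}"),
]

def scan_for_credentials(text: str, path: str = "<mem>") -> list[str]:
    """Return human-readable findings for credential-like strings."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), 1):
        for pat in CREDENTIAL_PATTERNS:
            if pat.search(line):
                findings.append(f"{path}:{lineno}: matches {pat.pattern!r}")
    return findings
```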
The output is intentionally split into two files: a human-readable report and a machine-readable JSON result.
Separating subjective reasoning from verifiable mathematics
That split matters.
The report explains the reasoning.
The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.

A real target audit, not a synthetic example

For this v1.1.2 demonstration, I used a real public bioinformatics repository.
The target is not the protagonist of this post.
It is only the specimen used to show the audit workflow against a real bioinformatics codebase.
The local audit produced a bounded result against a pinned snapshot of the target repository.
This is the important part: the audit did not ask, "Does this README sound trustworthy?"
It asked:
  • Do README claims match actual package metadata and entry points?
  • Are there real CI and domain-specific tests?
  • Are dependencies reproducible enough?
  • Are there credential leaks?
  • Are there deprecated patient-adjacent paths?
  • Do clinical-adjacent output paths fail closed?
  • Does the repository include governance evidence, or only governance absence?
That is where STEM-AI is useful.

The score object

The machine-readable result records the score as an explicit object.
External Stage 2 is explicitly represented as null for this local-only audit.
That does not mean cross-platform consistency is unimportant.
It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.
Stage 2R asks whether the repository's own surfaces agree with each other:
  • README vs package metadata and CLI entry points
  • README vs docs, tutorials, and troubleshooting
  • README test claims vs CI workflow and test definitions
  • clinical-adjacent outputs vs local intended-use boundaries
The contract defines the fixed-weight calculation that combines the stage scores.
The final tier is therefore T2 Caution.
Not because the prose sounded balanced.
Because the contract math forces that result.
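The fixed-weight step can be sketched as follows. The weights and stage names here are placeholders for illustration; the real values are pinned in the contract files, not invented per session.

```python
# Placeholder weights -- the real values are pinned by the v1.1.2 contract.
STAGE_WEIGHTS = {"stage1": 0.3, "stage2": 0.3, "stage3": 0.4}

def weighted_score(stage_scores: dict) -> float:
    """Fixed-weight combination. In LOCAL_ANALYSIS mode the external
    stage2 score is None and Stage 2R substitutes on the same lane."""
    total = 0.0
    for stage, weight in STAGE_WEIGHTS.items():
        score = stage_scores[stage]
        if score is None and stage == "stage2":
            score = stage_scores["stage2r"]  # repo-local consistency lane
        total += weight * score
    return round(total, 3)
```

Because the weights are constants in a versioned file, two reviewers recomputing the score from the same JSON must arrive at the same number.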

Why the T0 hard floor did not trigger

T0_HARD_FLOOR is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.
In simplified form: if the repository makes CA-DIRECT clinical claims without the mandated safeguards, the tier is forced to T0 regardless of the weighted score.
Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.
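That floor rule can be sketched like this. The claim and safeguard labels are illustrative, not the contract's exact flag names.

```python
def apply_t0_hard_floor(tier: str, claims: set, safeguards: set) -> str:
    """T0_HARD_FLOOR sketch: CA-DIRECT clinical claims without the mandated
    safeguards force tier T0, no matter how good the weighted score was."""
    CA_DIRECT = {"diagnosis", "treatment_recommendation", "triage",
                 "risk_scoring", "clinical_decision_support"}
    REQUIRED_SAFEGUARDS = {"clinical_use_disclaimer", "human_oversight"}
    if claims & CA_DIRECT and not REQUIRED_SAFEGUARDS <= safeguards:
        return "T0"
    return tier
```

The floor runs after the weighted calculation, which is why good prose cannot buy a dangerous repository out of rejection.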
The audited repository did not trigger that floor because STEM-AI classified it as clinical-adjacent rather than CA-DIRECT.
It produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.
So the result is not T0.
But it is also not high-trust.
The bounded result is T2 Caution.


Code-integrity findings

The same JSON records the C1-C4 LOCAL_ANALYSIS checks as explicit findings.
That is the difference between a general review and a code-path audit.
A text review can say:
The project appears technically mature.
A code-path audit can say:
Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.
That is a more useful governance object.
It is not a certificate.
It is a map of what a reviewer should trust, distrust, or inspect next.

A small Python verifier

Here is a small dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It does not need target private code or patient data; it only checks the machine-readable audit result.
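A minimal reconstruction of such a verifier looks like the sketch below. The JSON field names (`score`, `weights`, `stage_scores`, `final_score`) are assumptions about the audit schema rather than the published v1.1.2 layout, and the digest is a SHA-256 over the canonicalized score object.

```python
import hashlib
import json

def verify(audit_path: str) -> str:
    """Recompute the weighted score from the audit JSON and return the
    SHA-256 digest of the canonicalized score object.
    Field names here are assumed, not the published v1.1.2 schema."""
    with open(audit_path, "rb") as fh:
        audit = json.load(fh)

    score_obj = audit["score"]
    weights = score_obj["weights"]       # fixed by the contract
    stages = score_obj["stage_scores"]   # stage2 may be null (LOCAL_ANALYSIS)

    total = 0.0
    for stage, weight in weights.items():
        value = stages[stage]
        if value is None:                # external Stage 2 not collected
            value = stages["stage2r"]    # repo-local consistency lane
        total += weight * value

    recomputed = round(total, 3)
    recorded = score_obj["final_score"]
    if recomputed != recorded:
        raise SystemExit(f"score mismatch: {recomputed} != {recorded}")

    # Canonical serialization so the digest is stable across runs.
    canonical = json.dumps(score_obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Run against the audit JSON, it either prints a digest that matches the expected one or halts on a score mismatch; that is the whole trust claim of the verifier.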
Expected digest:

Why this matters

Bio/medical AI governance is full of language that sounds safe but is hard to verify:
  • "research use only"
  • "not medical advice"
  • "validated pipeline"
  • "clinical-grade"
  • "responsible AI"
  • "human-in-the-loop"
Those phrases are not enough.
STEM-AI asks for observable structure:
  • source-code reality
  • test reality
  • CI reality
  • dependency reality
  • clinical boundary reality
  • governance artifact reality
  • code-integrity reality
v1.1.2 adds another layer:
auditor reality.
The AI auditor itself has to load a memory contract before it scores.
That is what MICA is for.
The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.
Not hype.
Not rejection by default.
A bounded trust judgment with evidence paths.

What comes next

The follow-on lane should:
  • provision the target dependency environment
  • run selected target tests in a controlled shell
  • capture command, exit code, environment hash, and output digest
  • attach a replay manifest to experiment_results.json
  • keep runtime evidence separate from source/document/CI evidence
For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.
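The evidence-capture step in that lane can be sketched as follows. The manifest keys are an illustrative shape, not the `experiment_results.json` schema.

```python
import hashlib
import subprocess

def capture_run(cmd: list) -> dict:
    """Run one target test command in a controlled shell and record
    replayable evidence: the command, its exit code, and output digests."""
    proc = subprocess.run(cmd, capture_output=True)
    return {
        "command": cmd,
        "exit_code": proc.returncode,
        "stdout_sha256": hashlib.sha256(proc.stdout).hexdigest(),
        "stderr_sha256": hashlib.sha256(proc.stderr).hexdigest(),
    }
```

Digesting the output rather than storing it keeps the manifest small while still letting a reviewer detect a non-reproducing replay.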


Final thought

STEM-AI is not a clinical certifier.
It is also not trying to replace scientific review, regulatory review, or domain experts.
Its role is narrower: make the governance conversation start from observable evidence instead of presentation quality.
In practice, that means asking:
  • What did the repository claim?
  • What does the code actually implement?
  • Do the local surfaces agree with each other?
  • Are the tests domain-specific or merely infrastructural?
  • Are clinical-adjacent boundaries explicit?
  • Can the auditor's own scoring logic be inspected?
That is where I think STEM-AI belongs in AI governance.
Not as the final authority.
As the evidence gate before authority is invoked.
It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:
Does this repository establish enough observable trust to be considered, contained, or rejected?

Next Step

If your AI system works in demos but still feels fragile, start here.

Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.

Direct founder contact · Response within 1-2 business days
