
How Auditing 10 Bio-AI Repositories Shaped STEM-AI
After auditing 10 open-source Bio-AI repositories, we found blind spots in STEM-AI and expanded it from text-only review to code-aware trust evaluation.
Series
STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts — Part 3 of 3

🧠 Reading path:
This post is part of a series.
(1) STEM-AI introduction — what the framework is and why we built it
(2) Technical audit report — full findings across 10 repositories
(3) Narrative summary — what those findings actually mean
What Text Could See — and What Code Revealed

In March 2026, we ran STEM-AI against 10 high-visibility open-source bio/medical AI repositories.
The framework did what it was designed to do. It surfaced missing disclaimers, absent CI, weak reproducibility signals, and public-facing governance gaps. Those findings mattered, and the scores were directionally right.
But when we reviewed the audits more carefully, one pattern kept appearing: some of the most consequential failures were not visible in the artifact surface at all. They only became obvious when we looked directly at the code.
One function we audited lives inside a repository presented as an AI-driven drug discovery workflow.
It does not generate molecules. It appends characters.
SMILES (Simplified Molecular Input Line Entry System) is a strict notation for molecular structure. A valid SMILES string encodes real geometry and bonding. Appending C produces a syntactically valid string that represents no real compound. The function runs without error, returns a list, and the pipeline continues downstream.
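The failure mode can be sketched in a few lines. The names below are hypothetical, for illustration only; the audited repository's actual code is not reproduced here:

```python
# Hypothetical sketch of the pattern: a "generator" that performs no
# chemistry and simply appends 'C' characters to the seed string.

def generate_molecules(seed_smiles: str, n: int = 3) -> list[str]:
    """Return syntactically parseable SMILES strings that encode no
    designed compound: each candidate is the seed plus appended 'C's."""
    return [seed_smiles + "C" * (i + 1) for i in range(n)]

candidates = generate_molecules("CCO")  # ethanol as the seed
# Each string parses as SMILES, so downstream steps run without error,
# but no molecular design has occurred.
```

Nothing in this function raises, and the returned list looks exactly like a list of candidate molecules to everything downstream.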
Our framework scored this repository T0. Correctly. But not because it saw this function.
It scored T0 because the README was missing disclaimers. The CI was absent. Reproducibility was undocumented. Text-path evaluation is designed to measure exactly that. It did.
The audit result was correct. The evidence surface had room to go deeper.
Running the audits showed us what code-path evaluation could add on top.
What Code Access Makes Visible

The drug discovery example was not unusual.
CellAgent's pipeline ends with a single final method call.
The method exists. Its body is `pass`. The pipeline completes without error and produces nothing. A text audit reading the README would have no way to know this.
BioAgents includes a rate limiter for external API calls. `USE_JOB_QUEUE` defaults to `false` in `.env.example`, so every default deployment has rate limiting disabled. The function name implies protection. In default operation, there is none.
The pattern across all three:
- The code looks governed.
- The behavior tells a different story.
- That story is only visible when you read the code.
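The stub pattern can be sketched with hypothetical names (this is an illustration of the shape, not the actual CellAgent code):

```python
# Hypothetical sketch: the final method exists, its body does nothing,
# and the pipeline "completes" while producing no output.

class AnalysisPipeline:
    def run_steps(self) -> None:
        ...  # earlier stages omitted from this sketch

    def export_results(self):
        pass  # the method exists; its body is a no-op

    def run(self):
        self.run_steps()
        # Returns None with no error raised; nothing downstream
        # distinguishes this from a successful export.
        return self.export_results()

result = AnalysisPipeline().run()
```

A README can describe `export_results` accurately as "the export step" and still say nothing about the fact that the step is empty.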
Text scores and code behavior can diverge. Knowing where and how they diverge is the next layer of evidence worth capturing.
Four Directions the Audits Opened
Reviewing all ten audits, we identified four areas where code-path evaluation could extend what text auditing already does well.
Direction 1: Clinical exposure is visible in imports, not just in README text.
A repository importing pharmacogenomics allele tables has clinical exposure regardless of what its README says. Detecting that dependency at the import level — rather than waiting for a disclaimer — lets the framework flag exposure earlier.
The key distinction is severity: a direct pharmacogenomics import (`CYP2D6`, `CPIC`) signals live patient-facing risk and is classified CA-DIRECT. A general-purpose medical imaging library like `pydicom` or `MONAI` is classified CA-INDIRECT: research-use exposure, not necessarily a live clinical output path. The import alone does not determine clinical risk; the classification tier does.
Direction 2: Not all clinical proximity is the same.
A live pharmacogenomics dosage tool and a README roadmap note about a future ClinVar integration are not equivalent risks. Differentiating them — live output vs. research context vs. planned feature — makes the evaluation more precise and makes the accountability expectations more appropriate.
Direction 3: Scoring stability is worth measuring directly.
We ran Stage 1 on one repository in multiple passes. The results ranged across 28 points on the same input. Overlapping trigger conditions between hype-detection items are one contributing factor.
LLM runtime stochasticity is another — the exact split between the two is still under measurement. Adding explicit discrimination examples — what exact phrasing triggers each item, what does not — makes the scoring surface cleaner and reduces the most obvious sources of variance.
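Measuring that stability is straightforward once the same input is scored repeatedly. The scores below are hypothetical numbers chosen to illustrate a 28-point spread, not the audit's actual values:

```python
import statistics

# Hypothetical Stage 1 scores from repeated passes on the same repository.
passes = [61, 74, 55, 70, 83, 66]

spread = max(passes) - min(passes)   # worst-case divergence across passes
stdev = statistics.stdev(passes)     # typical divergence

print(f"range={spread} points, stdev={stdev:.1f}")
```

Tracking both numbers over time is one way to verify whether the discrimination examples actually reduce variance.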
Direction 4: Code-path behavior deserves its own scan layer.
A fail-open pattern is a control path that appears to enforce a constraint but defaults to bypassing it. The BioAgents rate limiter above is the example. In a clinical output path, a silent pass-through is not graceful degradation. It is an untraced result that looks like a real one. Building a dedicated scan for these patterns adds a check that text auditing was never meant to provide.
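The fail-open shape looks roughly like this (illustrative names, not the actual BioAgents code):

```python
import os

# Hypothetical sketch of a fail-open control path: a guard exists,
# but the default configuration bypasses it.

USE_JOB_QUEUE = os.getenv("USE_JOB_QUEUE", "false").lower() == "true"

def enqueue(fn, args, kwargs):
    # Queue backend deliberately omitted from this sketch.
    raise NotImplementedError("job queue not configured")

def rate_limited_call(fn, *args, **kwargs):
    if not USE_JOB_QUEUE:
        # Fail-open default: the name promises rate limiting;
        # in default operation, none happens.
        return fn(*args, **kwargs)
    return enqueue(fn, args, kwargs)
```

The dangerous property is that the default path is the silent one: the guarded path must be explicitly opted into, so every deployment that never touches the config runs unprotected.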
These four directions came directly from running the audits. The scores across the 10 repositories remain as published. Code-path evaluation is what the framework can now add on top of them.
What v1.0.6 Added — Carried Forward in v1.1.2
These changes were introduced in v1.0.6 and are carried forward in the current internal v1.1.2 package. They extend the framework's evidence surface into code-level behavior. Calibration is ongoing.
1. Two Evidence Paths, Not One

We narrowed one of the biggest divergence points by splitting evaluation into a text path and a code path.
- The text path works as before: read the README, CHANGELOG, and public posts, score against the rubric. Always available regardless of access to the repository.
- The code path activates when the audit has a local clone. It runs through Claude Code, Codex CLI, Gemini CLI, or Copilot CLI. Claims are not interpreted. They are measured. A README that says "IRB-approved data" earns no points for the statement. Points require a provenance artifact in the code.
When code confirms the README, that is a positive signal. When it contradicts it, that contradiction is the finding.
2. Clinical Dependency Detection

At the start of every local audit, a scan script reads Python imports and README keywords. It classifies the result into one of three severity levels:
- CA-DIRECT: live, patient-facing clinical exposure
- CA-INDIRECT: research-use exposure through general medical libraries
- CA-PLANNED: roadmap mention without active implementation
Accountability requirements follow the actual clinical proximity of the code. Not the aspirational proximity of the roadmap. A roadmap mention without active implementation is treated as CA-PLANNED rather than collapsed into the same bucket as live clinical output.
The pattern matching is against import statements and function names, not comment text. False positive calibration is still in progress.
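A minimal sketch of import-level classification follows. The tier names come from the framework; the marker lists, the matching rules, and the CA-NONE fallback are assumptions for illustration:

```python
import ast

# Marker lists are illustrative assumptions, not the framework's rules.
DIRECT_MARKERS = ("cyp2d6", "cpic")       # pharmacogenomics signals
INDIRECT_MARKERS = ("pydicom", "monai")   # general medical-imaging libs

def classify_clinical_exposure(source: str, readme: str = "") -> str:
    """Classify a Python source file by its import statements."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.lower() for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.lower())
    joined = " ".join(imported)
    if any(m in joined for m in DIRECT_MARKERS):
        return "CA-DIRECT"
    if any(m in joined for m in INDIRECT_MARKERS):
        return "CA-INDIRECT"
    if "clinvar" in readme.lower():   # roadmap mention, no implementation
        return "CA-PLANNED"
    return "CA-NONE"                  # placeholder for "no exposure found"
```

Because the scan parses the AST rather than grepping raw text, a mention of `CYP2D6` in a comment or docstring does not trigger CA-DIRECT; only an actual import does.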
3. Code Integrity Scanning (C1-C4)

A second scan handles four code-level checks:
- hardcoded credentials (C1)
- unpinned dependencies (C2)
- clinical-path stubs (C3)
- fail-open exception handlers (C4).
The C4 check targets the BioAgents-style pattern. Searching for clinical keywords on the `except:` line misses most real cases. Clinical context lives in function names and surrounding code. The scan uses a two-pass approach: first identify files with clinical-domain context, then find silent exception handlers within those files.
A silent `except: pass` in a clinical-context file is a trust-surface failure. The scan makes it visible without requiring a reviewer to read every exception block manually.
4. Discrimination Examples
To reduce the 28-point variance, we added explicit examples for each hype-detection item: what exact phrasing triggers it, what does not, and what the documented edge cases are.
The goal is to reduce obvious scoring drift enough that the same repository is no longer interpreted as two different trust surfaces across different auditors or different LLMs. That goal is not yet verified. The discrimination examples are the primary mechanism toward it.
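The two-pass C4 scan described above can be sketched as follows. The clinical keyword list and helper names are assumptions; the actual scan's rules are not reproduced here:

```python
import ast

# Illustrative keyword hints; the framework's real list is larger.
CLINICAL_HINTS = ("dose", "dosage", "patient", "allele", "diagnos")

def has_clinical_context(source: str) -> bool:
    """Pass 1: does this file's code (function names, not comments)
    look clinical?"""
    names = [n.name.lower() for n in ast.walk(ast.parse(source))
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return any(h in name for name in names for h in CLINICAL_HINTS)

def silent_handlers(source: str) -> list[int]:
    """Pass 2: line numbers of except blocks whose body is only pass."""
    return [h.lineno for h in ast.walk(ast.parse(source))
            if isinstance(h, ast.ExceptHandler)
            and all(isinstance(s, ast.Pass) for s in h.body)]

def scan_c4(source: str) -> list[int]:
    """Flag silent handlers only in files with clinical context."""
    return silent_handlers(source) if has_clinical_context(source) else []
```

Running pass 2 only inside clinically flagged files is what keeps the check tractable: a silent handler in a logging utility is noise; the same handler next to `compute_dosage` is a finding.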
The Three Questions Now Have a Fourth
In the first post, we described three questions the framework was built around:
1. Did the repository describe its limits honestly?
2. Did public communication remain consistent with those limits?
3. Did the codebase show evidence of maintenance and biological responsibility?
Running 10 real audits pointed toward a fourth:
4. Does the code actually do what the documentation says — and where it diverges, is that divergence visible and traceable?
That fourth question is what the audit outputs kept surfacing. A function name that sounds real. A pipeline that looks complete. An output that is plausible. An implementation that is a stub, or a control path that silently bypasses its own constraint.
The first three questions can be answered by reading. The fourth requires looking at the code.
What the Framework Added — and What Stays the Same
The first post ended with this:
"STEM-AI is meant to support serious review, not replace it."
That has not changed. Every report carries a non-removable disclaimer: LLM-generated audit, not a regulatory determination, not clinical certification. Every report carries an expiry date. The minimum threshold for supervised pilot consideration is still T3. None of the March 2026 repositories reached it.
What the audits added is narrow in scope: broader evidence coverage on top of an already-working foundation.

A verifiable artifact shifts the accountability surface — it does not eliminate the possibility of falsification. The framework treats its presence as a necessary condition, not a sufficient one.
What the framework gained is that the evidence it counts now extends beyond what authors say about their own code.
Three Directions Still Ahead

Automated re-audit on repository changes. A score from three months ago may not describe the same repository. The trajectory signal measures issue close rate and release frequency across consecutive 90-day windows. It is a partial answer. A CI-triggered re-audit path is the logical next step.
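The trajectory signal reduces to simple window arithmetic. The data layout (`opened`/`closed` dates) is an assumption for illustration:

```python
from datetime import date, timedelta

def windows(start: date, end: date, days: int = 90):
    """Yield consecutive fixed-length windows covering [start, end)."""
    cur = start
    while cur < end:
        nxt = cur + timedelta(days=days)
        yield cur, min(nxt, end)
        cur = nxt

def close_rate(issues, lo: date, hi: date):
    """Fraction of issues opened in the window that closed within it."""
    opened = [i for i in issues if lo <= i["opened"] < hi]
    closed = [i for i in opened if i.get("closed") and i["closed"] < hi]
    return len(closed) / len(opened) if opened else None

# Hypothetical issue data for one repository.
issues = [
    {"opened": date(2026, 1, 10), "closed": date(2026, 2, 1)},
    {"opened": date(2026, 1, 15), "closed": None},
]
for lo, hi in windows(date(2026, 1, 1), date(2026, 7, 1)):
    print(lo, hi, close_rate(issues, lo, hi))
```

A CI-triggered re-audit would run the same computation on every push instead of on a fixed calendar.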
The denominator problem. Zero of 10 repositories reached T3. This may accurately describe the ecosystem's current state. It may also reflect calibration issues in the upper tiers. Distinguishing between the two requires before-and-after auditing of repositories that have received systematic governance remediation.
The Stage 2 redistribution question. Most audits have no cross-platform consistency data. When that data is unavailable, the framework redistributes Stage 2's weight equally between documentation quality and engineering accountability.
For repositories with clinical-direct exposure, a well-written README can then compensate for weak code accountability. A guardrail flags this condition. The current redistribution rule is explicit but not yet final — it remains one of the framework's open calibration questions.
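The redistribution rule itself is mechanical. The baseline percentages below are illustrative assumptions, not the framework's published weights:

```python
# Sketch of the Stage 2 redistribution rule: when cross-platform
# consistency data is unavailable, its weight is split equally between
# documentation quality and engineering accountability.

def stage2_weights(has_consistency_data: bool) -> dict[str, int]:
    # Baseline split in percent (illustrative numbers).
    weights = {"consistency": 20, "documentation": 40, "engineering": 40}
    if not has_consistency_data:
        half = weights.pop("consistency") // 2
        weights["documentation"] += half
        weights["engineering"] += half
    return weights
```

The guardrail condition falls out directly: under redistribution, documentation carries half the Stage 2 weight, which is exactly the situation where a polished README can mask weak code accountability in a clinical-direct repository.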
If there are open-source bio-AI repositories you think should be audited next, drop them in the comments. Bonus if they claim clinical relevance, drug discovery, or medical reasoning.
STEM-AI v1.1.2 — Trust Evaluation Framework for Medical AI Repositories.
"Code works. But does the author care about the patient?"