From Repo Scanner to Audit Architecture: What Changed in STEM BIO-AI Through v1.7.8

A technical look at how STEM BIO-AI v1.7.8 became less Python-shaped, more semantically stable, and more inspectable across real audit output surfaces.

Series: STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts (Part 8 of 8)
[Figure: From repo scanner to audit architecture: the evolution of STEM BIO-AI through v1.7.8. Three technical changes that made the scanner less Python-shaped, the warning model more stable, and the reports more inspectable.]
The last time I wrote about STEM BIO-AI, the focus was AIRI:
how a local repository scanner could expand its risk vocabulary without pretending to become a universal AI safety judge.
That was the right story for 1.7.0 and 1.7.1.
But the project changed meaningfully after that.
By 1.7.8, the interesting question was no longer just:
Can this scanner attach a broader risk language to local findings?
It became:
Can this scanner make those findings more inspectable, less misleading, and more robust across real repository shapes?
That shift matters.
Because in audit tooling, correctness is only the first battle. The second battle is whether a reviewer can see why the tool landed where it did, and whether the output still makes sense when it leaves the terminal and becomes a report, a PDF packet, a Hugging Face demo, or a governance memo.
From 1.7.6 through 1.7.8, three changes mattered most.
They changed:
  1. what counts as evidence,
  2. how warning lanes are separated,
  3. and how the final artifact stays legible across surfaces.
This is the more technical story behind those releases.

Basic AIRI (the AI Risk Repository) Context: Expanding the Language of Risk

Before getting into the release details, it helps to define what AIRI means in this series.
AIRI refers here to the MIT AI Risk Repository: a public AI risk resource from the MIT AI Risk Initiative that organizes fragmented AI risk language across research, policy, and industry sources.
The repository includes an AI Risk Database, a Causal Taxonomy of AI Risks, and a Domain Taxonomy of AI Risks. According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications, while the public domain taxonomy organizes risks into 7 domains and 24 subdomains.
That makes AIRI useful as a vocabulary source.
But vocabulary is not truth.
A local scanner should not say:
this repository caused this risk.
It should say something more careful:
this local finding belongs to a broader class of AI risk language.
That distinction is the design boundary.

1. Problem: The scanner was still too Python-shaped

[Figure: Universal dependency detection and provenance evidence across Python and JavaScript stacks]
One of the more useful failures in this line came from an uncomfortable result: a repository could obviously have dependency and lockfile evidence, and STEM BIO-AI could still miss it.
That is not a philosophical problem. That is an implementation problem.
In practice, the project was still too biased toward Python-native signals.
That showed up most clearly in JavaScript or mixed-stack repositories:
  • package.json
  • package-lock.json
  • pnpm-lock.yaml
  • yarn.lock
  • npm-shrinkwrap.json
were not being treated as first-class provenance and replication evidence in the same way that requirements.txt or pyproject.toml were.
The result was a false negative pattern:
  • Stage 3 provenance (B1) could be undercounted
  • Stage 4 replication evidence could be undercounted
  • and the report could quietly imply "no dependency evidence" when the repository clearly had dependency structure
That kind of miss is more dangerous than it sounds.
Not because it makes the score a little wrong.
But because it damages trust in the scanner's worldview.
If developers see a tool miss an obvious pnpm-lock.yaml, they stop believing the harder claims too.

What changed in 1.7.6

The fix was straightforward but important:
  • JavaScript manifests and lockfiles were promoted into the same evidence families as the existing Python manifests where appropriate.
Concretely, that meant:
  • B1_data_provenance_controls started recognizing JS manifest/lock surfaces
  • S4_environment_lock_evidence started recognizing them
  • S4_exact_dependency_pins_or_hashes started recognizing them
This was not a scoring philosophy change.
It was a scope correction.
The rule engine learned that a dependency ecosystem is a dependency ecosystem even when it is not Python.
One boundary matters here.
B1_data_provenance_controls does not suddenly mean "dataset lineage was proven by a lockfile."
In this lane, B1 is using dependency manifests as repository provenance surfaces:
  • what environment the repository expects,
  • what dependency custody the repository exposes,
  • and whether the repo surfaces any adjacent data-source, IRB, or dataset-citation language around that environment.
That is weaker than dataset lineage evidence.
But it is also much stronger than pretending a mixed-stack repository has no provenance surface at all.
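The scope correction can be sketched as a single lookup that treats Python and JavaScript files the same way. This is a minimal sketch, not the project's rule engine: the file names come from the lists above, while the function name `find_dependency_evidence`, the manifest/lockfile grouping, and the `node_modules` exclusion are assumptions made for illustration.

```python
from pathlib import Path

# File names are taken from the article; the ecosystem labels and the
# manifest/lockfile split are illustrative assumptions.
MANIFESTS = {
    "requirements.txt": "python",
    "pyproject.toml": "python",
    "package.json": "javascript",
}
LOCKFILES = {
    "package-lock.json": "javascript",
    "pnpm-lock.yaml": "javascript",
    "yarn.lock": "javascript",
    "npm-shrinkwrap.json": "javascript",
}


def find_dependency_evidence(repo_root):
    """Return manifest and lockfile paths under repo_root, grouped by ecosystem."""
    root = Path(repo_root)
    evidence = {"manifests": {}, "lockfiles": {}}
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Vendored packages carry their own manifests; they are not
        # evidence about the repository itself (an assumption of this sketch).
        if "node_modules" in rel.parts or not path.is_file():
            continue
        for kind, table in (("manifests", MANIFESTS), ("lockfiles", LOCKFILES)):
            ecosystem = table.get(path.name)
            if ecosystem:
                evidence[kind].setdefault(ecosystem, []).append(rel.as_posix())
    return evidence
```

Keeping both ecosystems in the same tables is the point: a JavaScript lockfile reaches the same evidence family as a Python manifest without a separate code path.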

A small before/after that makes the point

The yorkeccak/bio case is a good example because the score movement was not philosophical. It was mechanical.
Before the JS manifest fix, the same repository could produce:
After the manifest and lockfile correction, the same repository shape produced:
The important part is not the score delta by itself.
One small boundary is worth making explicit here.
The AIRI change is doing two things at once:
  • the denominator moved from 31 to 32 because the governed AIRI detector-scope expanded by one mapping row across this release line,
  • and the numerator moved from 0 to 7 because the current release can now carry more bounded AIRI links around the findings it actually surfaced.
That explains the AIRI coverage delta.
The scoring delta came from a more mechanical correction:
  • package.json, package-lock.json, and pnpm-lock.yaml stopped being invisible,
  • Stage 3 stopped saying "no dependency/provenance manifest detected,"
  • and Stage 4 stopped undercounting replication structure that was obviously there.
That is what I mean by "blind spot removal" rather than score drift.
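The two AIRI movements described above can be kept apart with a few lines of arithmetic. This is a small sketch using the counts stated in this section (0 of 31 before, 7 of 32 after); the `AiriCoverage` class and `explain_delta` helper are hypothetical names, not part of STEM BIO-AI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AiriCoverage:
    covered: int  # AIRI rows linked to surfaced findings
    scope: int    # AIRI rows the detectors are governed to map to

    def __post_init__(self):
        if self.scope <= 0 or not 0 <= self.covered <= self.scope:
            raise ValueError("covered must be between 0 and a positive scope")

    @property
    def ratio(self):
        return self.covered / self.scope

    def label(self):
        return f"{self.covered} / {self.scope}"


def explain_delta(before, after):
    """Separate scope growth (denominator) from new links (numerator)."""
    return {
        "scope_change": after.scope - before.scope,
        "covered_change": after.covered - before.covered,
    }


before = AiriCoverage(covered=0, scope=31)
after = AiriCoverage(covered=7, scope=32)
print(before.label(), "->", after.label())
print(explain_delta(before, after))
```

Reporting the two changes separately keeps a reader from attributing the whole coverage delta to the repository when one part of it is the detector scope growing by a row.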

Why that matters

This is the kind of change that sounds small in a changelog but large in practice.
Because it changes the relationship between the tool and the developer reading it.
A scanner earns the right to say "this repo is weak on provenance" only after it can correctly see the basic surfaces that exist in the target stack.
That correction also made later report outputs more believable.
When B1 moved from 0 to 15 in affected repositories, that was not "score drift." It was the removal of a blind spot.
And that distinction is exactly why audit tools need explicit versioned rationale.
Without it, every score movement looks arbitrary.

2. Problem: The warning lanes were doing too many jobs at once

[Figure: Dedicated warning lanes in STEM BIO-AI showing C4, C5, and C6 semantic separation]
Before the split, it helps to read C1–C6 as code-integrity lanes.
They are not general AI risk categories. They are reviewer-facing signals that tell you what kind of repository weakness the scanner found, and where to inspect next.
| Lane | What it means in STEM BIO-AI | What a reviewer should inspect |
|------|------------------------------|--------------------------------|
| C1 | Hardcoded credential signals | exposed API keys, cloud keys, tokens, or credential-like patterns |
| C2 | Dependency pinning and external-service fragility | loose dependency ranges, missing exact pins, fragile external service assumptions |
| C3 | Deprecated patient-adjacent paths | legacy, archive, or deprecated folders that still contain patient or clinical-adjacent patterns |
| C4 | Fail-open exception handling | except: pass, except Exception: pass, silent fallbacks, or code paths where errors can disappear |
| C5 | Compliance and clinical-boundary integrity | unsupported HIPAA, compliance, clinical-safe, self-hosted, or regulatory-adjacent claims |
| C6 | Mock-auth or no-auth local/self-host trust boundaries | auto-login, mock authentication, no-auth flows, or weak local trust-boundary assumptions |
That table matters because C4, C5, and C6 are not interchangeable.
A fail-open exception is not the same problem as an unsupported compliance claim.
And an unsupported compliance claim is not the same problem as a mock-auth self-host boundary.
That distinction became important once the report started surfacing more nuanced governance signals.
The old C4 lane had started life as a code-oriented fail-open/exception surface.
But as the scanner got better at spotting unsupported compliance language and boundary failures, more and more signals were being interpreted near that same lane.
That made the result harder to read.
If a reviewer sees:
C4_exception_handling_clinical_adjacent_paths: WARN
they should be able to infer the remediation class immediately.
They should know to inspect executable control flow.
They should not have to wonder whether the warning is actually about a README compliance claim, a missing clinical boundary, or a mock-auth local path.
Once one lane starts carrying all of those meanings, the ID stops doing its job.
This is a common failure mode in rule systems.
At first it feels efficient:
  • one warning lane,
  • one bucket,
  • multiple related issues.
Then a few releases later the bucket becomes a junk drawer.
That is exactly what had to be prevented here.
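To make "executable control flow" concrete, here is a minimal sketch of how the two C4 patterns named in the table (`except: pass` and `except Exception: pass`) could be found with Python's standard `ast` module. It is not the project's detector: the function name is hypothetical, it ignores the clinical-adjacent path filtering the lane ID implies, and it does not cover silent fallbacks.

```python
import ast


def find_fail_open_handlers(source, filename="<unknown>"):
    """Return (line, description) pairs for broad handlers whose body is only `pass`."""
    tree = ast.parse(source, filename=filename)
    findings = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A handler is treated as fail-open only if it does nothing at all.
        if not all(isinstance(stmt, ast.Pass) for stmt in node.body):
            continue
        if node.type is None:
            caught = "bare except"
        elif isinstance(node.type, ast.Name) and node.type.id in {"Exception", "BaseException"}:
            caught = f"except {node.type.id}"
        else:
            # Narrow handlers such as `except KeyError: pass` are out of
            # scope for this sketch.
            continue
        findings.append((node.lineno, f"{caught}: pass"))
    return sorted(findings)
```

A check like this only ever looks at code, which is the argument for keeping README claims and auth-flow signals out of the same lane.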

What changed in 1.7.7 and 1.7.8

The solution was to split the lane cleanly:
  • C4 stayed reserved for executable fail-open exception behavior
  • C5 was introduced for unsupported compliance or boundary-integrity claims
  • C6 was introduced for mock-auth, auto-login, or no-auth self-host/local trust-boundary signals
This was more than renaming.
It made the model of the problem cleaner:
  • C4 is code-path failure semantics
  • C5 is governance/claim integrity
  • C6 is trust-boundary collapse in local or self-host flows
That distinction matters to developers because those are different remediation classes.
If a repository triggers C4, you inspect executable control flow. If it triggers C5, you inspect public claim surfaces and supporting governance evidence. If it triggers C6, you inspect local auth and trust-boundary design.
One warning label should not try to be all three.
The more interesting case is when two of those lanes fire together.
A repository can claim something like "HIPAA-ready self-hosting" at the README layer and also expose a mock-auth or auto-login local path.
That is not one problem.
It is two related problems:
  • C5 says the claim surface is overstating governance integrity
  • C6 says the local trust boundary is weaker than the claim suggests
That is exactly why the split matters.
If those two findings collapse into one bucket, the reviewer loses both remediation clarity and causal ordering.
If they stay separate, the report can say:
  1. the public claim is weak,
  2. the local boundary is weak,
  3. and both together make the repository easier to over-trust.
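The one-lane-one-remediation idea, including the case where C5 and C6 fire together, can be sketched as a small lookup. The lane descriptions and inspection targets are quoted from this section; the `Lane` enum, the `INSPECT` table, and `review_notes` are illustrative names, not the scanner's API.

```python
from enum import Enum


class Lane(Enum):
    C4 = "executable fail-open exception behavior"
    C5 = "unsupported compliance or boundary-integrity claims"
    C6 = "mock-auth, auto-login, or no-auth self-host/local trust-boundary signals"


# Each lane points at exactly one remediation class.
INSPECT = {
    Lane.C4: "executable control flow",
    Lane.C5: "public claim surfaces and supporting governance evidence",
    Lane.C6: "local auth and trust-boundary design",
}


def review_notes(fired):
    """Turn a set of fired lanes into ordered reviewer notes."""
    notes = [
        f"{lane.name}: inspect {INSPECT[lane]}"
        for lane in sorted(fired, key=lambda lane: lane.name)
    ]
    # The combined note is only possible because the lanes stayed separate.
    if Lane.C5 in fired and Lane.C6 in fired:
        notes.append(
            "C5+C6: the claim and the local boundary are both weak, "
            "which makes the repository easier to over-trust"
        )
    return notes


for note in review_notes({Lane.C5, Lane.C6}):
    print(note)
```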

The code insight

This is one of those places where good audit tooling starts looking more like good static analysis design.
A useful warning family is not just one that catches things.
It is one that stays semantically stable across releases.
That is why this split mattered:
it was not just about improving recall.
It was about preserving interpretability under growth.
Once a detector ID becomes ambiguous, your historical comparisons become weaker.
And once historical comparisons become weaker, your audit system starts losing its memory.
That is a bigger problem than one missed warning.

3. Problem: The report could still be correct and yet hard to trust

A repository scanner does not end its life in JSON.
It ends up in:
  • Markdown
  • HTML
  • PDF
  • demos
  • governance reviews
  • screenshots
  • and social arguments
That means the output architecture matters almost as much as the scoring logic.
And there were two places where this became obvious.

First: AIRI numbers needed explanation, not just display

Earlier versions could show AIRI coverage as a count, but not always make it obvious why a covered risk appeared.
That is a problem.
Because a number like 7 / 32 looks precise.
But precision without causal explanation is fragile.
Developers do not just want to know that a risk mapped. They want to know:
  • which detector triggered it,
  • why that detector maps to that AIRI risk,
  • and what boundary still remains around that mapping.
So the AIRI layer had to become more explicit.
That is where mapping_details mattered.
Covered AIRI rows now carry bounded reasoning objects that can say, in effect:
  • detector ID
  • mapping justification
  • trigger reason
That is a much stronger artifact than a bare coverage count.
It turns AIRI from a visual add-on into an inspectable vocabulary layer.
In practice the object now looks more like this:
That matters because the AIRI layer no longer asks the reviewer to trust a number alone.
It now gives the reviewer a bounded reasoning object to inspect.
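As a rough sketch of what a bounded reasoning object could carry, the three fields listed above can be modeled as a small record attached to each covered row. Everything here is an assumption for illustration: the class name, the field names, the row layout, and the placeholder risk label are hypothetical and should not be read as the actual `mapping_details` schema. Only the detector ID is taken from this article.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MappingDetail:
    detector_id: str            # which detector triggered the mapping
    mapping_justification: str  # why that detector maps to this AIRI risk
    trigger_reason: str         # what in the repository fired the detector


def covered_row(risk_label, details):
    """Build a covered AIRI row; a row with no reasoning is rejected."""
    if not details:
        raise ValueError("a covered row needs at least one mapping detail")
    return {
        "risk": risk_label,
        "covered": True,
        "mapping_details": [asdict(d) for d in details],
    }


row = covered_row(
    "<AIRI risk label>",  # placeholder, not a real AIRI entry
    [
        MappingDetail(
            detector_id="C4_exception_handling_clinical_adjacent_paths",
            mapping_justification="silent error handling belongs to a broader class of reliability risk language",
            trigger_reason="a broad handler with an empty body was found",
        )
    ],
)
print(json.dumps(row, indent=2))
```

Rejecting a covered row that has no detail is the design choice worth noting: it makes a bare count impossible to emit.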

Second: The packets themselves needed re-architecture

[Figure: Artifact architecture showing brief, standard, and full evidence packet tiers across output surfaces]
The PDF tiers had also drifted into an awkward shape.
The old packet boundaries were no longer matching the actual content density:
  • Stage 4 could disappear or feel collapsed
  • the closeout pages could become overcrowded
  • and "5-page detailed packet" could stop meaning what users expected
That led to a cleaner packet model:
  • level 1 = brief 1p
  • level 2 = standard 5p
  • level 3 = full 7p
And just as importantly:
  • the default CLI path moved to level 3
That default is more than a convenience. It is a statement about what the project now considers the normal artifact.
The normal artifact is no longer the brief scan. It is the full evidence packet.
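The tier model is small enough to write down directly. This is a sketch of the three levels and the level 3 default as described above; the enum, the `PAGES` table, and `resolve_level` are hypothetical names and say nothing about how the real CLI parses its options.

```python
from enum import IntEnum


class PacketLevel(IntEnum):
    BRIEF = 1
    STANDARD = 2
    FULL = 3


# Page counts as stated in the article.
PAGES = {
    PacketLevel.BRIEF: 1,
    PacketLevel.STANDARD: 5,
    PacketLevel.FULL: 7,
}

# The default is the full evidence packet, not the brief scan.
DEFAULT_LEVEL = PacketLevel.FULL


def resolve_level(requested=None):
    """Return the packet level to render; unknown levels raise ValueError."""
    if requested is None:
        return DEFAULT_LEVEL
    return PacketLevel(requested)


level = resolve_level()
print(level.name, PAGES[level])
```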

Why that matters

This is where the project moved from "scanner" toward "audit architecture."
A scanner can stop at a result.
An audit architecture has to preserve meaning across surfaces.
That means:
  • JSON must be canonical
  • HTML must be navigable
  • PDFs must honor real packet boundaries
  • and the same warning semantics must survive in all of them
That is why these changes matter to developers.
They are part of the correctness story.
If the why disappears when the result becomes a report, the audit object was never complete to begin with.
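One way to test that meaning survives across surfaces is to render every surface from the same canonical object and then check that each warning ID and its reason still appear in the output. The sketch below assumes a simplified result shape and two toy renderers; none of these names or structures are taken from STEM BIO-AI.

```python
import html


def render_markdown(result):
    lines = [f"# Audit result for {result['repo']}"]
    for w in result["warnings"]:
        lines.append(f"- {w['id']}: {w['status']} ({w['reason']})")
    return "\n".join(lines)


def render_html(result):
    items = "".join(
        f"<li><code>{html.escape(w['id'])}</code>: "
        f"{html.escape(w['status'])} ({html.escape(w['reason'])})</li>"
        for w in result["warnings"]
    )
    return f"<h1>Audit result for {html.escape(result['repo'])}</h1><ul>{items}</ul>"


def lost_in_translation(result, rendered):
    """Return warning IDs whose ID or reason is missing from a rendered surface."""
    return [
        w["id"]
        for w in result["warnings"]
        if w["id"] not in rendered or w["reason"] not in rendered
    ]


# Hypothetical canonical result; the ID is the one quoted earlier in the article.
canonical = {
    "repo": "example/repo",
    "warnings": [
        {
            "id": "C4_exception_handling_clinical_adjacent_paths",
            "status": "WARN",
            "reason": "broad handler with empty body",
        }
    ],
}

for surface in (render_markdown(canonical), render_html(canonical)):
    print(lost_in_translation(canonical, surface))
```

The check is deliberately crude. Its value is that it fails loudly when a renderer drops the reason and keeps only the score.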

The hidden pattern behind all three changes

These releases can look like a mixed bag:
  • JS manifest support
  • legal/compliance claim surfacing
  • external dependency risk
  • C4/C5/C6 split
  • AIRI reasoning
  • packet restructuring
  • demo/output alignment
But there is a single pattern underneath them:
the system became less willing to let ambiguity hide inside a convenient surface.
That showed up in three ways:
  1. a manifest should count if it exists
  2. a warning lane should mean one thing
  3. a risk mapping should explain itself
That may sound almost obvious.
But a lot of tools never make it that far.
They accumulate clever features faster than they reduce ambiguity.
This line of work did the opposite.
It made the system stricter about what its outputs are allowed to imply. That is a more durable path.

The more interesting lesson

The most useful thing about 1.7.6 through 1.7.8 is not that STEM BIO-AI became "smarter."
It is that it became harder to misread.
That is a better goal for audit tooling.
Especially now.
Because in a world increasingly full of fluent agent outputs, the differentiator is not whether a tool can generate a plausible narrative.
It is whether the narrative stays tethered to inspectable structure when the repository is messy, cross-stack, overclaimed, or partially misleading.
That is where this release line got better.
Not by pretending to know more than it does.
But by making its own boundaries clearer.

What I would tell developers evaluating this line

If you only look at the release notes, you might think:
  • better AIRI
  • more warnings
  • nicer reports
That is true, but too shallow.
The real changes are:
  • the scanner is less Python-centric than it was
  • the warning taxonomy is more semantically stable than it was
  • the artifacts are more inspectable than they were
That combination matters more than any one score change.
It means the tool is becoming less of a clever repo grader and more of a reliable evidence instrument.
That is the direction I care about.
Because once the repository is politically messy, clinically adjacent, or governance-sensitive, "good-enough automation" is not enough.
The system has to show its work.
These versions got noticeably better at doing that.
[Figure: A Reliable Evidence Instrument for the Messy Reality]

Try It

If you want the full packet explicitly:
The default path now lands on the full evidence packet, and that is the point.
In audit tooling, the serious path should not require an extra flag.

See the Artifact

If you want to inspect the actual artifact shape behind this release line, these two public outputs are the best reference:
stem-bio-ai report
The point of 1.7.8 is not just that the scanner scores the repository differently.
It is that the same result now survives translation into JSON, Markdown, HTML, and a full review packet without losing too much meaning along the way.
