
From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System
Bio/medical AI trust should not collapse into one score. STEM BIO-AI v1.6.2 shows how deterministic auditing, evidence-led diagnostics, regulatory traceability, and bounded AI advisory can become an inspectable local workflow.
Series: STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts · Part 5 of 5

Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.
That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the previous post in this series covers it, and the series page collects the broader arc.
But after that, a different engineering problem took over.
The audit logic was stricter.
The reports were richer.
The reasoning was more bounded.
But the developer workflow still felt too loose.
So the next question was no longer:
How do I score trust?
It became:
How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?
The answer turned out to be less about seeing more signals and more about refusing to confuse them.
That is the core argument of this post:
A detector becomes more trustworthy when it is strict about what it cannot conclude.
Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.
The problem was no longer scoring

By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.
The bottleneck was operational clarity.
A trust audit tool is not very useful if:
- the normal path is one long command with too many flags
- CI has to reverse-engineer the result from human-readable stdout
- bio-specific diagnostics are mixed directly into the same surface as formal scoring
- regulatory relevance shows up as vague implication instead of explicit traceability
- advisory AI is present, but its relationship to the official score is unclear
At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.
That is a different class of problem.
The CLI had to reflect operator intent
The earlier CLI was functional, but too flat.
You could reach scanning, gating, advisory export, and response validation from a single entry point by stacking flags, and all of it worked.
The issue was that it treated very different operator intents as one long option surface.
In practice, these are separate workflows:
- scan a repository and generate artifacts
- enforce a gate in CI/CD
- export a bounded advisory packet
- validate a downstream provider response
- cross an explicit provider-call boundary
So I refactored the CLI around workflows instead of flag accumulation: each of those intents now has its own command path.
The older paths still exist for compatibility, but they are no longer the conceptual center.
That matters more than it sounds.
Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.
This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.
Repository trust needed four separate lanes

This was the biggest architectural shift.
I stopped treating repository trust as one object.
In practice, it needed four separate lanes:
- deterministic structural scoring
- deterministic diagnostics
- regulatory traceability
- optional AI advisory
If all of those collapse into one final confidence score, the tool becomes harder to reason about.
The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.
Some evidence should change the score.
Some evidence should only raise review priority.
Some evidence should support traceability.
Some evidence should be handed to a human or advisory system.
The maturity of the tool is not that it sees all of them.
The maturity is that it does not confuse them.
This separation is not just conceptual. It exists in the code path.
One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?
In STEM BIO-AI, the answer is visible in the execution order.
The scanner computes the formal score first. In the result object, that means keys like:
- Stage 1
- Stage 2R
- Stage 3
- risk penalty
- score cap
- final_score
- formal_tier
Only after that does it append the non-scoring layers, again as explicit result keys:
- regulatory_basis
- stage_traceability
- regulatory_traceability
- reasoning_model
- ai_advisory (optional)
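To make that concrete, here is a minimal sketch of the build order, assuming placeholder values. The key names follow the list above; everything else is illustrative rather than the actual scanner implementation.

```python
# Illustrative sketch of the build order, not the project's actual scanner code.
# Key names follow the result object described above; values are placeholders.

def build_result(repo_path: str) -> dict:
    # 1. The deterministic scoring pass runs first and fully determines
    #    the formal outcome.
    result = {
        "final_score": 67,       # placeholder value
        "formal_tier": "T2",     # derived only from final_score
    }

    # 2. Non-scoring layers are appended afterwards as separate keys.
    #    Nothing in this step reads or rewrites the two fields above.
    result["regulatory_basis"] = {}          # registry-driven mapping metadata
    result["stage_traceability"] = {}
    result["regulatory_traceability"] = {}
    result["reasoning_model"] = {}
    result["evidence_ledger"] = []           # diagnostics lane, evidence-only
    result["ai_advisory"] = None             # optional advisory lane

    return result
```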
That ordering matters.
The score is not derived from the advisory lane.
The regulatory mapping does not mutate the formal tier.
The diagnostics lane can emit evidence without becoming a hidden score multiplier.
This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.
That execution order is the architectural reason the next four sections exist.
Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.
Put differently, the next four sections answer four different questions:
- what is allowed to change the formal tier
- what is useful enough to emit, but not yet mature enough to score
- what can support regulatory review without pretending to be compliance
- what can involve AI without letting AI become the scoring authority
1. Deterministic structural scoring

This remains the official score and tier.
It measures the main repository-visible signals:
- README evidence
- repo-local consistency
- code and bio responsibility
- dependency hygiene
- changelog and provenance surfaces
- code-integrity patterns
This lane is local, deterministic, and machine-checkable.
That is the part that can legitimately drive a formal triage tier.
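As a rough sketch of what local, deterministic, and machine-checkable means in practice, the pass below scores a few boolean repository checks, applies a risk penalty and a score cap, and maps the result onto the published T0-T4 ranges. The check names, weights, and penalty handling are invented for illustration; only the tier boundaries come from the published scale.

```python
# Hypothetical sketch of a deterministic scoring pass.
# Check names, weights, and penalty values are illustrative only;
# the T0-T4 boundaries match the published triage scale.

CHECKS = {                      # weight per repository-visible signal
    "readme_evidence": 25,
    "repo_local_consistency": 20,
    "dependency_hygiene": 20,
    "changelog_provenance": 20,
    "code_integrity_patterns": 15,
}

def score_repo(observed: dict[str, bool], risk_penalty: int = 0,
               score_cap: int = 100) -> tuple[int, str]:
    # Every input is a repository-visible, machine-checkable observation.
    base = sum(w for name, w in CHECKS.items() if observed.get(name, False))
    final_score = max(0, min(base - risk_penalty, score_cap))

    if final_score >= 85:
        tier = "T4"
    elif final_score >= 70:
        tier = "T3"
    elif final_score >= 55:
        tier = "T2"
    elif final_score >= 40:
        tier = "T1"
    else:
        tier = "T0"
    return final_score, tier
```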
I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.
I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.
2. Deterministic diagnostics
This is where the deterministic diagnostics spec became important.
I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.
That is what docs/DETERMINISTIC_DIAGNOSTICS.md defines. It separates the diagnostic problem into two lanes:
- Lane A: deterministic local diagnostics
- Lane B: optional AI-assisted semantic review
That separation is central.
The deterministic lane is authoritative for hard findings.
The AI lane is advisory only.
The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:
- malformed or suspicious SMILES-like outputs
- missing parser guards
- silent mock or simulated-data fallbacks
- risky subprocess construction around bio tools
- traceability manifest surfaces
The point was not to create a “bio slop detector” with a catchy label.
The point was to create a local evidence lane that could say:
- here is the file
- here is the line
- here is the snippet
- here is the bounded interpretation
That is much more useful than a vague semantic warning.
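Here is a hypothetical sketch of that kind of line-level evidence record. The detector below flags quoted, low-entropy SMILES-like placeholder literals; the pattern, the field names, and the wording of the interpretation are illustrative, not the project's actual detector or ledger schema.

```python
import re
from pathlib import Path

# Hypothetical evidence-only detector sketch, not the project's actual code.
# It flags string literals that look like low-entropy SMILES placeholders
# (a single atom symbol repeated many times) and emits a line-level record.

PLACEHOLDER = re.compile(r"""["']([A-Za-z])\1{7,}["']""")   # e.g. "CCCCCCCC"

def scan_file(path: Path) -> list[dict]:
    records = []
    for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
        if PLACEHOLDER.search(line):
            records.append({
                "file": str(path),
                "line": lineno,
                "snippet": line.strip(),
                # The interpretation stays bounded: it describes the surface
                # pattern, not chemistry, efficacy, or intent.
                "interpretation": "low-entropy SMILES-like placeholder literal",
            })
    return records
```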
Why diagnostics stayed evidence-only

This was one of the harder engineering decisions.
It would have been easy to push every new bio-specific detector directly into the final score.
I did not do that.
The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice:
- findings are emitted as line-level records in the result object's evidence_ledger
- findings appear in the Markdown report and in the explain trace
- findings do not change final_score or formal_tier
That is the right default.
For example, the SMILES lane can be very useful for detecting:
- malformed surface strings
- low-entropy placeholders
- repeated trivial outputs
- missing parser guards
But it does not prove:
- medicinal usefulness
- synthetic feasibility
- binding plausibility
- biological efficacy
- full chemical validity in every edge case
That boundary is important.
A detector becomes more trustworthy when it is strict about what it cannot conclude.
Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.
3. Regulatory traceability

The second document that became central was docs/REGULATORY_MAPPING.md. This solved a different problem.
Once you audit clinical-adjacent repositories, people naturally ask:
- does this align with EU AI Act themes?
- does this help with FDA-oriented review?
- is there anything relevant to IMDRF or SaMD evidence families?
The wrong answer would be to turn those questions into a fake compliance score.
So I did the opposite.
The regulatory layer is explicitly framed as:
a traceability aid, not a compliance verdict
That document maps observed evidence classes to requirement families with bounded confidence labels like:
- strong
- moderate
- weak-moderate
- weak
- not assessed
And it makes an important distinction:
the confidence applies to the mapping relationship, not to legal acceptability.
Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.
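As an illustration only, such rule-level entries could be written down as data roughly like this. The structure and field names are invented; the two example judgments mirror the ones just described, and the confidence still applies to the mapping relationship, not to legal acceptability.

```python
# Hypothetical representation of rule-level mapping entries, for illustration only.
# The real registry lives in docs/REGULATORY_MAPPING.md and is not reproduced here.

REGULATORY_MAPPING = [
    {
        "evidence_class": "changelog / checksum / config-manifest surfaces",
        "requirement_family": "EU AI Act Article 12 (record-keeping / traceability)",
        "mapping_confidence": "moderate",   # fixed rule-level judgment, not a model output
        "note": "traceability scaffolding signal; not deploy-time event logging",
    },
    {
        "evidence_class": "human-oversight interface signals (override controls)",
        "requirement_family": "human-oversight review",
        "mapping_confidence": "weak",       # interface presence is not oversight procedure
        "note": "supports review focus only; no claim about operational oversight",
    },
]
```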
That means the tool can say things like:
- versioned manifests and changelogs may support record-keeping / traceability review
- intended-use and disclaimer sections may support transparency scaffolding review
- override interfaces may support human-oversight interface review
- subgroup measurement language may support weak evidence of data-governance intent
without claiming:
- legal compliance
- regulatory clearance
- clinical certification
- deployer conformance
In a regulated domain, traceability is useful only when it does not pretend to be permission.
A concrete example: why Article 12 is traceability, not compliance
The best example here is EU AI Act Article 12 style traceability.
The regulatory mapping layer treats signals like:
- changelogs
- checksum manifests
- versioned config surfaces
- audit-log schema fragments
- decision-event or override-event schema tokens
as evidence that a repository may have traceability scaffolding.
That is useful.
It is also bounded.
The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.
So the output can legitimately say:
- there is structural evidence relevant to traceability review
while refusing to say:
- this system satisfies traceability obligations
That is exactly the kind of distinction I wanted this lane to enforce.
What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.
Why regulatory mapping stayed subordinate to evidence
This was non-negotiable.
Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.
That is why the output shape separates regulatory_basis, stage_traceability, and regulatory_traceability from the actual score computation.
And it is not just decorative structure.
The regulatory basis object is registry-driven. It can mark review_required when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.
This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.
If a repo has traceability-relevant scaffolding, that is useful.
If a repo has traceability-relevant scaffolding, that is still not compliance.
The distinction has to remain visible in both the code and the artifacts.
4. Optional AI advisory

The fourth lane is the advisory layer.
This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.
That means workflows like exporting a provider-neutral advisory packet and validating the downstream response can exist without creating ambiguity about who owns the formal result.
The advisory layer can:
- export a provider-neutral packet
- validate downstream response structure
- enforce finding-ID citation rules
- reject prohibited claims
- surface runtime and secret boundaries
What it cannot do is silently override:
- score.final_score
- score.formal_tier
How that rule is actually enforced
This is not just policy language in the README.
The advisory validator explicitly checks for score-override attempts. If a response includes fields like final_score, formal_tier, replication_score, or replication_tier, or sets final_score_override, the response is marked invalid with final_score_override_requested.
The packet contract also exports the rule in plain language: do not modify or override final_score, formal_tier, replication_score, or replication_tier.
And provider responses must cite exact values from allowed_finding_ids; citation strings are not repaired or loosely matched later.
So the advisory lane is bounded in two ways:
- it has no authority to change the deterministic result
- it cannot cite evidence outside the bounded packet
That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.
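A simplified sketch of that kind of check, assuming the provider response has already been parsed into a dict. The function name and the cited_findings key are illustrative; the forbidden fields, the final_score_override_requested label, and the exact-match rule against allowed_finding_ids follow the behavior described above.

```python
# Simplified sketch of the two advisory boundaries described above.
# Not the project's actual validator; helper names are illustrative.

FORBIDDEN_FIELDS = {
    "final_score", "formal_tier", "replication_score", "replication_tier",
    "final_score_override",
}

def validate_advisory_response(response: dict, allowed_finding_ids: set[str]) -> list[str]:
    errors = []

    # Boundary 1: no authority over the deterministic result.
    if FORBIDDEN_FIELDS & response.keys():
        errors.append("final_score_override_requested")

    # Boundary 2: citations must be exact values from the bounded packet.
    for finding_id in response.get("cited_findings", []):
        if finding_id not in allowed_finding_ids:   # no repair, no fuzzy matching
            errors.append(f"uncited_or_unknown_finding: {finding_id}")

    return errors   # an empty list means the response stayed within its lane
```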
What operational use looks like now

Once these lanes were separated, the CLI became much easier to reason about.
Each of these operator intents now has its own command path:
- local engineering review
- a CI/CD gate (a minimal external gate sketch follows below)
- offline advisory packet generation
- downstream provider response validation
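As one concrete example of the gate intent, the sketch below consumes the machine-readable JSON artifact and fails a pipeline below a chosen tier. It is an external illustration, not the tool's own gate command; the artifact path, the minimum-tier policy, and the tolerance for labels like T2 Caution are assumptions, while final_score and formal_tier are the result keys described earlier.

```python
# Illustrative CI gate over the JSON artifact, not the tool's built-in gate path.
# The artifact location and the minimum-tier policy are assumptions.
import json
import sys

TIER_ORDER = ["T0", "T1", "T2", "T3", "T4"]

def gate(artifact_path: str, minimum_tier: str = "T2") -> int:
    with open(artifact_path) as fh:
        result = json.load(fh)

    score = result["final_score"]
    # Tolerate tier labels such as "T2 Caution" by keeping only the tier code.
    tier = str(result["formal_tier"]).split()[0]

    if TIER_ORDER.index(tier) < TIER_ORDER.index(minimum_tier):
        print(f"FAIL: {score}/100 ({tier}) is below the {minimum_tier} gate")
        return 1
    print(f"PASS: {score}/100 ({tier})")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```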
The important point is not just that these commands exist.
It is that each one represents a distinct trust boundary.
That made the project feel more like engineering infrastructure and less like a scoring demo.
A real v1.6.2 packet
To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of ClawBio, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.
The run used the standard scan path against the local clone. On my machine, it took about 9.4 seconds and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.
Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:
- T0 = 0-39
- T1 = 40-54
- T2 = 55-69
- T3 = 70-84
- T4 = 85-100
Stage 4 replication is reported separately as its own lane, where R2 means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.
Governance note: this is not a "bad repository" scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.
With that in mind, the result was:
- 67 / 100
- T2 Caution
- Replication lane: 55 / 100 (R2)
- Clinical adjacency: CA-DIRECT (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)
- Code integrity warnings: C2 dependency pinning, C4 exception handling
This is exactly the workflow shift I wanted the tool to support.
The same deterministic scan is rendered into multiple operator surfaces:
- JSON for automation
- Markdown for review
- PDF for human-facing packet inspection
- explain trace for file / line / snippet proof tracing
That output shape is only possible because the result object already separates:
- formal score and tier
- replication lane
- diagnostics lane
- regulatory traceability
- advisory boundary state
In other words, the PDF is not a separate product. It is a view over the same bounded audit object.
Two details from this run are worth calling out.
First, the scanner did not manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:
- SMILES Surface Integrity: not_detected
- SMILES RDKit Validation: not_applicable
- SMILES Parser Guard: not_detected
That is the behavior I want. If a detector has no evidence, it should stay silent instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.
Second, the score is strict about observable repository conventions. ClawBio uses ClawBio_README_Repo.md rather than a root README.md, so the scan records S1_missing_readme: -20. A human reviewer might decide that this is acceptable contextually. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.
That distinction matters. A T2 Caution result here does not mean "ClawBio is unsafe." It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a stricter-than-human README convention check.
And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.
What still has to stay bounded
The system is better than it was, but there are still obvious next steps.
1. The public surface is broad
There is now:
- scoring
- diagnostics
- replication
- advisory packeting
- regulatory traceability
- JSON / Markdown / PDF / explain outputs
That is useful, but it increases onboarding cost.
The CLI is clearer now, but the broader public surface has to stay disciplined.
2. The deterministic diagnostics lane is still missing a published calibration threshold
The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.
Right now the rule is conceptually clear:
- commit-pinned fixtures
- reproducible detector output
- explicit false-positive review
But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.
3. The regulatory confidence labels are rule-authored, not empirically validated
The mapping labels like strong, moderate, and weak-moderate are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.
That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.
Try it yourself
STEM BIO-AI is Apache 2.0 and fully open source.
If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.
That is the real test.
- License: Apache 2.0
Final thought
The earlier STEM-AI posts were about why repository trust deserves its own audit layer.
This phase was about something more practical:
what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?
For me, the answer was simple:
Separate the workflows.
Separate the lanes.
Keep diagnostics evidence-first.
Keep regulatory mapping subordinate to evidence.
Keep advisory AI bounded.
Optimize for inspectability, not just score production.
That is what changed the project.
Not bigger claims.
Better boundaries.

Next Step
If your AI system works in demos but still feels fragile, start here.
Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.
Direct founder contact · Response within 1-2 business days