
From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System
Bio/medical AI trust should not collapse into one score. STEM BIO-AI v1.6.2 shows how deterministic auditing, evidence-led diagnostics, regulatory traceability, and bounded AI advisory can become an inspectable local workflow.
Series: STEM-AI: Sovereign Trust Evaluator for Medical AI Artifacts · Part 5 of 5

Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.
That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the previous post in this series covers it, and the series page collects the broader arc.
But after that, a different engineering problem took over.
The audit logic was stricter.
The reports were richer.
The reasoning was more bounded.
But the developer workflow still felt too loose.
So the next question was no longer:
How do I score trust?
It became:
How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?
The answer turned out to be less about seeing more signals and more about refusing to confuse them.
That is the core argument of this post:
A detector becomes more trustworthy when it is strict about what it cannot conclude.
Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.
The problem was no longer scoring

By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.
The bottleneck was operational clarity.
A trust audit tool is not very useful if:
- the normal path is one long command with too many flags
- CI has to reverse-engineer the result from human-readable stdout
- bio-specific diagnostics are mixed directly into the same surface as formal scoring
- regulatory relevance shows up as vague implication instead of explicit traceability
- advisory AI is present, but its relationship to the official score is unclear
At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.
That is a different class of problem.
The CLI had to reflect operator intent
The earlier CLI was functional, but too flat.
You could reach scanning, gating, advisory export, and response validation from a single entry point by stacking flags, and all of it worked.
The issue was that it treated very different operator intents as one long option surface.
In practice, these are separate workflows:
- scan a repository and generate artifacts
- enforce a gate in CI/CD
- export a bounded advisory packet
- validate a downstream provider response
- cross an explicit provider-call boundary
So I refactored the CLI around workflows instead of flag accumulation: each of those intents now has its own command path.
The older paths still exist for compatibility, but they are no longer the conceptual center.
That matters more than it sounds.
Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.
This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.
Repository trust needed four separate lanes

This was the biggest architectural shift.
I stopped treating repository trust as one object.
In practice, it needed four separate lanes:
- deterministic structural scoring
- deterministic diagnostics
- regulatory traceability
- optional AI advisory
If all of those collapse into one final confidence score, the tool becomes harder to reason about.
The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.
Some evidence should change the score.
Some evidence should only raise review priority.
Some evidence should support traceability.
Some evidence should be handed to a human or advisory system.
The maturity of the tool is not that it sees all of them.
The maturity is that it does not confuse them.
This separation is not just conceptual. It exists in the code path.
One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?
In STEM BIO-AI, the answer is visible in the execution order.
The scanner computes the formal score first. In the result object, that means keys like:
- Stage 1
- Stage 2R
- Stage 3
- risk penalty
- score cap
- final_score
- formal_tier
Only after that does it append the non-scoring layers, again as explicit result keys:
- regulatory_basis
- stage_traceability
- regulatory_traceability
- reasoning_model
- ai_advisory (optional)
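To make that concrete, here is a minimal sketch of the build order, assuming placeholder values. The key names follow the list above; everything else is illustrative rather than the actual scanner implementation.

```python
# Illustrative sketch of the build order, not the project's actual scanner code.
# Key names follow the result object described above; values are placeholders.

def build_result(repo_path: str) -> dict:
    # 1. The deterministic scoring pass runs first and fully determines
    #    the formal outcome.
    result = {
        "final_score": 67,       # placeholder value
        "formal_tier": "T2",     # derived only from final_score
    }

    # 2. Non-scoring layers are appended afterwards as separate keys.
    #    Nothing in this step reads or rewrites the two fields above.
    result["regulatory_basis"] = {}          # registry-driven mapping metadata
    result["stage_traceability"] = {}
    result["regulatory_traceability"] = {}
    result["reasoning_model"] = {}
    result["evidence_ledger"] = []           # diagnostics lane, evidence-only
    result["ai_advisory"] = None             # optional advisory lane

    return result
```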
That ordering matters.
The score is not derived from the advisory lane.
The regulatory mapping does not mutate the formal tier.
The diagnostics lane can emit evidence without becoming a hidden score multiplier.
This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.
That execution order is the architectural reason the next four sections exist.
Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.
Put differently, the next four sections answer four different questions:
- what is allowed to change the formal tier
- what is useful enough to emit, but not yet mature enough to score
- what can support regulatory review without pretending to be compliance
- what can involve AI without letting AI become the scoring authority
1. Deterministic structural scoring

This remains the official score and tier.
It measures the main repository-visible signals:
- README evidence
- repo-local consistency
- code and bio responsibility
- dependency hygiene
- changelog and provenance surfaces
- code-integrity patterns
This lane is local, deterministic, and machine-checkable.
That is the part that can legitimately drive a formal triage tier.
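As a rough sketch of what local, deterministic, and machine-checkable means in practice, the pass below scores a few boolean repository checks, applies a risk penalty and a score cap, and maps the result onto the published T0-T4 ranges. The check names, weights, and penalty handling are invented for illustration; only the tier boundaries come from the published scale.

```python
# Hypothetical sketch of a deterministic scoring pass.
# Check names, weights, and penalty values are illustrative only;
# the T0-T4 boundaries match the published triage scale.

CHECKS = {                      # weight per repository-visible signal
    "readme_evidence": 25,
    "repo_local_consistency": 20,
    "dependency_hygiene": 20,
    "changelog_provenance": 20,
    "code_integrity_patterns": 15,
}

def score_repo(observed: dict[str, bool], risk_penalty: int = 0,
               score_cap: int = 100) -> tuple[int, str]:
    # Every input is a repository-visible, machine-checkable observation.
    base = sum(w for name, w in CHECKS.items() if observed.get(name, False))
    final_score = max(0, min(base - risk_penalty, score_cap))

    if final_score >= 85:
        tier = "T4"
    elif final_score >= 70:
        tier = "T3"
    elif final_score >= 55:
        tier = "T2"
    elif final_score >= 40:
        tier = "T1"
    else:
        tier = "T0"
    return final_score, tier
```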
I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.
I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.
2. Deterministic diagnostics
This is where the deterministic diagnostics spec became important.
I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.
That is what docs/DETERMINISTIC_DIAGNOSTICS.md defines. It separates the diagnostic problem into two lanes:
- Lane A: deterministic local diagnostics
- Lane B: optional AI-assisted semantic review
That separation is central.
The deterministic lane is authoritative for hard findings.
The AI lane is advisory only.
The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:
- malformed or suspicious SMILES-like outputs
- missing parser guards
- silent mock or simulated-data fallbacks
- risky subprocess construction around bio tools
- traceability manifest surfaces
The point was not to create a “bio slop detector” with a catchy label.
The point was to create a local evidence lane that could say:
- here is the file
- here is the line
- here is the snippet
- here is the bounded interpretation
That is much more useful than a vague semantic warning.
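Here is a hypothetical sketch of that kind of line-level evidence record. The detector below flags quoted, low-entropy SMILES-like placeholder literals; the pattern, the field names, and the wording of the interpretation are illustrative, not the project's actual detector or ledger schema.

```python
import re
from pathlib import Path

# Hypothetical evidence-only detector sketch, not the project's actual code.
# It flags string literals that look like low-entropy SMILES placeholders
# (a single atom symbol repeated many times) and emits a line-level record.

PLACEHOLDER = re.compile(r"""["']([A-Za-z])\1{7,}["']""")   # e.g. "CCCCCCCC"

def scan_file(path: Path) -> list[dict]:
    records = []
    for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
        if PLACEHOLDER.search(line):
            records.append({
                "file": str(path),
                "line": lineno,
                "snippet": line.strip(),
                # The interpretation stays bounded: it describes the surface
                # pattern, not chemistry, efficacy, or intent.
                "interpretation": "low-entropy SMILES-like placeholder literal",
            })
    return records
```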
Why diagnostics stayed evidence-only

This was one of the harder engineering decisions.
It would have been easy to push every new bio-specific detector directly into the final score.
I did not do that.
The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice:
- findings are emitted as line-level records in the result object's evidence_ledger
- findings appear in the Markdown report and in the explain trace
- findings do not change final_score or formal_tier
That is the right default.
For example, the SMILES lane can be very useful for detecting:
- malformed surface strings
- low-entropy placeholders
- repeated trivial outputs
- missing parser guards
But it does not prove:
- medicinal usefulness
- synthetic feasibility
- binding plausibility
- biological efficacy
- full chemical validity in every edge case
That boundary is important.
A detector becomes more trustworthy when it is strict about what it cannot conclude.
Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.
3. Regulatory traceability

The second document that became central was docs/REGULATORY_MAPPING.md. This solved a different problem.
Once you audit clinical-adjacent repositories, people naturally ask:
- does this align with EU AI Act themes?
- does this help with FDA-oriented review?
- is there anything relevant to IMDRF or SaMD evidence families?
The wrong answer would be to turn those questions into a fake compliance score.
So I did the opposite.
The regulatory layer is explicitly framed as:
a traceability aid, not a compliance verdict
That document maps observed evidence classes to requirement families with bounded confidence labels like:
- strong
- moderate
- weak-moderate
- weak
- not assessed
And it makes an important distinction:
the confidence applies to the mapping relationship, not to legal acceptability.
Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.
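As an illustration only, such rule-level entries could be written down as data roughly like this. The structure and field names are invented; the two example judgments mirror the ones just described, and the confidence still applies to the mapping relationship, not to legal acceptability.

```python
# Hypothetical representation of rule-level mapping entries, for illustration only.
# The real registry lives in docs/REGULATORY_MAPPING.md and is not reproduced here.

REGULATORY_MAPPING = [
    {
        "evidence_class": "changelog / checksum / config-manifest surfaces",
        "requirement_family": "EU AI Act Article 12 (record-keeping / traceability)",
        "mapping_confidence": "moderate",   # fixed rule-level judgment, not a model output
        "note": "traceability scaffolding signal; not deploy-time event logging",
    },
    {
        "evidence_class": "human-oversight interface signals (override controls)",
        "requirement_family": "human-oversight review",
        "mapping_confidence": "weak",       # interface presence is not oversight procedure
        "note": "supports review focus only; no claim about operational oversight",
    },
]
```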
That means the tool can say things like:
- versioned manifests and changelogs may support record-keeping / traceability review
- intended-use and disclaimer sections may support transparency scaffolding review
- override interfaces may support human-oversight interface review
- subgroup measurement language may support weak evidence of data-governance intent
without claiming:
- legal compliance
- regulatory clearance
- clinical certification
- deployer conformance
In a regulated domain, traceability is useful only when it does not pretend to be permission.
A concrete example: why Article 12 is traceability, not compliance
The best example here is EU AI Act Article 12 style traceability.
The regulatory mapping layer treats signals like:
- changelogs
- checksum manifests
- versioned config surfaces
- audit-log schema fragments
- decision-event or override-event schema tokens
as evidence that a repository may have traceability scaffolding.
That is useful.
It is also bounded.
The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.
So the output can legitimately say:
- there is structural evidence relevant to traceability review
while refusing to say:
- this system satisfies traceability obligations
That is exactly the kind of distinction I wanted this lane to enforce.
What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.
Why regulatory mapping stayed subordinate to evidence
This was non-negotiable.
Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.
That is why the output shape separates regulatory_basis, stage_traceability, and regulatory_traceability from the actual score computation.
And it is not just decorative structure.
The regulatory basis object is registry-driven. It can mark review_required when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.
This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.
If a repo has traceability-relevant scaffolding, that is useful.
If a repo has traceability-relevant scaffolding, that is still not compliance.
The distinction has to remain visible in both the code and the artifacts.
4. Optional AI advisory

The fourth lane is the advisory layer.
This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.
That means workflows like exporting a provider-neutral advisory packet and validating the downstream response can exist without creating ambiguity about who owns the formal result.
The advisory layer can:
- export a provider-neutral packet
- validate downstream response structure
- enforce finding-ID citation rules
- reject prohibited claims
- surface runtime and secret boundaries
What it cannot do is silently override:
- score.final_score
- score.formal_tier
How that rule is actually enforced
This is not just policy language in the README.
The advisory validator explicitly checks for score-override attempts. If a response includes fields like final_score, formal_tier, replication_score, or replication_tier, or sets final_score_override, the response is marked invalid with final_score_override_requested.
The packet contract also exports the rule in plain language: do not modify or override final_score, formal_tier, replication_score, or replication_tier.
And provider responses must cite exact values from allowed_finding_ids; citation strings are not repaired or loosely matched later.
So the advisory lane is bounded in two ways:
- it has no authority to change the deterministic result
- it cannot cite evidence outside the bounded packet
That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.
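A simplified sketch of that kind of check, assuming the provider response has already been parsed into a dict. The function name and the cited_findings key are illustrative; the forbidden fields, the final_score_override_requested label, and the exact-match rule against allowed_finding_ids follow the behavior described above.

```python
# Simplified sketch of the two advisory boundaries described above.
# Not the project's actual validator; helper names are illustrative.

FORBIDDEN_FIELDS = {
    "final_score", "formal_tier", "replication_score", "replication_tier",
    "final_score_override",
}

def validate_advisory_response(response: dict, allowed_finding_ids: set[str]) -> list[str]:
    errors = []

    # Boundary 1: no authority over the deterministic result.
    if FORBIDDEN_FIELDS & response.keys():
        errors.append("final_score_override_requested")

    # Boundary 2: citations must be exact values from the bounded packet.
    for finding_id in response.get("cited_findings", []):
        if finding_id not in allowed_finding_ids:   # no repair, no fuzzy matching
            errors.append(f"uncited_or_unknown_finding: {finding_id}")

    return errors   # an empty list means the response stayed within its lane
```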
What operational use looks like now

Once these lanes were separated, the CLI became much easier to reason about.
Each of these operator intents now has its own command path:
- local engineering review
- a CI/CD gate (a minimal external gate sketch follows below)
- offline advisory packet generation
- downstream provider response validation
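As one concrete example of the gate intent, the sketch below consumes the machine-readable JSON artifact and fails a pipeline below a chosen tier. It is an external illustration, not the tool's own gate command; the artifact path, the minimum-tier policy, and the tolerance for labels like T2 Caution are assumptions, while final_score and formal_tier are the result keys described earlier.

```python
# Illustrative CI gate over the JSON artifact, not the tool's built-in gate path.
# The artifact location and the minimum-tier policy are assumptions.
import json
import sys

TIER_ORDER = ["T0", "T1", "T2", "T3", "T4"]

def gate(artifact_path: str, minimum_tier: str = "T2") -> int:
    with open(artifact_path) as fh:
        result = json.load(fh)

    score = result["final_score"]
    # Tolerate tier labels such as "T2 Caution" by keeping only the tier code.
    tier = str(result["formal_tier"]).split()[0]

    if TIER_ORDER.index(tier) < TIER_ORDER.index(minimum_tier):
        print(f"FAIL: {score}/100 ({tier}) is below the {minimum_tier} gate")
        return 1
    print(f"PASS: {score}/100 ({tier})")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```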
The important point is not just that these commands exist.
It is that each one represents a distinct trust boundary.
That made the project feel more like engineering infrastructure and less like a scoring demo.
A real v1.6.2 packet
To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of ClawBio, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.
The run used the standard scan path against the local clone. On my machine, it took about 9.4 seconds and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.
Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:
- T0 = 0-39
- T1 = 40-54
- T2 = 55-69
- T3 = 70-84
- T4 = 85-100
Stage 4 replication is reported separately as its own lane, where R2 means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.
Governance note: this is not a "bad repository" scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.
With that in mind, the result was:
- 67 / 100
- T2 Caution
- Replication lane: 55 / 100 (R2)
- Clinical adjacency: CA-DIRECT (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)
- Code integrity warnings: C2 dependency pinning, C4 exception handling
This is exactly the workflow shift I wanted the tool to support.
The same deterministic scan is rendered into multiple operator surfaces:
- JSON for automation
- Markdown for review
- PDF for human-facing packet inspection
- explain trace for file / line / snippet proof tracing
That output shape is only possible because the result object already separates:
- formal score and tier
- replication lane
- diagnostics lane
- regulatory traceability
- advisory boundary state
In other words, the PDF is not a separate product. It is a view over the same bounded audit object.
Two details from this run are worth calling out.
First, the scanner did not manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:
- SMILES Surface Integrity: not_detected
- SMILES RDKit Validation: not_applicable
- SMILES Parser Guard: not_detected
That is the behavior I want. If a detector has no evidence, it should stay silent instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.
Second, the score is strict about observable repository conventions. ClawBio uses ClawBio_README_Repo.md rather than a root README.md, so the scan records S1_missing_readme: -20. A human reviewer might decide that this is acceptable contextually. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.
That distinction matters. A T2 Caution result here does not mean "ClawBio is unsafe." It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a stricter-than-human README convention check.
And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.
What still has to stay bounded
The system is better than it was, but there are still obvious next steps.
1. The public surface is broad
There is now:
- scoring
- diagnostics
- replication
- advisory packeting
- regulatory traceability
- JSON / Markdown / PDF / explain outputs
That is useful, but it increases onboarding cost.
The CLI is clearer now, but the broader public surface has to stay disciplined.
2. The deterministic diagnostics lane is still missing a published calibration threshold
The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.
Right now the rule is conceptually clear:
- commit-pinned fixtures
- reproducible detector output
- explicit false-positive review
But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.
3. The regulatory confidence labels are rule-authored, not empirically validated
The mapping labels like strong, moderate, and weak-moderate are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.
That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.
Try it yourself
STEM BIO-AI is Apache 2.0 and fully open source.
If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.
That is the real test.
- License: Apache 2.0
Final thought
The earlier STEM-AI posts were about why repository trust deserves its own audit layer.
This phase was about something more practical:
what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?
For me, the answer was simple:
Separate the workflows.
Separate the lanes.
Keep diagnostics evidence-first.
Keep regulatory mapping subordinate to evidence.
Keep advisory AI bounded.
Optimize for inspectability, not just score production.
That is what changed the project.
Not bigger claims.
Better boundaries.

Next Step
If your AI system works in demos but still feels fragile, start here.
Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.
Direct founder contact · Response within 1-2 business days