
Bio-AI Repository Audit 2026: A Technical Report on 10 Open-Source Systems
We audited 10 prominent open-source Bio-AI repositories using code inspection and STEM-AI trust scoring. 8 of 10 scored T0: trust not established. Here is what the code actually shows.

- Report ID: FLAMEHAVEN-AUDIT-BIO-2026-001-v3.1
- Date: March 24, 2026
- Audit Snapshot Date: March 20, 2026
- Report Validity: 180 days from the audit snapshot date
- Audit Framework: Technical Code Audit (Protocol Re-Genesis) + STEM-AI Trust Scoring v1.0.4
- Execution Mode: Manual repository inspection using repository structure analysis, live README verification, and selective file-level source review
- Scope Note: This is a time-bounded audit snapshot. Later remediation experiments, post-audit repository changes, and later STEM-AI rubric revisions are outside the scope of this document unless explicitly stated.
- Disclaimer: All STEM-AI scores in this report are LLM-generated rubric evaluations against publicly visible repository content at the audit snapshot date. This is not a regulatory determination, clinical certification, or legal assessment. No hospital, clinic, procurement body, or regulatory authority should treat this report as a substitute for independent expert review, formal validation, or official clearance.
Abstract
The rapid proliferation of LLM-driven biological AI tools has introduced a critical vulnerability into computational biology pipelines: the appearance of competence without mechanisms to verify it.
This report presents a two-layer audit of ten visible open-source Bio-AI repositories and adjacent scientific automation systems: a workflow-oriented technical code inspection reconstructing repository execution paths and identifying file-level vulnerabilities, followed by structured trust scoring under the STEM-AI v1.0.4 framework.
The central finding is stark: 8 of 10 repositories score T0 (trust not established), 1 scores T1, and only 1 achieves T2. Zero repositories reach the T3 threshold required for supervised pilot consideration.
Four recurring patterns account for most failures: clinical-adjacent scope without corresponding accountability, CI pipelines that validate form rather than scientific validity, mock implementations presented as functional outputs, and secure architectures undermined by insecure defaults.
The report concludes that the core bottleneck in open Bio-AI is not model capability alone, but the absence of governance mechanisms that make claims reproducible, bounded, and institutionally reviewable before deployment.
1. Introduction

Drug development takes roughly ten years and more than a billion dollars per approved medicine. That timeline is not a sign of laziness or lack of imagination. It reflects the cost of verifying biological claims against physical reality.
The promise of AI in drug discovery, functional annotation, and computational genomics is real. These systems may accelerate candidate generation, compress literature review, reduce friction in bioinformatics workflows, and surface patterns that human researchers would otherwise miss.
But the central question is not whether the code runs.
It is whether the system can establish what its outputs actually mean.
This report asks a narrow, technical version of that broader question:
When these repositories run, do they provide enough structural honesty, reproducibility, and accountability to justify trust before downstream use?
Across ten highly visible repositories and adjacent systems, the answer is overwhelmingly no.
This is not primarily a statement about raw scientific accuracy. It is a statement about verifiability. A system whose outputs cannot be independently traced, whose failures cannot be attributed to specific components, and whose claims cannot be matched against a deterministic review surface is not a trustworthy system in any clinical, translational, or pharmaceutical setting, regardless of how sophisticated its architecture appears.
2. Methodology

2.1 Repository Selection
The ten repositories were selected by visibility as of March 2026: GitHub stars, technical discussion frequency, and evidence of practical ecosystem exposure. This is not a random sample of the Bio-AI ecosystem. It is a visibility-weighted sample of repositories and adjacent systems likely to shape first impressions, technical adoption, and procurement attention.
One repository, AI-Scientist, is included as a high-visibility adjacent control case in autonomous scientific automation, even though it is not itself a biological pipeline repository. Its inclusion helps distinguish failures specific to Bio-AI from failures common to autonomous scientific software more broadly.
Operationally, the selection rule was:
- identify repositories and adjacent systems that were highly visible in Bio-AI or scientific-agent discussion at the audit snapshot date
- prefer repositories with practical adoption surfaces rather than purely speculative concept repos
- include at least one adjacent control case to test whether the observed failure patterns are specific to Bio-AI or common across autonomous scientific automation
- stop at ten repositories for a bounded, publication-scale audit set rather than an ecosystem census
This selection rule is directional, not exhaustive. The aim is to evaluate the visible face of the ecosystem, not to claim statistical representativeness over all Bio-AI repositories.
2.2 Two-Layer Audit Design
The audit was conducted in two independent layers.
Layer 1: Technical Code Audit
Each repository was analyzed through repository structure analysis that maps directory layout, module roles, critical function signatures, and dependency relationships. Execution paths were reconstructed from entry point to claimed output using structure mapping, README verification, and selective file-level review. Findings were checked against the active README surface. Where necessary, direct source inspection was performed.
Layer 2: STEM-AI Trust Scoring v1.0.4
STEM-AI (Sovereign Trust Evaluator for Medical AI) is an internal rubric covering documentation integrity, security posture, scientific claim calibration, licensing surface, and governance architecture. Score formula:
Final = (S1 × 0.50 + S3 × 0.50) - Risk Penalty
- S1 evaluates README and documentation integrity
- S3 evaluates code substance, testing, change discipline, and biological integrity provisions
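As a minimal sketch of how the published scores compose (the function name is illustrative, not part of the STEM-AI codebase), the formula reduces to:

```python
def stem_ai_final(s1: float, s3: float, risk_penalty: float) -> float:
    """STEM-AI v1.0.4 final score: equal-weighted S1 and S3, minus the risk penalty."""
    return s1 * 0.50 + s3 * 0.50 - risk_penalty

# Biomni's published sub-scores (S1=24, S3=10) reproduce the reported final of 17
# under a zero penalty; AI-Scientist (S1=70, S3=26) reproduces 48 the same way.
print(stem_ai_final(24, 10, 0))  # → 17.0
print(stem_ai_final(70, 26, 0))  # → 48.0
```

Where a reported final does not match the zero-penalty sum exactly (e.g. ClawBio's 62.5 rounding to 63), the difference reflects rounding or an applied penalty.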
The two layers answer different questions:
- Technical audit asks: what does the repository actually do?
- STEM-AI asks: does the repository expose enough structural honesty and accountability that an institution could even begin to establish trust?
2.3 Review Depth And Evidence Rules
The audit did not attempt full dynamic recreation of all ten repositories across their original environments. Instead, it used a bounded but repeatable review depth:
- map the repository structure and likely execution surfaces
- identify entry points, orchestration layers, and output paths
- compare the active README surface against the executable surface
- inspect the files most directly responsible for the claimed workflow
- escalate to deeper file-level review where claims, risks, or contradictions were material
Selective file-level review was therefore risk-oriented, not random. Priority was given to:
- entry points
- orchestrators
- runtime wrappers
- validation logic
- output-generation surfaces
- files implicated by user-facing claims
When README claims and code reality diverged, the repository was judged against the executable surface rather than the prose surface. README claims were still scored under S1, but they were not treated as proof of runtime truth.
2.4 Important Limitation
STEM-AI v1.0.4 is an internally designed rubric, not an externally peer-reviewed standard. Every score in this report was generated by an LLM applying that rubric. This limitation is material and disclosed explicitly. The same structural issue this report identifies in Bio-AI systems, probabilistic systems judging outputs without deterministic external oracles, also applies here. Readers should therefore treat the scores as structured audit outputs, not as final truth.
2.5 Why Full Dynamic Recreation Was Not Performed
Full dynamic recreation of all ten repositories was not attempted for three reasons:
- many repositories depended on heavyweight external services, opaque assets, remote checkpoints, or local environment assumptions that were themselves part of the trust question
- the goal of this report was ecosystem-level structural diagnosis, not benchmark reproduction of each project
- reproducibility claims made by a repository were treated as objects of audit rather than assumed capabilities of the audit environment
This creates a known tradeoff. The report is stronger on structural diagnosis than on live replication. That limitation is intentional and should be read as part of the scope of the document.
2.6 Licensing Assessment Method
Licensing posture was described using observable repository surfaces, not inferred intent.
The audit distinguishes among:
- presence or absence of a root LICENSE file
- README-level or custom-README-level license declarations
- contradictory file-level headers inside executable code
This report does not infer author intent. It records only the observable mismatch, if any, between user-facing license claims and repository-root licensing artifacts.
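The observable-surface rule above can be applied mechanically. A minimal sketch, assuming a locally cloned repository (the function name and returned keys are illustrative, and a real check would also scan file-level headers):

```python
import tempfile
from pathlib import Path

def license_surface(repo_root: str) -> dict:
    """Record observable licensing artifacts without inferring author intent."""
    root = Path(repo_root)
    readme = next(root.glob("README*"), None)
    readme_text = readme.read_text(errors="ignore") if readme else ""
    return {
        "root_license_present": (root / "LICENSE").exists(),
        "readme_declares_mit": "MIT" in readme_text,
    }

# Demonstration on a synthetic repository: the README claims MIT,
# but no root LICENSE file exists -- the mismatch pattern seen in this audit.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "README.md").write_text("License: MIT")
    surface = license_surface(repo)

print(surface)  # → {'root_license_present': False, 'readme_declares_mit': True}
```

A mismatch between the two keys is exactly the kind of surface inconsistency this report records without inferring intent.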
2.7 Risk Penalty And Ambiguity Handling
Risk penalties were applied conservatively. In ambiguous cases, the audit preferred under-claiming rather than over-claiming.
Working rule:
- strong deductions required observable evidence
- absent evidence was not treated as hidden evidence
- ambiguous or mixed signals were recorded as surface inconsistency, not as proof of malicious intent
False positives and false negatives remain possible. This report should therefore be treated as a structured starting point for serious review, not as a substitute for repository-specific validation.
3. Summary of Findings

3.1 Integrated Results Table
| # | Repository | Affiliation | License Surface at Audit Time | S1 | S3 | Final | Tier | Clinical Adjacent |
|---|---|---|---|---|---|---|---|---|
| 1 | Biomni | Stanford SNAP | MIT (root LICENSE present) | 24 | 10 | 17 | T0 | Yes |
| 2 | AI-Scientist | SakanaAI | Apache 2.0 (root LICENSE present) | 70 | 26 | 48 | T1 | No |
| 3 | CellAgent | Academic (CN) | No root LICENSE file observed | 30 | 0 | 15 | T0 | Yes |
| 4 | ClawBio | Independent | MIT (root LICENSE present) | 42 | 83 | 63 | T2 | Yes |
| 5 | LabClaw | Stanford/Princeton | Custom repository README declares MIT; no root LICENSE file observed | 40 | 0 | 20 | T0 | Yes |
| 6 | claude-scientific-skills | K-Dense Inc. | MIT (root LICENSE present) | 30 | 18 | 24 | T0 | Yes |
| 7 | SciAgent-Skills | HITS Heidelberg | CC BY 4.0 (root LICENSE present) | 40 | 23 | 32 | T0 | Yes |
| 8 | BioAgents | bio-xyz | No root LICENSE file observed | 52 | 8 | 30 | T0 | No |
| 9 | BioClaw | Independent | MIT (root LICENSE present) | 40 | 18 | 29 | T0 | Yes |
| 10 | OpenClaw-Medical-Skills | FreedomIntelligence | Custom repository README displays MIT badge; no root LICENSE file observed; conflicting file-level header present | 27 | 16 | 22 | T0 | Yes |
Ecosystem result: 8 of 10 T0. 1 of 10 T1. 1 of 10 T2. Zero T3 or T4.
Tier scale:
T0 (0–39) / T1 (40–54) / T2 (55–69) / T3 (70–84) / T4 (85–100)
3.2 If You Remember One Thing
Almost every repository in this audit can produce outputs. Very few can prove what those outputs mean, under what conditions they are valid, or when they should halt instead of proceeding.
3.3 Repeating Failure Patterns

Across the sample, four patterns dominate:
- Clinical-adjacent scope without clinical-adjacent accountability
- CI pipelines that validate syntax, formatting, or structure rather than scientific validity
- Mock or placeholder logic presented through user-facing surfaces as if functional
- Architectures with strong local design undermined by unsafe defaults at deployment time
These are not cosmetic issues. They are structural trust failures.
3.4 Tier Meaning In Practical Terms
The tier labels in this report are intended as practical operating categories, not prestige labels.
T0: trust not established; repository should not be used without full independent validation
T1: partial structural strengths exist, but the repository remains quarantine-grade rather than deployment-grade
T2: meaningful trust-building architecture exists, but the repository still falls short of supervised pilot readiness
T3: minimum threshold for supervised pilot consideration under explicit oversight
T4: strong enough trust posture for higher-consequence operational integration, subject to domain-specific legal and regulatory constraints
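The tier bands (T0 0–39, T1 40–54, T2 55–69, T3 70–84, T4 85–100) map mechanically from the final score. A minimal sketch, with an illustrative function name:

```python
def tier(final_score: float) -> str:
    """Map a STEM-AI final score (0-100) to its trust tier per the published bands."""
    bands = [(85, "T4"), (70, "T3"), (55, "T2"), (40, "T1")]
    for floor, label in bands:
        if final_score >= floor:
            return label
    return "T0"

print(tier(63))  # ClawBio's final score → "T2"
print(tier(48))  # AI-Scientist's final score → "T1"
print(tier(17))  # Biomni's final score → "T0"
```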
The difference between 17 and 32 matters less than the fact that both remain in T0. The tier is therefore the primary interpretive surface, while the exact score is best read as a structured explanation of why a repository sits where it does.
4. Repository-by-Repository Findings

Judgments in this section are made under the criteria stated in Sections 2 and 3. In particular, strong language refers to the repository's trust posture under this audit's executable-surface, governance, and accountability criteria, not to the total scientific value of the underlying research area.
4.1 Biomni
Type: Hybrid agent + tool registry
License Surface: MIT (root LICENSE present)
Biomni is architecturally ambitious: a LangGraph-based research agent with broad tool coverage across biology subdomains. It also exhibits some of the most serious runtime risks in the audit. LLM-generated shell and R code are executed as subprocesses without sandboxing or path restrictions. Data lake assets are downloaded at startup without checksum validation. A dual schema layer (tool/ vs tool_description/) creates a silent drift risk between descriptive and executable surfaces.
Bottom line: impressive scope, weak governance at exactly the surfaces that matter most.
STEM-AI: 17 / 100 (T0)

4.2 AI-Scientist
Type: Autonomous research automation pipeline
License Surface: Apache 2.0 (root LICENSE present)
AI-Scientist is not a biological pipeline repository in the narrow sense, but it is relevant as an adjacent control case. It demonstrates the same structural problem in a different domain: unconstrained code generation and execution presented as autonomous research progress. Aider is allowed to modify experiment.py freely, and the result is executed without sandboxing. The pipeline is conceptually novel and cleanly staged, but its trust posture remains weak.
Bottom line: a useful adjacent control case showing that autonomy without deterministic governance remains dangerous even outside Bio-AI proper.
STEM-AI: 48 / 100 (T1)

4.3 CellAgent
Type: Specialized multi-agent scRNA-seq workflow
License Surface: No root LICENSE file observed
CellAgent is a compact and conceptually coherent attempt to automate single-cell analysis through a Planner → Executor → Evaluator loop. But the implementation is fragile. It contains a hardcoded developer notebook path, missing exception handling, language-coupled control logic, and a missing final result method that breaks the pipeline at completion.
Bottom line: a research prototype with a sound high-level idea but nowhere near trustworthy execution maturity.
STEM-AI: 15 / 100 (T0)

4.4 ClawBio
Type: Skill framework + execution platform
License Surface: MIT (root LICENSE present)
ClawBio was the clear outlier in this audit. It is not flawless, but it is the only repository that showed substantive evidence of reproducibility discipline and domain-aware validation. In particular, detect_processed_input_reason() in its scRNA input layer and its input checksum strategy stand out as materially stronger than anything else in the sample. It also runs actual domain tests in CI.
Its weaknesses are real: several routed skills have only SKILL.md files without implementations, and extension-only routing remains fragile. But unlike the others, ClawBio demonstrates what it looks like when a repository begins to treat trust as an architectural problem rather than a prose problem.
Bottom line: the only repository in the sample that begins to approach trust as a runtime property rather than a marketing surface.
STEM-AI: 63 / 100 (T2)

4.5 LabClaw
Type: Skill catalog / knowledge layer
License Surface: Custom repository README declares MIT; no root LICENSE file observed
LabClaw is strategically interesting and operationally weak. It is essentially a massive SKILL.md catalog with no executable runtime, no tests, and no verification layer. It may be useful as a methodology corpus or reasoning support layer, but by itself it cannot validate scientific correctness. It also contains numerous clinical and pharma-facing skills without corresponding clinical accountability language.
Bottom line: broad coverage, almost no executable trust surface.
STEM-AI: 20 / 100 (T0)

4.6 claude-scientific-skills
Type: Hybrid skill library
License Surface: MIT (root LICENSE present)
This repository is one of the most mature skill-distribution systems in the sample, but its CI validates documentation structure more than scientific execution. Bundled scripts for clinically adjacent tasks exist, yet they are not meaningfully exercised through the pipeline. License metadata inside individual skills is also inconsistent in places.
Bottom line: strong packaging discipline, weak scientific verification discipline.
STEM-AI: 24 / 100 (T0)

4.7 SciAgent-Skills
Type: Structured skill library
License Surface: CC BY 4.0 (root LICENSE present)
SciAgent-Skills has the strongest authoring governance of the skill-library group. Its template discipline, registry validation, and pixi-based workflow are materially better than most comparable repositories. Yet it remains a skill system, not a governed execution system. Clinical and bioinformatics claims are not backed by domain regression tests, and key parsing logic is brittle.
Bottom line: structurally disciplined authoring, but not a trustworthy scientific runtime.
STEM-AI: 32 / 100 (T0)

4.8 BioAgents
Type: Production research platform
License Surface: No root LICENSE file observed
BioAgents is one of the most architecturally sophisticated repositories in the sample. It has a real multi-agent state model, a persistent world state, a frontend, an API surface, and a production-style service architecture. But its default posture undercuts that sophistication. Rate limiting disappears when queue mode is disabled. CORS behavior is permissive in no-origin cases. The admin dashboard can be exposed without authentication. External service failure paths are insufficiently surfaced.
Bottom line: a strong platform design weakened by unsafe defaults and a lack of domain-accountability artifacts.
STEM-AI: 30 / 100 (T0)

4.9 BioClaw
Type: Messaging-integrated agent runtime
License Surface: MIT (root LICENSE present)
BioClaw is novel in a way most other repositories are not: it attempts to make a multi-channel biological assistant usable through messaging systems while preserving per-group isolation through Docker containers. That is real architectural work. But writable project-root mounts for the main group, mutable module-level globals, and silent truncation of container output create serious security and reliability concerns.
Bottom line: inventive runtime design, insufficient containment discipline.
STEM-AI: 29 / 100 (T0)

4.10 OpenClaw-Medical-Skills
Type: Large medical and bioinformatics skill library
License Surface: Custom repository README displays an MIT badge; no root LICENSE file observed; at least one executable file carries an All Rights Reserved header
This repository contains some of the broadest biomedical skill coverage in the audit. It also contains the clearest integrity violation in the sample. In agentd-drug-discovery, a mock analogue-generation function operates through simple string manipulation of SMILES values while the user-facing skill surface presents the pipeline as if it can generate candidate molecules. This is not just a bug. It is a trust-surface failure.
The repository also exhibits licensing inconsistency across visible surfaces and executable files.
Bottom line: exceptional breadth, undermined by one of the clearest cases of mock-as-functional misrepresentation in the audit.
STEM-AI: 22 / 100 (T0)

5. What These Results Mean

The most important result in this audit is not that most repositories scored poorly.
It is that they failed in similar ways.
Across very different architectures, organizations, and codebases, the same structural weakness appears: the field is much better at producing the appearance of biological competence than at building systems that can bound, trace, and justify that competence before downstream use.
This is why the central bottleneck in Bio-AI is not model capability alone.
It is governance.
Specifically:
- truth-surface separation
- deterministic review surfaces
- fail-closed execution boundaries
- domain regression tests
- provenance discipline
- explicit scope limits
- human-in-command approval structures
Without these, even technically impressive repositories remain unfit for supervised pilot consideration.
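To make "fail-closed execution boundaries" concrete, a minimal sketch follows. All names here are hypothetical, invented for illustration; no audited repository implements this interface. The point is the shape: the step refuses to run when required evidence is absent, rather than proceeding and producing an output of unknown validity.

```python
class MissingEvidenceError(RuntimeError):
    """Raised when a pipeline step cannot establish its preconditions."""

def run_step(inputs: dict, required: tuple) -> dict:
    """Fail closed: halt before execution if any required evidence is absent."""
    missing = [key for key in required if inputs.get(key) is None]
    if missing:
        raise MissingEvidenceError(f"halting: missing {missing}")
    return {"status": "ok", "used": sorted(required)}

# A hypothetical annotation step that requires both a reference genome
# and a provenance record before it will execute at all.
try:
    run_step({"reference_genome": "GRCh38"},
             ("reference_genome", "provenance_record"))
except MissingEvidenceError as err:
    print(err)  # → halting: missing ['provenance_record']
```

The contrast with the audited repositories is that most of them default to the opposite posture: when an input, model, or evidence artifact is missing, they substitute, mock, or continue silently.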
6. Procurement and Remediation Priorities

If an institution, lab, or pharma group evaluates Bio-AI repositories for adoption, the baseline questions should not begin with benchmark claims. They should begin with the following:
- What can the local repository actually do without hidden assets?
- What claims are made in README, and are they matched by executable reality?
- Does the system fail closed when required inputs, models, or evidence are missing?
- Are there domain-specific regression tests, not just syntax or structure checks?
- Is the licensing surface coherent?
- Can outputs be traced to specific inputs, code paths, and execution states?
The remediation order suggested by this audit is:
- establish truthful README and scope boundaries
- add fail-closed runtime contracts
- add domain-specific regression tests
- fix unsafe defaults in production-style systems
- unify licensing and executable trust surfaces
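As one concrete instance of a fail-closed runtime contract, the unverified startup downloads flagged in Section 4.1 could be guarded by checksum pinning. A minimal sketch, with illustrative file names and no claim about any audited repository's actual download code:

```python
import hashlib
import tempfile

def verify_asset(path: str, expected_sha256: str) -> None:
    """Fail closed: refuse to load an asset whose hash does not match the pinned value."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"checksum mismatch for {path}: refusing to load")

# Demonstration with a synthetic data-lake asset.
with tempfile.NamedTemporaryFile(delete=False, suffix=".h5ad") as f:
    f.write(b"example data lake asset")
    asset_path = f.name

pinned = hashlib.sha256(b"example data lake asset").hexdigest()
verify_asset(asset_path, pinned)        # matching pin: loads silently
try:
    verify_asset(asset_path, "0" * 64)  # wrong pin: fails closed
except RuntimeError as err:
    print("blocked:", err)
```

The pinned hashes would live in version control alongside the code, so that an altered or corrupted asset halts the pipeline instead of silently feeding downstream analysis.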
6.1 Why T3 Is the Minimum For Supervised Pilot
This report treats T3 as the minimum threshold for supervised pilot consideration because below that level a repository still lacks too many of the properties required for bounded real-world review:
- truthful scope boundaries
- executable trust surfaces
- meaningful regression checks
- tractable failure interpretation
- enough accountability structure to support human oversight
T2 can indicate serious progress. It does not yet indicate that the repository can safely enter a supervised pilot loop in a biology- or patient-adjacent setting.
6.2 Why T4 Is the Minimum For Higher-Consequence Integration
This report treats T4 as the minimum threshold for connection to higher-consequence workflows such as automated synthesis or clinical-adjacent operational integration because those environments magnify the cost of silent error.
At that point, a repository should not merely be promising. It should demonstrate:
- strong governance architecture
- clear provenance and traceability
- stable domain validation pathways
- bounded failure behavior
- mature documentation of scope and risk
This threshold is a normative governance judgment, not a regulatory standard. It is stated explicitly so that institutions can agree or disagree with it on visible terms.
7. Evidence Navigation And Reproducibility Notes
This report is designed to function as a canonical reference rather than a one-off essay. The table below is therefore included as an appendix-grade navigation layer: it maps each repository's primary report claim to representative evidence surfaces that a reviewer can inspect directly.
This table is representative rather than exhaustive. It does not list every reviewed file. It anchors each repository verdict to the most consequential visible artifacts under this audit's criteria.
7.1 Appendix-Grade Evidence Table
| Repository | Primary Report Claim | Representative Evidence Surface | Evidence Type | Why It Matters Under This Audit |
|---|---|---|---|---|
| Biomni | Architecturally ambitious, but high-risk runtime surfaces remain under-governed | Biomini_README_Repo.md; biomni/agent/a1.py; biomni/utils.py; biomni/tool/lab_automation.py | README, orchestration code, execution utility, tool runtime | The README presents broad biomedical execution scope; the agent layer downloads missing assets; shared utilities use shell/subprocess execution; lab automation contains direct script execution. Together these support the governance-risk finding. |
| AI-Scientist | Useful adjacent control case showing autonomy without strong deterministic governance | AI_Scientist_README_Repo.md; launch_scientist.py; ai_scientist/perform_experiments.py; ai_scientist/perform_writeup.py | README, orchestration entry point, experiment runner, writeup pipeline | The repository explicitly routes automated scientific work through editable experiment.py flows and subprocess-driven execution. This supports the T1-but-not-trustworthy judgment. |
| CellAgent | Compact research prototype with coherent idea but fragile execution maturity | main.py; src/planner.py; src/code_sandbox.py | Entrypoint, planning layer, notebook execution layer | The hardcoded notebook path, completion path, and notebook-execution wiring are concentrated in these files. They anchor the claim that the pipeline idea is stronger than the runtime discipline. |
| ClawBio | Clear outlier with real reproducibility and domain-aware validation discipline | AGENTS.md; .github/workflows/ci.yml; pytest.ini; clawbio/common/scrna_io.py; skills/clinpgx/tests/test_clinpgx.py | Process guide, CI, test registration, reusable validation function, domain test | These files show actual CI activity, registered tests, and a concrete input-validation surface (detect_processed_input_reason) rather than README-only maturity signals. |
| LabClaw | Broad coverage, almost no executable trust surface | LabClaw_README_Repo.md; skills/med/clinical/SKILL.md; skills/pharma/rdkit/SKILL.md; skills/bio/scanpy/SKILL.md | README, representative skill surfaces | The repository strongly presents production-ready biomedical workflows through SKILL.md coverage, but the audited surface is overwhelmingly instructional rather than executable or test-backed. |
| claude-scientific-skills | Strong packaging and distribution discipline, weak scientific execution verification | claude-scientific-skills_README_Repo.md; LICENSE; docs/examples.md; representative SKILL.md entries | README, root license, examples, skill surfaces | The README promises tested workflows and broad scientific capability, but the visible trust surface is mostly documentation/package structure rather than repository-level domain validation. |
| SciAgent-Skills | Best authoring governance among skill libraries, but still not a governed scientific runtime | CLAUDE.md; CONTRIBUTING.md; registry.yaml; pixi.toml; scripts/validate_registry.py | Authoring guide, contribution workflow, registry, test config, validator | These artifacts support the positive finding on authoring and registry discipline, while also showing that the test surface is mainly structural consistency rather than biological regression. |
| BioAgents | Sophisticated platform design weakened by unsafe deployment defaults and weak trust artifacts | src/index.ts; src/middleware/rateLimiter.ts; src/routes/admin/queue-dashboard.ts; docker-compose.yml; docker-compose.worker.yml | Runtime entry point, middleware, admin route, deployment config | These surfaces directly govern CORS behavior, queue/rate-limit behavior, admin exposure, and deployment posture. They support the claim that architecture quality is undermined by unsafe defaults. |
| BioClaw | Novel messaging-integrated runtime with inventive design but insufficient containment discipline | README.md; src/container-runner.ts; src/mount-security.ts; src/cli.ts | README, container runtime, mount policy, CLI runtime | The core trust question in BioClaw lives in container mounts, mount allowlisting, and output handling. These files anchor the containment and truncation concerns described in Section 4.9. |
| OpenClaw-Medical-Skills | Exceptional breadth, but mock-as-functional misrepresentation and license-surface contradiction are material | OpenClaw-Medical-Skills_README_Repo.md; skills/agentd-drug-discovery/agent_d.py; scripts/validate_skill.py | Custom repository README, executable skill code, validator script | The custom README presents broad biomedical and clinical capability; the agentd-drug-discovery runtime contains the clearest mock-as-functional issue in the audit; the validator surface is relevant because it did not adequately prevent that trust-surface gap at audit time. |
7.2 Reproducibility Notes
Readers attempting to reproduce this audit should treat the workflow as:
- inspect the active README or custom repository README surface
- identify the likely entry point or execution orchestrator
- inspect the files most directly responsible for the claimed workflow
- compare the claimed capability surface against the executable surface
- assign STEM-AI scores only after the technical surface is understood
Future companion materials may still expand this section with:
- a rubric summary sheet for STEM-AI scoring interpretation
- a repository-by-repository claim matrix with line-level anchors
- an errata log for post-publication corrections or repository changes
8. Conclusion

The open Bio-AI ecosystem is energetic, creative, and technically ambitious.
It is also structurally under-governed.
This report does not argue that these repositories are worthless. It argues something more specific and more urgent: most of them are not yet trustworthy systems. They are research artifacts, productivity accelerants, or capability demonstrations. Those categories can be valuable. But they are not the same thing as deployable, reviewable, pilot-ready infrastructure.
The field's problem is not a lack of intelligence.
It is a lack of mechanisms that distinguish real capability from implied capability before that distinction becomes expensive, or dangerous, to discover.
That is the threshold Bio-AI must cross next.
9. Versioning And Errata Policy
This document should be read as a versioned audit artifact.
- audit snapshot date: March 20, 2026
- report publication version: v3.1
- later repository changes do not retroactively change this snapshot
- substantive post-publication corrections should be recorded through an explicit errata update rather than silent replacement
For an accessible overview of the STEM-AI framework and its design rationale, see the companion explainer on DEV.
End of document.