
Bio-AI Repository Audit 2026: A Technical Report on 10 Open-Source Systems
We audited 10 prominent open-source Bio-AI repositories using code inspection and STEM-AI trust scoring. 8 of 10 scored T0: trust not established. Here is what the code actually shows.

- Report ID: FLAMEHAVEN-AUDIT-BIO-2026-001-v3.1
- Date: March 24, 2026
- Audit Snapshot Date: March 20, 2026
- Report Validity: 180 days from the audit snapshot date
- Audit Framework: Technical Code Audit (Protocol Re-Genesis) + STEM-AI Trust Scoring v1.0.4
- Execution Mode: Manual repository inspection using repository structure analysis, live README verification, and selective file-level source review
- Scope Note: This is a time-bounded audit snapshot. Later remediation experiments, post-audit repository changes, and later STEM-AI rubric revisions are outside the scope of this document unless explicitly stated.
- Disclaimer: All STEM-AI scores in this report are LLM-generated rubric evaluations against publicly visible repository content at the audit snapshot date. This is not a regulatory determination, clinical certification, or legal assessment. No hospital, clinic, procurement body, or regulatory authority should treat this report as a substitute for independent expert review, formal validation, or official clearance.
Abstract
The rapid proliferation of LLM-driven biological AI tools has introduced a critical vulnerability into computational biology pipelines: the appearance of competence without mechanisms to verify it.
This report presents a two-layer audit of ten visible open-source Bio-AI repositories and adjacent scientific automation systems: a workflow-oriented technical code inspection reconstructing repository execution paths and identifying file-level vulnerabilities, followed by structured trust scoring under the STEM-AI v1.0.4 framework.
The central finding is stark: 8 of 10 repositories score T0 (trust not established), 1 scores T1, and only 1 achieves T2. Zero repositories reach the T3 threshold required for supervised pilot consideration.
Four recurring patterns account for most failures: clinical-adjacent scope without corresponding accountability, CI pipelines that validate form rather than scientific validity, mock implementations presented as functional outputs, and secure architectures undermined by insecure defaults.
The report concludes that the core bottleneck in open Bio-AI is not model capability alone, but the absence of governance mechanisms that make claims reproducible, bounded, and institutionally reviewable before deployment.
1. Introduction

Drug development takes roughly ten years and more than a billion dollars per approved medicine. That timeline is not a sign of laziness or lack of imagination. It reflects the cost of verifying biological claims against physical reality.
The promise of AI in drug discovery, functional annotation, and computational genomics is real. These systems may accelerate candidate generation, compress literature review, reduce friction in bioinformatics workflows, and surface patterns that human researchers would otherwise miss.
But the central question is not whether the code runs.
It is whether the system can establish what its outputs actually mean.
This report asks a narrow, technical version of that broader question:
When these repositories run, do they provide enough structural honesty, reproducibility, and accountability to justify trust before downstream use?
Across ten highly visible repositories and adjacent systems, the answer is overwhelmingly no.
This is not primarily a statement about raw scientific accuracy. It is a statement about verifiability. A system whose outputs cannot be independently traced, whose failures cannot be attributed to specific components, and whose claims cannot be matched against a deterministic review surface is not a trustworthy system in any clinical, translational, or pharmaceutical setting, regardless of how sophisticated its architecture appears.
2. Methodology

2.1 Repository Selection
The ten repositories were selected by visibility as of March 2026: GitHub stars, technical discussion frequency, and evidence of practical ecosystem exposure. This is not a random sample of the Bio-AI ecosystem. It is a visibility-weighted sample of repositories and adjacent systems likely to shape first impressions, technical adoption, and procurement attention.
One repository, AI-Scientist, is included as a high-visibility adjacent control case in autonomous scientific automation, even though it is not itself a biological pipeline repository. Its inclusion helps distinguish failures specific to Bio-AI from failures common to autonomous scientific software more broadly.
Operationally, the selection rule was:
- identify repositories and adjacent systems that were highly visible in Bio-AI or scientific-agent discussion at the audit snapshot date
- prefer repositories with practical adoption surfaces rather than purely speculative concept repos
- include at least one adjacent control case to test whether the observed failure patterns are specific to Bio-AI or common across autonomous scientific automation
- stop at ten repositories for a bounded, publication-scale audit set rather than an ecosystem census
This selection rule is directional, not exhaustive. The aim is to evaluate the visible face of the ecosystem, not to claim statistical representativeness over all Bio-AI repositories.
2.2 Two-Layer Audit Design
The audit was conducted in two independent layers.
Layer 1: Technical Code Audit
Each repository was analyzed through repository structure analysis that maps directory layout, module roles, critical function signatures, and dependency relationships. Execution paths were reconstructed from entry point to claimed output using structure mapping, README verification, and selective file-level review. Findings were checked against the active README surface. Where necessary, direct source inspection was performed.
Layer 2: STEM-AI Trust Scoring v1.0.4
STEM-AI (Sovereign Trust Evaluator for Medical AI) is an internal rubric covering documentation integrity, security posture, scientific claim calibration, licensing surface, and governance architecture. Score formula:
Final = (S1 × 0.50 + S3 × 0.50) - Risk Penalty
- S1 evaluates README and documentation integrity
- S3 evaluates code substance, testing, change discipline, and biological integrity provisions
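As a minimal sketch of how the published scores compose (the function name is illustrative, not part of the STEM-AI codebase), the formula reduces to:

```python
def stem_ai_final(s1: float, s3: float, risk_penalty: float) -> float:
    """STEM-AI v1.0.4 final score: equal-weighted S1 and S3, minus the risk penalty."""
    return s1 * 0.50 + s3 * 0.50 - risk_penalty

# Biomni's published sub-scores (S1=24, S3=10) reproduce the reported final of 17
# under a zero penalty; AI-Scientist (S1=70, S3=26) reproduces 48 the same way.
print(stem_ai_final(24, 10, 0))  # → 17.0
print(stem_ai_final(70, 26, 0))  # → 48.0
```

Where a reported final does not match the zero-penalty sum exactly (e.g. ClawBio's 62.5 rounding to 63), the difference reflects rounding or an applied penalty.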
The two layers answer different questions:
- Technical audit asks: what does the repository actually do?
- STEM-AI asks: does the repository expose enough structural honesty and accountability that an institution could even begin to establish trust?
2.3 Review Depth And Evidence Rules
The audit did not attempt full dynamic recreation of all ten repositories across their original environments. Instead, it used a bounded but repeatable review depth:
- map the repository structure and likely execution surfaces
- identify entry points, orchestration layers, and output paths
- compare the active README surface against the executable surface
- inspect the files most directly responsible for the claimed workflow
- escalate to deeper file-level review where claims, risks, or contradictions were material
Selective file-level review was therefore risk-oriented, not random. Priority was given to:
- entry points
- orchestrators
- runtime wrappers
- validation logic
- output-generation surfaces
- files implicated by user-facing claims
When README claims and code reality diverged, the repository was judged against the executable surface rather than the prose surface. README claims were still scored under S1, but they were not treated as proof of runtime truth.
2.4 Important Limitation
STEM-AI v1.0.4 is an internally designed rubric, not an externally peer-reviewed standard. Every score in this report was generated by an LLM applying that rubric. This limitation is material and disclosed explicitly. The same structural issue this report identifies in Bio-AI systems, probabilistic systems judging outputs without deterministic external oracles, also applies here. Readers should therefore treat the scores as structured audit outputs, not as final truth.
2.5 Why Full Dynamic Recreation Was Not Performed
Full dynamic recreation of all ten repositories was not attempted for three reasons:
- many repositories depended on heavyweight external services, opaque assets, remote checkpoints, or local environment assumptions that were themselves part of the trust question
- the goal of this report was ecosystem-level structural diagnosis, not benchmark reproduction of each project
- reproducibility claims made by a repository were treated as objects of audit rather than assumed capabilities of the audit environment
This creates a known tradeoff. The report is stronger on structural diagnosis than on live replication. That limitation is intentional and should be read as part of the scope of the document.
2.6 Licensing Assessment Method
Licensing posture was described using observable repository surfaces, not inferred intent.
The audit distinguishes among:
- presence or absence of a root LICENSE file
- README-level or custom-README-level license declarations
- contradictory file-level headers inside executable code
This report does not infer author intent. It records only the observable mismatch, if any, between user-facing license claims and repository-root licensing artifacts.
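The observable-surface rule above can be applied mechanically. A minimal sketch, assuming a locally cloned repository (the function name and returned keys are illustrative, and a real check would also scan file-level headers):

```python
import tempfile
from pathlib import Path

def license_surface(repo_root: str) -> dict:
    """Record observable licensing artifacts without inferring author intent."""
    root = Path(repo_root)
    readme = next(root.glob("README*"), None)
    readme_text = readme.read_text(errors="ignore") if readme else ""
    return {
        "root_license_present": (root / "LICENSE").exists(),
        "readme_declares_mit": "MIT" in readme_text,
    }

# Demonstration on a synthetic repository: the README claims MIT,
# but no root LICENSE file exists -- the mismatch pattern seen in this audit.
with tempfile.TemporaryDirectory() as repo:
    Path(repo, "README.md").write_text("License: MIT")
    surface = license_surface(repo)

print(surface)  # → {'root_license_present': False, 'readme_declares_mit': True}
```

A mismatch between the two keys is exactly the kind of surface inconsistency this report records without inferring intent.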
2.7 Risk Penalty And Ambiguity Handling
Risk penalties were applied conservatively. In ambiguous cases, the audit preferred under-claiming rather than over-claiming.
Working rule:
- strong deductions required observable evidence
- absent evidence was not treated as hidden evidence
- ambiguous or mixed signals were recorded as surface inconsistency, not as proof of malicious intent
False positives and false negatives remain possible. This report should therefore be treated as a structured starting point for serious review, not as a substitute for repository-specific validation.
3. Summary of Findings

3.1 Integrated Results Table
| # | Repository | Affiliation | License Surface at Audit Time | S1 | S3 | Final | Tier | Clinical Adjacent |
|---|---|---|---|---|---|---|---|---|
| 1 | Biomni | Stanford SNAP | MIT (root LICENSE present) | 24 | 10 | 17 | T0 | Yes |
| 2 | AI-Scientist | SakanaAI | Apache 2.0 (root LICENSE present) | 70 | 26 | 48 | T1 | No |
| 3 | CellAgent | Academic (CN) | No root LICENSE file observed | 30 | 0 | 15 | T0 | Yes |
| 4 | ClawBio | Independent | MIT (root LICENSE present) | 42 | 83 | 63 | T2 | Yes |
| 5 | LabClaw | Stanford/Princeton | Custom repository README declares MIT; no root LICENSE file observed | 40 | 0 | 20 | T0 | Yes |
| 6 | claude-scientific-skills | K-Dense Inc. | MIT (root LICENSE present) | 30 | 18 | 24 | T0 | Yes |
| 7 | SciAgent-Skills | HITS Heidelberg | CC BY 4.0 (root LICENSE present) | 40 | 23 | 32 | T0 | Yes |
| 8 | BioAgents | bio-xyz | No root LICENSE file observed | 52 | 8 | 30 | T0 | No |
| 9 | BioClaw | Independent | MIT (root LICENSE present) | 40 | 18 | 29 | T0 | Yes |
| 10 | OpenClaw-Medical-Skills | FreedomIntelligence | Custom repository README displays MIT badge; no root LICENSE file observed; conflicting file-level header present | 27 | 16 | 22 | T0 | Yes |
Ecosystem result: 8 of 10 T0. 1 of 10 T1. 1 of 10 T2. Zero T3 or T4.
Tier scale:
T0 (0–39) / T1 (40–54) / T2 (55–69) / T3 (70–84) / T4 (85–100)
3.2 If You Remember One Thing
Almost every repository in this audit can produce outputs. Very few can prove what those outputs mean, under what conditions they are valid, or when they should halt instead of proceeding.
3.3 Repeating Failure Patterns

Across the sample, four patterns dominate:
- Clinical-adjacent scope without clinical-adjacent accountability
- CI pipelines that validate syntax, formatting, or structure rather than scientific validity
- Mock or placeholder logic presented through user-facing surfaces as if functional
- Architectures with strong local design undermined by unsafe defaults at deployment time
These are not cosmetic issues. They are structural trust failures.
3.4 Tier Meaning In Practical Terms
The tier labels in this report are intended as practical operating categories, not prestige labels.
T0: trust not established; repository should not be used without full independent validation
T1: partial structural strengths exist, but the repository remains quarantine-grade rather than deployment-grade
T2: meaningful trust-building architecture exists, but the repository still falls short of supervised pilot readiness
T3: minimum threshold for supervised pilot consideration under explicit oversight
T4: strong enough trust posture for higher-consequence operational integration, subject to domain-specific legal and regulatory constraints
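The tier bands (T0 0–39, T1 40–54, T2 55–69, T3 70–84, T4 85–100) map mechanically from the final score. A minimal sketch, with an illustrative function name:

```python
def tier(final_score: float) -> str:
    """Map a STEM-AI final score (0-100) to its trust tier per the published bands."""
    bands = [(85, "T4"), (70, "T3"), (55, "T2"), (40, "T1")]
    for floor, label in bands:
        if final_score >= floor:
            return label
    return "T0"

print(tier(63))  # ClawBio's final score → "T2"
print(tier(48))  # AI-Scientist's final score → "T1"
print(tier(17))  # Biomni's final score → "T0"
```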
The difference between 17 and 32 matters less than the fact that both remain in T0. The tier is therefore the primary interpretive surface, while the exact score is best read as a structured explanation of why a repository sits where it does.
4. Repository-by-Repository Findings

Judgments in this section are made under the criteria stated in Sections 2 and 3. In particular, strong language refers to the repository's trust posture under this audit's executable-surface, governance, and accountability criteria, not to the total scientific value of the underlying research area.
4.1 Biomni
Type: Hybrid agent + tool registry
License Surface: MIT (root LICENSE present)
Biomni is architecturally ambitious: a LangGraph-based research agent with broad tool coverage across biology subdomains. It also exhibits some of the most serious runtime risks in the audit. LLM-generated shell and R code are executed as subprocesses without sandboxing or path restrictions. Data lake assets are downloaded at startup without checksum validation. A dual schema layer (tool/ vs tool_description/) creates a silent drift risk between descriptive and executable surfaces.
Bottom line: impressive scope, weak governance at exactly the surfaces that matter most.
STEM-AI: 17 / 100 (T0)

4.2 AI-Scientist
Type: Autonomous research automation pipeline
License Surface: Apache 2.0 (root LICENSE present)
AI-Scientist is not a biological pipeline repository in the narrow sense, but it is relevant as an adjacent control case. It demonstrates the same structural problem in a different domain: unconstrained code generation and execution presented as autonomous research progress. Aider is allowed to modify experiment.py freely, and the result is executed without sandboxing. The pipeline is conceptually novel and cleanly staged, but its trust posture remains weak.
Bottom line: a useful adjacent control case showing that autonomy without deterministic governance remains dangerous even outside Bio-AI proper.
STEM-AI: 48 / 100 (T1)

4.3 CellAgent
Type: Specialized multi-agent scRNA-seq workflow
License Surface: No root LICENSE file observed
CellAgent is a compact and conceptually coherent attempt to automate single-cell analysis through a Planner → Executor → Evaluator loop. But the implementation is fragile. It contains a hardcoded developer notebook path, missing exception handling, language-coupled control logic, and a missing final result method that breaks the pipeline at completion.
Bottom line: a research prototype with a sound high-level idea but nowhere near trustworthy execution maturity.
STEM-AI: 15 / 100 (T0)

4.4 ClawBio
Type: Skill framework + execution platform
License Surface: MIT (root LICENSE present)
ClawBio was the clear outlier in this audit. It is not flawless, but it is the only repository that showed substantive evidence of reproducibility discipline and domain-aware validation. In particular, detect_processed_input_reason() in its scRNA input layer and its input checksum strategy stand out as materially stronger than anything else in the sample. It also runs actual domain tests in CI.
Its weaknesses are real: several routed skills have only SKILL.md files without implementations, and extension-only routing remains fragile. But unlike the others, ClawBio demonstrates what it looks like when a repository begins to treat trust as an architectural problem rather than a prose problem.
Bottom line: the only repository in the sample that begins to approach trust as a runtime property rather than a marketing surface.
STEM-AI: 63 / 100 (T2)

4.5 LabClaw
Type: Skill catalog / knowledge layer
License Surface: Custom repository README declares MIT; no root LICENSE file observed
LabClaw is strategically interesting and operationally weak. It is essentially a massive SKILL.md catalog with no executable runtime, no tests, and no verification layer. It may be useful as a methodology corpus or reasoning support layer, but by itself it cannot validate scientific correctness. It also contains numerous clinical and pharma-facing skills without corresponding clinical accountability language.
Bottom line: broad coverage, almost no executable trust surface.
STEM-AI: 20 / 100 (T0)

4.6 claude-scientific-skills
Type: Hybrid skill library
License Surface: MIT (root LICENSE present)
This repository is one of the most mature skill-distribution systems in the sample, but its CI validates documentation structure more than scientific execution. Bundled scripts for clinically adjacent tasks exist, yet they are not meaningfully exercised through the pipeline. License metadata inside individual skills is also inconsistent in places.
Bottom line: strong packaging discipline, weak scientific verification discipline.
STEM-AI: 24 / 100 (T0)

4.7 SciAgent-Skills
Type: Structured skill library
License Surface: CC BY 4.0 (root LICENSE present)
SciAgent-Skills has the strongest authoring governance of the skill-library group. Its template discipline, registry validation, and pixi-based workflow are materially better than most comparable repositories. Yet it remains a skill system, not a governed execution system. Clinical and bioinformatics claims are not backed by domain regression tests, and key parsing logic is brittle.
Bottom line: structurally disciplined authoring, but not a trustworthy scientific runtime.
STEM-AI: 32 / 100 (T0)

4.8 BioAgents
Type: Production research platform
License Surface: No root LICENSE file observed
BioAgents is one of the most architecturally sophisticated repositories in the sample. It has a real multi-agent state model, a persistent world state, a frontend, an API surface, and a production-style service architecture. But its default posture undercuts that sophistication. Rate limiting disappears when queue mode is disabled. CORS behavior is permissive in no-origin cases. The admin dashboard can be exposed without authentication. External service failure paths are insufficiently surfaced.
Bottom line: a strong platform design weakened by unsafe defaults and a lack of domain-accountability artifacts.
STEM-AI: 30 / 100 (T0)

4.9 BioClaw
Type: Messaging-integrated agent runtime
License Surface: MIT (root LICENSE present)
BioClaw is novel in a way most other repositories are not: it attempts to make a multi-channel biological assistant usable through messaging systems while preserving per-group isolation through Docker containers. That is real architectural work. But writable project-root mounts for the main group, mutable module-level globals, and silent truncation of container output create serious security and reliability concerns.
Bottom line: inventive runtime design, insufficient containment discipline.
STEM-AI: 29 / 100 (T0)

4.10 OpenClaw-Medical-Skills
Type: Large medical and bioinformatics skill library
License Surface: Custom repository README displays an MIT badge; no root LICENSE file observed; at least one executable file carries an All Rights Reserved header
This repository contains some of the broadest biomedical skill coverage in the audit. It also contains the clearest integrity violation in the sample. In agentd-drug-discovery, a mock analogue-generation function operates through simple string manipulation of SMILES values while the user-facing skill surface presents the pipeline as if it can generate candidate molecules. This is not just a bug. It is a trust-surface failure.
The repository also exhibits licensing inconsistency across visible surfaces and executable files.
Bottom line: exceptional breadth, undermined by one of the clearest cases of mock-as-functional misrepresentation in the audit.
STEM-AI: 22 / 100 (T0)

5. What These Results Mean

The most important result in this audit is not that most repositories scored poorly.
It is that they failed in similar ways.
Across very different architectures, organizations, and codebases, the same structural weakness appears: the field is much better at producing the appearance of biological competence than at building systems that can bound, trace, and justify that competence before downstream use.
This is why the central bottleneck in Bio-AI is not model capability alone.
It is governance.
Specifically:
- truth-surface separation
- deterministic review surfaces
- fail-closed execution boundaries
- domain regression tests
- provenance discipline
- explicit scope limits
- human-in-command approval structures
Without these, even technically impressive repositories remain unfit for supervised pilot consideration.
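To make "fail-closed execution boundaries" concrete, a minimal sketch follows. All names here are hypothetical, invented for illustration; no audited repository implements this interface. The point is the shape: the step refuses to run when required evidence is absent, rather than proceeding and producing an output of unknown validity.

```python
class MissingEvidenceError(RuntimeError):
    """Raised when a pipeline step cannot establish its preconditions."""

def run_step(inputs: dict, required: tuple) -> dict:
    """Fail closed: halt before execution if any required evidence is absent."""
    missing = [key for key in required if inputs.get(key) is None]
    if missing:
        raise MissingEvidenceError(f"halting: missing {missing}")
    return {"status": "ok", "used": sorted(required)}

# A hypothetical annotation step that requires both a reference genome
# and a provenance record before it will execute at all.
try:
    run_step({"reference_genome": "GRCh38"},
             ("reference_genome", "provenance_record"))
except MissingEvidenceError as err:
    print(err)  # → halting: missing ['provenance_record']
```

The contrast with the audited repositories is that most of them default to the opposite posture: when an input, model, or evidence artifact is missing, they substitute, mock, or continue silently.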
6. Procurement and Remediation Priorities

If an institution, lab, or pharma group evaluates Bio-AI repositories for adoption, the baseline questions should not begin with benchmark claims. They should begin with the following:
- What can the local repository actually do without hidden assets?
- What claims are made in README, and are they matched by executable reality?
- Does the system fail closed when required inputs, models, or evidence are missing?
- Are there domain-specific regression tests, not just syntax or structure checks?
- Is the licensing surface coherent?
- Can outputs be traced to specific inputs, code paths, and execution states?
The remediation order suggested by this audit is:
- establish truthful README and scope boundaries
- add fail-closed runtime contracts
- add domain-specific regression tests
- fix unsafe defaults in production-style systems
- unify licensing and executable trust surfaces
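As one concrete instance of a fail-closed runtime contract, the unverified startup downloads flagged in Section 4.1 could be guarded by checksum pinning. A minimal sketch, with illustrative file names and no claim about any audited repository's actual download code:

```python
import hashlib
import tempfile

def verify_asset(path: str, expected_sha256: str) -> None:
    """Fail closed: refuse to load an asset whose hash does not match the pinned value."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(f"checksum mismatch for {path}: refusing to load")

# Demonstration with a synthetic data-lake asset.
with tempfile.NamedTemporaryFile(delete=False, suffix=".h5ad") as f:
    f.write(b"example data lake asset")
    asset_path = f.name

pinned = hashlib.sha256(b"example data lake asset").hexdigest()
verify_asset(asset_path, pinned)        # matching pin: loads silently
try:
    verify_asset(asset_path, "0" * 64)  # wrong pin: fails closed
except RuntimeError as err:
    print("blocked:", err)
```

The pinned hashes would live in version control alongside the code, so that an altered or corrupted asset halts the pipeline instead of silently feeding downstream analysis.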
6.1 Why T3 Is the Minimum For Supervised Pilot
This report treats T3 as the minimum threshold for supervised pilot consideration because below that level a repository still lacks too many of the properties required for bounded real-world review:
- truthful scope boundaries
- executable trust surfaces
- meaningful regression checks
- tractable failure interpretation
- enough accountability structure to support human oversight
T2 can indicate serious progress. It does not yet indicate that the repository can safely enter a supervised pilot loop in a biology- or patient-adjacent setting.
6.2 Why T4 Is the Minimum For Higher-Consequence Integration
This report treats T4 as the minimum threshold for connection to higher-consequence workflows such as automated synthesis or clinical-adjacent operational integration because those environments magnify the cost of silent error.
At that point, a repository should not merely be promising. It should demonstrate:
- strong governance architecture
- clear provenance and traceability
- stable domain validation pathways
- bounded failure behavior
- mature documentation of scope and risk
This threshold is a normative governance judgment, not a regulatory standard. It is stated explicitly so that institutions can agree or disagree with it on visible terms.
7. Evidence Navigation And Reproducibility Notes
This report is designed to function as a canonical reference rather than a one-off essay. The table below is therefore included as an appendix-grade navigation layer: it maps each repository's primary report claim to representative evidence surfaces that a reviewer can inspect directly.
This table is representative rather than exhaustive. It does not list every reviewed file. It anchors each repository verdict to the most consequential visible artifacts under this audit's criteria.
7.1 Appendix-Grade Evidence Table
| Repository | Primary Report Claim | Representative Evidence Surface | Evidence Type | Why It Matters Under This Audit |
|---|---|---|---|---|
| Biomni | Architecturally ambitious, but high-risk runtime surfaces remain under-governed | Biomini_README_Repo.md; biomni/agent/a1.py; biomni/utils.py; biomni/tool/lab_automation.py | README, orchestration code, execution utility, tool runtime | The README presents broad biomedical execution scope; the agent layer downloads missing assets; shared utilities use shell/subprocess execution; lab automation contains direct script execution. Together these support the governance-risk finding. |
| AI-Scientist | Useful adjacent control case showing autonomy without strong deterministic governance | AI_Scientist_README_Repo.md; launch_scientist.py; ai_scientist/perform_experiments.py; ai_scientist/perform_writeup.py | README, orchestration entry point, experiment runner, writeup pipeline | The repository explicitly routes automated scientific work through editable experiment.py flows and subprocess-driven execution. This supports the T1-but-not-trustworthy judgment. |
| CellAgent | Compact research prototype with coherent idea but fragile execution maturity | main.py; src/planner.py; src/code_sandbox.py | Entrypoint, planning layer, notebook execution layer | The hardcoded notebook path, completion path, and notebook-execution wiring are concentrated in these files. They anchor the claim that the pipeline idea is stronger than the runtime discipline. |
| ClawBio | Clear outlier with real reproducibility and domain-aware validation discipline | AGENTS.md; .github/workflows/ci.yml; pytest.ini; clawbio/common/scrna_io.py; skills/clinpgx/tests/test_clinpgx.py | Process guide, CI, test registration, reusable validation function, domain test | These files show actual CI activity, registered tests, and a concrete input-validation surface (detect_processed_input_reason) rather than README-only maturity signals. |
| LabClaw | Broad coverage, almost no executable trust surface | LabClaw_README_Repo.md; skills/med/clinical/SKILL.md; skills/pharma/rdkit/SKILL.md; skills/bio/scanpy/SKILL.md | README, representative skill surfaces | The repository strongly presents production-ready biomedical workflows through SKILL.md coverage, but the audited surface is overwhelmingly instructional rather than executable or test-backed. |
| claude-scientific-skills | Strong packaging and distribution discipline, weak scientific execution verification | claude-scientific-skills_README_Repo.md; LICENSE; docs/examples.md; representative SKILL.md entries | README, root license, examples, skill surfaces | The README promises tested workflows and broad scientific capability, but the visible trust surface is mostly documentation/package structure rather than repository-level domain validation. |
| SciAgent-Skills | Best authoring governance among skill libraries, but still not a governed scientific runtime | CLAUDE.md; CONTRIBUTING.md; registry.yaml; pixi.toml; scripts/validate_registry.py | Authoring guide, contribution workflow, registry, test config, validator | These artifacts support the positive finding on authoring and registry discipline, while also showing that the test surface is mainly structural consistency rather than biological regression. |
| BioAgents | Sophisticated platform design weakened by unsafe deployment defaults and weak trust artifacts | src/index.ts; src/middleware/rateLimiter.ts; src/routes/admin/queue-dashboard.ts; docker-compose.yml; docker-compose.worker.yml | Runtime entry point, middleware, admin route, deployment config | These surfaces directly govern CORS behavior, queue/rate-limit behavior, admin exposure, and deployment posture. They support the claim that architecture quality is undermined by unsafe defaults. |
| BioClaw | Novel messaging-integrated runtime with inventive design but insufficient containment discipline | README.md; src/container-runner.ts; src/mount-security.ts; src/cli.ts | README, container runtime, mount policy, CLI runtime | The core trust question in BioClaw lives in container mounts, mount allowlisting, and output handling. These files anchor the containment and truncation concerns described in Section 4.9. |
| OpenClaw-Medical-Skills | Exceptional breadth, but mock-as-functional misrepresentation and license-surface contradiction are material | OpenClaw-Medical-Skills_README_Repo.md; skills/agentd-drug-discovery/agent_d.py; scripts/validate_skill.py | Custom repository README, executable skill code, validator script | The custom README presents broad biomedical and clinical capability; the agentd-drug-discovery runtime contains the clearest mock-as-functional issue in the audit; the validator surface is relevant because it did not adequately prevent that trust-surface gap at audit time. |
7.2 Reproducibility Notes
Readers attempting to reproduce this audit should treat the workflow as:
- inspect the active README or custom repository README surface
- identify the likely entry point or execution orchestrator
- inspect the files most directly responsible for the claimed workflow
- compare the claimed capability surface against the executable surface
- assign STEM-AI scores only after the technical surface is understood
Future companion materials may still expand this section with:
- a rubric summary sheet for STEM-AI scoring interpretation
- a repository-by-repository claim matrix with line-level anchors
- an errata log for post-publication corrections or repository changes
8. Conclusion

The open Bio-AI ecosystem is energetic, creative, and technically ambitious.
It is also structurally under-governed.
This report does not argue that these repositories are worthless. It argues something more specific and more urgent: most of them are not yet trustworthy systems. They are research artifacts, productivity accelerants, or capability demonstrations. Those categories can be valuable. But they are not the same thing as deployable, reviewable, pilot-ready infrastructure.
The field's problem is not a lack of intelligence.
It is a lack of mechanisms that distinguish real capability from implied capability before that distinction becomes expensive, or dangerous, to discover.
That is the threshold Bio-AI must cross next.
9. Versioning And Errata Policy
This document should be read as a versioned audit artifact.
- audit snapshot date: March 20, 2026
- report publication version: v3.1
- later repository changes do not retroactively change this snapshot
- substantive post-publication corrections should be recorded through an explicit errata update rather than silent replacement
For an accessible overview of the STEM-AI framework and its design rationale, see the companion explainer on DEV.
End of document.