STEM-BIO-AI Audit Report: yorkeccak/bio
When a README Claim Meets a Deterministic Scanner
Published: May 2026
Evaluation Target: yorkeccak/bio
Evaluation Engine: STEM-BIO-AI v1.7.5
Execution Mode: LOCAL_ANALYSIS
Audit Expires: 2026-06-29 (45-day freshness window)
Executive Summary
| Field | Value |
| --- | --- |
| Final Score | 48 / 100 |
| Formal Tier | T1 Quarantine |
| Use Scope | Exploratory review only. No patient-adjacent use. |
| Replication Tier | R1 (Stage 4: 30/100) |
| AIRI Coverage | 8 / 32 |
The T1 Quarantine result is justified. The repository presents a strong bio-facing README and a recognizable data-source story, but it still lacks the engineering, governance, and deployment-boundary evidence needed for patient-adjacent or regulated use. The most important shift in the v7 output is not the final score itself. It is that the report now surfaces external-service dependency and unsupported legal/compliance claims more explicitly, and no longer presents Code Integrity or AIRI in a way that looks artificially clean.
What STEM-BIO-AI Measures
STEM-BIO-AI is a deterministic local CLI scanner. It does not execute model inference, make network requests, or run the target application. It evaluates repository evidence surfaces: what signals are present in README, docs, package metadata, dependency manifests, CI/test infrastructure, code structure, and replication artifacts.
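As a rough illustration of what an "evidence surface" pass means in practice, a deterministic inventory might look like the sketch below. The file names and surface categories here are illustrative assumptions, not the scanner's actual detector set:

```python
from pathlib import Path

# Hypothetical evidence-surface inventory. Categories and candidate paths
# are assumptions for illustration; STEM-BIO-AI's real detector list is
# not published in this report.
EVIDENCE_SURFACES = {
    "readme": ["README.md", "README.rst"],
    "package_metadata": ["package.json", "pyproject.toml"],
    "dependency_manifests": ["package-lock.json", "pnpm-lock.yaml", "requirements.txt"],
    "ci": [".github/workflows", ".gitlab-ci.yml"],
    "tests": ["tests", "test", "__tests__"],
}

def inventory(repo_root: str) -> dict:
    """Report which evidence surfaces exist, without executing anything."""
    root = Path(repo_root)
    return {
        surface: [c for c in candidates if (root / c).exists()]
        for surface, candidates in EVIDENCE_SURFACES.items()
    }
```

The key property this sketch shares with the real scanner is that it only inspects what is on disk: no inference, no network, no runtime.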
The evaluation is organized into four stages.
| Stage | Weight | Description |
| --- | --- | --- |
| Stage 1 README Evidence Signal | 0.40 | What the README and package metadata claim |
| Stage 2R Repo-Local Consistency | 0.20 | Whether local code/doc surfaces support those claims |
| Stage 3 Code/Bio Responsibility | 0.40 | Engineering governance: tests, CI/CD, provenance, bias, release hygiene |
| Stage 4 Replication Evidence | reported separately | Reproducibility infrastructure |
Stage 4 is reported separately and does not alter the final score. Stage 3 remains the heaviest weighted stage and is still the clearest reason this repository remains in quarantine.
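Assuming the published stage weights combine as a simple weighted sum (the report does not spell out the exact aggregation formula), the headline score can be reproduced like this:

```python
# Hypothetical reconstruction of the final-score aggregation. The weights
# are the published ones; the linear combination is an assumption.
WEIGHTS = {"stage1": 0.40, "stage2r": 0.20, "stage3": 0.40}  # Stage 4 excluded

def final_score(stage_scores: dict) -> int:
    """Weighted sum of stage scores, rounded to an integer."""
    return round(sum(WEIGHTS[s] * stage_scores[s] for s in WEIGHTS))
```

With the stage scores reported in this audit (75, 40, 25), this yields 0.40 * 75 + 0.20 * 40 + 0.40 * 25 = 48, matching the Executive Summary.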
Score Breakdown
Stage 1 – README Evidence Signal: 75/100
| Signal | Points |
| --- | --- |
| Non-nascent README baseline | +60 |
| Bio/medical domain vocabulary in README | +10 |
| Bio/medical domain vocabulary in package metadata | +5 |
| Self-asserted privacy/compliance language without stronger grounding | +5 |
| CA-INDIRECT surface lacks explicit non-clinical boundary | -5 |
The README is the strongest part of the repository. It names the right biomedical sources, uses domain vocabulary competently, and presents a coherent user-facing story. But the same README also contains compliance-adjacent language that the reviewed repository sources do not substantiate.
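A self-assertion signal of this kind can be surfaced with a simple pattern pass over the README. The pattern list below is a hypothetical sketch, not the scanner's actual rule set:

```python
import re

# Illustrative compliance-adjacent phrase detector. The pattern list is an
# assumption for demonstration; a real rule set would be larger and curated.
COMPLIANCE_PATTERNS = [
    r"\bHIPAA[- ]compliant\b",
    r"\bGDPR[- ]compliant\b",
    r"\bSOC\s*2\b",
]

def flag_compliance_claims(readme_text: str) -> list[str]:
    """Return compliance-adjacent phrases found in README text."""
    hits = []
    for pattern in COMPLIANCE_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, readme_text, re.IGNORECASE))
    return hits
```

Note that a hit only tells the reviewer a claim exists; it says nothing about whether the claim is grounded, which is exactly how the current report treats it.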
The current scanner no longer treats "HIPAA-compliant architecture" as a strong regulatory-framework signal. It is now surfaced as a weaker self-assertion signal plus a separate unsupported legal/compliance claim warning. That is the right direction. The claim remains visible, but it is no longer credited as if the repository had demonstrated formal governance support.
Stage 2R – Repo-Local Consistency: 40/100
| Signal | Points |
| --- | --- |
| Non-nascent local repository baseline | +60 |
| README domain overlap with package metadata | +15 |
| Clinical-adjacent surfaces without non-diagnostic boundary | -20 |
| README claims runnable workflow without matching local support | -15 |
This stage captures the central contradiction. The repository reads like a polished product surface, but the local support evidence is shallow. The repository describes a working biomedical workflow, yet there is no matching support structure for tests, CI, or locally verifiable runnable scaffolding.
The -20 clinical-boundary deduction is especially important. Querying PubMed, ClinicalTrials.gov, FDA DailyMed, and ChEMBL without a clear non-diagnostic or non-clinical intended-use boundary is not a minor documentation omission. It is a governance gap.
Stage 3 – Code/Bio Responsibility: 25/100
| Rubric Item | Score | Maximum |
| --- | --- | --- |
| T1 CI/CD pipeline | 0 | 15 |
| T2 Domain-specific tests | 0 | 15 |
| T3 Changelog and release hygiene | 0 | 15 |
| B1 Data provenance controls | 15 | 15 |
| B2 Bias and limitations documentation | 0 | 15 |
| B3 COI, funding, acknowledgements | 5 | 5 |
| Raw total | 20 | 80 |
This is still the decisive stage.
Three engineering-governance items remain at zero:
- no workflow files,
- no test suite,
- no changelog or release history.
That means the repository still lacks the minimum operational evidence expected of a tool that presents itself as a serious biomedical interface.
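All three zero-scored items are, at heart, presence checks. A minimal sketch, assuming conventional file locations (the scanner's real search list may differ):

```python
from pathlib import Path

# Sketch of the three zero-scored governance checks. The candidate paths
# are common ecosystem conventions, assumed for illustration.
def governance_gaps(repo_root: str) -> list[str]:
    """Return the governance items with no supporting evidence on disk."""
    root = Path(repo_root)
    checks = {
        "T1 CI/CD pipeline": [".github/workflows", ".gitlab-ci.yml", ".circleci"],
        "T2 test suite": ["tests", "test", "__tests__", "spec"],
        "T3 changelog": ["CHANGELOG.md", "CHANGELOG", "HISTORY.md"],
    }
    return [item for item, paths in checks.items()
            if not any((root / p).exists() for p in paths)]
```

Because these are binary presence checks, the fix is equally concrete: adding any one of the conventional artifacts moves the corresponding item off zero.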
B1 now scores 15/15, which is materially different from earlier runs. The scanner now recognizes JavaScript dependency and lockfile surfaces correctly, so package.json, package-lock.json, and pnpm-lock.yaml count as provenance or dependency evidence where appropriate. That is a real improvement in the audit logic, not a generosity bump.
B3 also scores correctly because the repository acknowledges its relationship to Valyu. That disclosure matters. But provenance and acknowledgment do not compensate for missing tests, missing automation, and absent release hygiene.
Stage 4 – Replication Evidence: 30/100
| Evidence Item | Score | Maximum |
| --- | --- | --- |
| Container or compose file | 10 | 10 |
| Environment lock manifest | 10 | 10 |
| Exact dependency pins or hashes | 10 | 10 |
| Everything else | 0 | remaining |
Stage 4 is now stronger than it appeared in earlier runs because JavaScript lockfiles are correctly recognized. The repository gets real credit for containerization and dependency pinning.
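Lockfile recognition of this kind is a simple presence check. A sketch using standard JavaScript ecosystem lockfile names (the exact scanner logic is an assumption):

```python
from pathlib import Path

# Standard JavaScript lockfile names. Treating any of them as pinning
# evidence mirrors the behavior described above, but the precise scoring
# hook is assumed for illustration.
LOCKFILES = ["package-lock.json", "pnpm-lock.yaml", "yarn.lock"]

def pinning_evidence(repo_root: str) -> dict:
    """Report which lockfiles exist and whether pinning evidence was found."""
    root = Path(repo_root)
    found = [name for name in LOCKFILES if (root / name).is_file()]
    return {"lockfiles": found, "pinned": bool(found)}
```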
That said, the replication lane is still thin:
- no reproducibility section,
- no runnable examples,
- no dataset URL evidence,
- no seed setting,
- no citation file,
- no checksum or artifact-verification surface.
So 30/100 is an improvement, but not a strong reproducibility posture.
Top Risks
The current v7 report surfaces six top risks. Four are structural and two are mirrored from code-integrity warning lanes.
Risk 1: Clinical-adjacent surfaces exist without an explicit non-diagnostic or non-clinical boundary.
This remains the most important governance issue. The tool works close to biomedical and clinical information flows, but the repository does not state a firm intended-use boundary that rules out diagnosis, treatment, or patient-care use.
Risk 2: Self-asserted compliance or privacy-governance claim requires independent verification.
The repository uses compliance-adjacent language in the README. The scanner now treats this as a weak self-assertion that must be reviewed independently rather than as evidence of formal governance maturity.
Risk 3: Legal, privacy, or compliance claim appears without supporting governance or security-grounding evidence in reviewed repository sources.
This is broader than the HIPAA phrase itself. The scanner is now explicitly surfacing unsupported legal/compliance claims in the report layer and regulatory traceability layer.
Risk 4: Core workflow appears materially dependent on named external service providers; local or self-host claims may overstate operational independence.
This is the clearest new structural warning in the v7 artifact. The repository depends on named external services such as Valyu and Daytona. That dependency is now visible as a first-class risk rather than being buried in config files or documentation assumptions.
Risk 5: C2_dependency_pinning: WARN.
In this report, C2 no longer reads as "all clear." It now doubles as a code-integrity lane for external operational dependency risk when a named provider becomes a material single point of reliance.
Risk 6: C4_exception_handling_clinical_adjacent_paths: WARN.
The current C4 warning is not about a literal except: pass pattern. It is being used here as a boundary-integrity lane for unsupported legal/compliance claims and clinical-adjacent governance concerns.
Unsupported HIPAA / Compliance Claim: Evidence-Bound Reading
The README includes the phrase "HIPAA-compliant architecture (when self-hosted)."
The important point is not to overread the scanner. STEM-BIO-AI does not determine legal compliance. It determines whether reviewed repository sources surface evidence that would help ground a claim like this.
At the repository-evidence layer, the answer is weak:
- the compliance phrase is present,
- stronger governance or security-grounding evidence is not surfaced clearly in reviewed sources,
- the claim therefore becomes a review signal rather than a positive trust signal.
This is why the current report now surfaces both:
- a weak Stage 1 self-assertion signal, and
- a broader unsupported legal/compliance claim warning.
That is a more defensible interpretation than the earlier behavior, which risked treating compliance-adjacent wording as a stronger positive signal than it deserved.
For context, a HIPAA-oriented operational claim normally invites review of at least the following:
- Business Associate Agreement expectations before PHI handling
- administrative safeguards such as risk analysis, workforce training, and contingency planning
- physical safeguards for devices and workstations
- technical safeguards such as access controls, audit controls, integrity controls, and transmission security
- breach-notification readiness under 45 CFR Part 164, Subpart D
The current report does not say those controls are absent in deployment. It says the reviewed repository sources do not surface enough evidence to treat the README phrase as grounded governance proof.
That distinction matters because the downside of over-reading a README compliance claim is not abstract. Under the 2025 HHS inflation-adjusted civil monetary penalty schedule, HIPAA violations can scale as follows:
| Tier | Per Violation | Annual Maximum |
| --- | --- | --- |
| No knowledge | $137 | $27,481 |
| Reasonable cause | $1,379 | $137,897 |
| Willful neglect, corrected | $13,785 | $68,928 |
| Willful neglect, not corrected | $27,570 | $2,190,294 |
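To make the scale of that schedule concrete, it reads as a capped per-violation exposure. This is back-of-envelope arithmetic on the figures quoted above, not legal guidance:

```python
# The 2025 penalty figures quoted in the table above, as (per-violation,
# annual cap) pairs. Values are copied from the report, not independently
# verified here.
PENALTIES = {
    "no_knowledge": (137, 27_481),
    "reasonable_cause": (1_379, 137_897),
    "willful_neglect_corrected": (13_785, 68_928),
    "willful_neglect_uncorrected": (27_570, 2_190_294),
}

def exposure(tier: str, violations: int) -> int:
    """Per-violation exposure for a tier, capped at the annual maximum."""
    per_violation, annual_cap = PENALTIES[tier]
    return min(per_violation * violations, annual_cap)
```

Even at the lowest tier, the annual cap is reached after roughly 200 violations, which is why an unverified compliance claim is not a cosmetic issue.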
The narrow takeaway is not "this repository is legally noncompliant." The narrower and more defensible takeaway is that a compliance-adjacent README claim should not be operationalized without independent verification of controls, contracts, deployment boundaries, and incident-handling procedures.
Regulatory Traceability
The traceability block is useful here because it shows where the repository is generating governance-relevant signals without turning those signals into compliance claims.
| Framework | Article / Signal | Stage | Strength |
| --- | --- | --- | --- |
| EU AI Act | Article 13 – Transparency | Stage 1 | Weak |
| Compliance claim grounding signal | Traceability-only | Stage 1 | Weak to moderate |
| IMDRF SaMD/GMLP | Clinical context boundary | Stage 2R | Moderate |
| EU AI Act | Article 10 – Data governance | Stage 3 | Weak |
| EU AI Act | Article 12 – Record keeping | Stage 4 | Moderate |
This is the right use of traceability. It helps reviewers orient the evidence. It does not certify compliance, safety, or readiness.
AIRI Coverage
Here, AIRI refers to the AI Risk Repository from the MIT AI Risk Initiative: https://airisk.mit.edu/
The AI Risk Repository has three core parts:
- the AI Risk Database, which captures 1700+ risks extracted from 74 existing AI-risk frameworks and classifications,
- the Causal Taxonomy of AI Risks, which classifies how, when, and why these risks occur,
- the Domain Taxonomy of AI Risks, which classifies risks into 7 domains and 24 subdomains, such as false or misleading information.
In STEM-BIO-AI, AIRI is used as a bounded risk-vocabulary layer that connects local detector findings to broader named AI-risk categories when detector-to-risk mappings exist. It is not a separate scoring engine and it is not a truth layer.
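Mechanically, a bounded risk-vocabulary layer of this kind is just a mapping from detector findings to risk codes. A hypothetical sketch (the AIRI codes are taken from this report; the detector names and mapping structure are assumptions):

```python
# Hypothetical detector-to-AIRI mapping. The risk codes appear in this
# report; the detector names and the mapping itself are illustrative.
DETECTOR_TO_AIRI = {
    "unsupported_compliance_claim": ["39.25.00"],       # Verifiability
    "external_dependency_concentration": ["72.04.02"],  # Market Concentration
    "missing_clinical_boundary": ["24.04.01"],          # Physical/Psych. Harms
}

TOTAL_MAPPED_RISKS = 32  # size of the bounded vocabulary in this report

def airi_coverage(findings: list[str]) -> tuple[int, float]:
    """Count distinct AIRI risks reached by the given detector findings."""
    covered = {code for f in findings for code in DETECTOR_TO_AIRI.get(f, [])}
    return len(covered), round(len(covered) / TOTAL_MAPPED_RISKS, 3)
```

The design point is that coverage is bounded by the mapping: a finding with no mapped code contributes nothing, which is why unmapped territory like reidentification stays visible as a gap.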
The most important difference from older outputs is that AIRI is no longer 0/31.
Current result:
- Covered Risks: 8 / 32
- Coverage Rate: 0.250
Examples now surfaced:
- 24.01.03 Safe exploration problem with widely deployed AI assistants
- 24.04.01 Physical and Psychological Harms
- 33.01.05 Privacy and security
- 39.25.00 Verifiability
- 60.02.01 Reliability issues
- 69.01.00 False information
- 70.01.02 Accidental harm
- 72.04.02 Market Concentration and Infrastructure Dependencies
This matters because the earlier 0/31 output could be misread as "no relevant AIRI risk signal." The current report is better aligned with the actual findings. Unsupported compliance claims, dependency concentration, and boundary weaknesses now connect to a broader AIRI-backed risk vocabulary.
Important boundary:
- these AIRI links remain bounded by detector mappings,
- they do not prove actual harm, causality, clinical failure, or legal noncompliance,
- and they still leave important unmapped or partially mapped territory.
Known gaps still called out explicitly:
- 65.03.03 Reidentification
- 70.02.02 Misinformation – hallucination of clinical knowledge
Those gap notes are still important because they point to plausible failure territory that is not yet fully covered by the current static detector set.
Code Integrity
The biggest interpretive improvement in the v7 report is that Code Integrity no longer looks falsely clean.

| Check | Result |
| --- | --- |
| C1 Hardcoded credentials | PASS |
| C2 Dependency pinning / external operational dependency lane | WARN |
| C3 Deprecated patient-adjacent paths | PASS |
| C4 Boundary-integrity / unsupported compliance claim lane | WARN |
This is much closer to what a reviewer would intuitively expect after reading the rest of the report.
Two cautions still matter:
- PASS does not mean strong overall engineering quality. It means the specific detector lane did not surface a mapped problem.
- Some architectural issues remain outside direct static detection. For example, design-level fail-open or mock-auth behavior may still require a dedicated detector rather than relying on current C4 logic.
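For contrast with those design-level gaps, a C1-style lane is the kind of check static detection handles well. A minimal sketch with an illustrative pattern (a production lane would be broader and tuned against false positives):

```python
import re

# Minimal sketch of a C1-style static lane: flag credential-like
# assignments. The pattern is illustrative, not STEM-BIO-AI's actual rule.
CREDENTIAL_PATTERNS = [
    re.compile(r"""(?i)\b(api[_-]?key|secret|password|token)\s*[:=]\s*['"][^'"]{8,}['"]"""),
]

def c1_scan(source: str) -> list[str]:
    """Return credential-like assignments found in a source string."""
    return [m.group(0) for p in CREDENTIAL_PATTERNS for m in p.finditer(source)]
```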
What Would Need to Change
For this repository to move out of T1 Quarantine, the changes are not subtle.
Immediate
- Add an explicit non-clinical / non-diagnostic intended-use boundary.
- Remove or reframe unsupported compliance language unless governance evidence is documented clearly.
- Document the operational dependency on Valyu and Daytona honestly, including what breaks when those services are unavailable.
Short-term
- Add CI/CD.
- Add tests.
- Add release hygiene and changelog discipline.
- Add explicit limitations/bias discussion.
Medium-term
- Add reproducibility guidance and runnable examples.
- Add stronger artifact and data-reference documentation.
- Add mitigations around AIRI gap territory such as reidentification and hallucination-adjacent misuse risk.
Method Boundary
This remains a deterministic local CLI scan. No LLM inference, runtime execution, or network activity was used to produce the result. The report is therefore strongest when describing repository evidence surfaces and weaker when inferring live deployment behavior.
That boundary matters. This report is a pre-screen and review aid, not clinical certification, regulatory clearance, or legal advice.
STEM-BIO-AI | Weekly Repository Evaluation #1 | Every Friday
Next: BioClaw – AI-powered bioinformatics research assistant for WhatsApp group chats. 374 stars. A bioRxiv paper. Where does the genomic data go after the message is sent?
Next Step
If your AI system works in demos but still feels fragile, start here.
Flamehaven reviews where AI systems overclaim, drift quietly, or remain operationally fragile under real conditions. Start with a direct technical conversation or review how the work is structured before you reach out.
Direct founder contact · Response within 1-2 business days
Related Reading
Scientific & BioAI Infrastructure
Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x
Scientific & BioAI Infrastructure
From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System