Public Scientific Correctness Leaderboard

We benchmark our skills.
We publish the failures.
We fix them.

Every ClawBio skill is tested by an independent benchmark suite that checks scientific correctness, edge-case behaviour, and reporting honesty. The full scorecard, the bench commit, and every open remediation task are public. The numbers below are regenerated from clawbio_bench against the latest ClawBio commit, not copied from a static report.

Last Run: 2026-05-03
Bench Author: Biostochastics LLC
ClawBio Commit: latest at time of run
Overall: 168 / 182 tests passing (92.3%)
10 skills audited across 3 dimensions (safety, correctness, honesty), up from 80 / 140 (57.1%) at the original 2026-04-05 audit.
Skill                       Pass / Total   Rate     Worst Findings                   Status
claw-metagenomics           7 / 7          100.0%   none                             Clear
equity-scorer               15 / 15        100.0%   none                             Clear
nutrigx-advisor             10 / 10        100.0%   none                             Clear
bio-orchestrator            53 / 54        98.1%    unroutable_crash                 Clear
pharmgx-reporter            43 / 44        97.7%    incorrect_indeterminate          Clear
fine-mapping                19 / 20        95.0%    susie_inf_est_tausq_ignored      Clear
clinical-variant-reporter   4 / 5          80.0%    gene_disease_context_missing     Clear
cvr-acmg-correctness        9 / 13         69.2%    none                             Watch
gwas-prs                    5 / 8          62.5%    missing_output (3)               Watch
cvr-variant-identity        3 / 6          50.0%    transcript_selection_error (3)   Watch

Status legend: Clear = pass rate at or above 75%; Watch = below 75%, active remediation; P1 = medium-priority fix; Infra = harness setup error (under investigation, not a science regression). Bench source and harness implementations: biostochastics/clawbio_bench.
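The one published threshold in the legend can be expressed directly. A minimal sketch, assuming status is derived from pass rate alone and that the P1 and Infra labels are assigned manually rather than computed:

```python
def status_from_rate(passed: int, total: int) -> str:
    """Map a skill's pass rate to a leaderboard status.

    Assumes the legend's single numeric rule: a rate at or above 75%
    is Clear, anything below is Watch. P1 and Infra are manual labels,
    not derivable from the rate.
    """
    return "Clear" if passed / total >= 0.75 else "Watch"

# e.g. cvr-acmg-correctness: 9 / 13 -> 69.2% -> Watch
assert status_from_rate(9, 13) == "Watch"
assert status_from_rate(53, 54) == "Clear"
```

Note that exactly 75% lands on Clear under the "at or above" wording.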

Most bio-skill repositories never publish a number.

"59 skills, 1,401 tests" is a claim any project can make. Scientific correctness, edge case stability, and reporting honesty are different measurements, and they are the ones that decide whether a skill should run on a clinician's data.

Independent

The benchmark lives in a separate repository and is written and run by a third party, so ClawBio cannot quietly tune the rubric.

Three dimensions

Safety: does the skill reject unsafe input? Correctness: does it produce the right number? Honesty: does the report match what the code actually computed?
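The three dimensions above can be modelled as a tag on each test, with per-dimension tallies rolled up from the raw results. A minimal sketch; the test names and results here are hypothetical, not taken from the actual bench:

```python
from collections import Counter
from enum import Enum

class Dimension(Enum):
    SAFETY = "safety"            # does the skill reject unsafe input?
    CORRECTNESS = "correctness"  # does it produce the right number?
    HONESTY = "honesty"          # does the report match what was computed?

# Hypothetical results: (test_id, dimension, passed)
results = [
    ("rejects_malformed_vcf", Dimension.SAFETY, True),
    ("fst_point_estimate", Dimension.CORRECTNESS, True),
    ("report_matches_model", Dimension.HONESTY, False),
]

# Tally passes and totals per dimension (Counter returns 0 for missing keys).
passed = Counter(dim for _, dim, ok in results if ok)
total = Counter(dim for _, dim, _ in results)
per_dim = {d.value: f"{passed[d]} / {total[d]}" for d in Dimension}
```

Scoring each dimension separately is what keeps a skill from hiding an honesty failure behind a high overall correctness rate.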

Public failure surface

Every failing test maps to a tagged finding (e.g. fst_mislabeled, heim_unbounded, pathology_flagged). Every finding maps to a remediation task with an assignee.
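The finding-to-task invariant sketched above is easy to check mechanically. A minimal sketch, with hypothetical skill names and assignees; the real tracker's schema is not published here:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    tag: str       # e.g. "fst_mislabeled"
    skill: str     # hypothetical attribution

@dataclass
class RemediationTask:
    finding: Finding
    assignee: str
    resolved: bool = False

def unassigned(findings, tasks):
    """Findings with no remediation task yet; the stated policy demands zero."""
    covered = {t.finding for t in tasks}
    return [f for f in findings if f not in covered]

findings = [Finding("fst_mislabeled", "skill-a"),
            Finding("heim_unbounded", "skill-b")]
tasks = [RemediationTask(findings[0], assignee="mk")]
assert unassigned(findings, tasks) == [findings[1]]
```

A CI step that fails when `unassigned` is non-empty is one way to enforce "every finding maps to a task".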

Block on regression

The pr-audit skill runs the bench on touched skills before merge. New regressions block the PR. Resolved findings get highlighted as progress.
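The merge gate described above reduces to two set differences between the baseline failures and the failures on the PR branch. A minimal sketch, assuming tests are identified by stable string IDs (the IDs below are hypothetical):

```python
def new_regressions(baseline_failures: set, current_failures: set) -> set:
    """Tests failing on this PR that passed at baseline: these block the merge."""
    return current_failures - baseline_failures

def resolved(baseline_failures: set, current_failures: set) -> set:
    """Tests fixed by this PR: highlighted as progress, not blocking."""
    return baseline_failures - current_failures

baseline = {"gwas_prs::missing_output_1", "cvr::transcript_selection_2"}
current = {"gwas_prs::missing_output_1", "fine_mapping::tausq_check"}

assert new_regressions(baseline, current) == {"fine_mapping::tausq_check"}
assert resolved(baseline, current) == {"cvr::transcript_selection_2"}
```

The PR is blocked exactly when `new_regressions` is non-empty; pre-existing failures pass through unchanged so a low-scoring skill does not freeze unrelated work.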

Submit your bio skill. We will benchmark it.

Skill authors get an independent third-party correctness audit, a public scoreboard slot, and a remediation task list they can work against. No private benchmarks, no marketing-grade pass rates.