Public Scientific Correctness Leaderboard

We benchmark our skills.
We publish the failures.
We fix them.

Every ClawBio skill is tested by an independent benchmark suite that checks scientific correctness, edge-case behaviour, and reporting honesty. The full scorecard, the bench commit, and every open remediation task are public. The numbers below are regenerated from clawbio_bench against the latest ClawBio commit, not copied from a static report.

Last Run: 2026-05-03
Bench Author: Biostochastics LLC
ClawBio Commit: latest at time of run
Overall: 168 / 182 tests passing (92.3%)
10 skills audited across 3 dimensions (safety, correctness, honesty), up from 80 / 140 (57.1%) at the original 2026-04-05 audit.
Skill                       Pass / Total   Rate     Worst Findings                   Status
claw-metagenomics           7 / 7          100.0%   none                             Clear
equity-scorer               15 / 15        100.0%   none                             Clear
nutrigx-advisor             10 / 10        100.0%   none                             Clear
bio-orchestrator            53 / 54        98.1%    unroutable_crash                 Clear
pharmgx-reporter            43 / 44        97.7%    incorrect_indeterminate          Clear
fine-mapping                19 / 20        95.0%    susie_inf_est_tausq_ignored      Clear
clinical-variant-reporter   4 / 5          80.0%    gene_disease_context_missing     Clear
cvr-acmg-correctness        9 / 13         69.2%    none                             Watch
gwas-prs                    5 / 8          62.5%    missing_output (3)               Watch
cvr-variant-identity        3 / 6          50.0%    transcript_selection_error (3)   Watch

Status legend: Clear = pass rate at or above 75%; Watch = below 75%, active remediation; P1 = medium-priority fix; Infra = harness setup error (under investigation, not a science regression). Bench source and harness implementations: biostochastics/clawbio_bench.
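The one published threshold in the legend can be expressed directly. A minimal sketch, assuming status is derived from pass rate alone and that the P1 and Infra labels are assigned manually rather than computed:

```python
def status_from_rate(passed: int, total: int) -> str:
    """Map a skill's pass rate to a leaderboard status.

    Assumes the legend's single numeric rule: a rate at or above 75%
    is Clear, anything below is Watch. P1 and Infra are manual labels,
    not derivable from the rate.
    """
    return "Clear" if passed / total >= 0.75 else "Watch"

# e.g. cvr-acmg-correctness: 9 / 13 -> 69.2% -> Watch
assert status_from_rate(9, 13) == "Watch"
assert status_from_rate(53, 54) == "Clear"
```

Note that exactly 75% lands on Clear under the "at or above" wording.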

Most bio-skill repositories never publish a number.

"59 skills, 1,401 tests" is a claim any project can make. Scientific correctness, edge case stability, and reporting honesty are different measurements, and they are the ones that decide whether a skill should run on a clinician's data.

Independent

The benchmark lives in a separate repository and is written and run by a third party, so ClawBio cannot quietly tune the rubric.

Three dimensions

Safety: does the skill reject unsafe input? Correctness: does it produce the right number? Honesty: does the report match what the code actually computed?
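The three dimensions above can be modelled as a tag on each test, with per-dimension tallies rolled up from the raw results. A minimal sketch; the test names and results here are hypothetical, not taken from the actual bench:

```python
from collections import Counter
from enum import Enum

class Dimension(Enum):
    SAFETY = "safety"            # does the skill reject unsafe input?
    CORRECTNESS = "correctness"  # does it produce the right number?
    HONESTY = "honesty"          # does the report match what was computed?

# Hypothetical results: (test_id, dimension, passed)
results = [
    ("rejects_malformed_vcf", Dimension.SAFETY, True),
    ("fst_point_estimate", Dimension.CORRECTNESS, True),
    ("report_matches_model", Dimension.HONESTY, False),
]

# Tally passes and totals per dimension (Counter returns 0 for missing keys).
passed = Counter(dim for _, dim, ok in results if ok)
total = Counter(dim for _, dim, _ in results)
per_dim = {d.value: f"{passed[d]} / {total[d]}" for d in Dimension}
```

Scoring each dimension separately is what keeps a skill from hiding an honesty failure behind a high overall correctness rate.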

Public failure surface

Every failing test maps to a tagged finding (e.g. fst_mislabeled, heim_unbounded, pathology_flagged). Every finding maps to a remediation task with an assignee.
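The finding-to-task invariant sketched above is easy to check mechanically. A minimal sketch, with hypothetical skill names and assignees; the real tracker's schema is not published here:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    tag: str       # e.g. "fst_mislabeled"
    skill: str     # hypothetical attribution

@dataclass
class RemediationTask:
    finding: Finding
    assignee: str
    resolved: bool = False

def unassigned(findings, tasks):
    """Findings with no remediation task yet; the stated policy demands zero."""
    covered = {t.finding for t in tasks}
    return [f for f in findings if f not in covered]

findings = [Finding("fst_mislabeled", "skill-a"),
            Finding("heim_unbounded", "skill-b")]
tasks = [RemediationTask(findings[0], assignee="mk")]
assert unassigned(findings, tasks) == [findings[1]]
```

A CI step that fails when `unassigned` is non-empty is one way to enforce "every finding maps to a task".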

Block on regression

The pr-audit skill runs the bench on touched skills before merge. New regressions block the PR. Resolved findings get highlighted as progress.
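The merge gate described above reduces to two set differences between the baseline failures and the failures on the PR branch. A minimal sketch, assuming tests are identified by stable string IDs (the IDs below are hypothetical):

```python
def new_regressions(baseline_failures: set, current_failures: set) -> set:
    """Tests failing on this PR that passed at baseline: these block the merge."""
    return current_failures - baseline_failures

def resolved(baseline_failures: set, current_failures: set) -> set:
    """Tests fixed by this PR: highlighted as progress, not blocking."""
    return baseline_failures - current_failures

baseline = {"gwas_prs::missing_output_1", "cvr::transcript_selection_2"}
current = {"gwas_prs::missing_output_1", "fine_mapping::tausq_check"}

assert new_regressions(baseline, current) == {"fine_mapping::tausq_check"}
assert resolved(baseline, current) == {"cvr::transcript_selection_2"}
```

The PR is blocked exactly when `new_regressions` is non-empty; pre-existing failures pass through unchanged so a low-scoring skill does not freeze unrelated work.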

Submit your bio skill. We will benchmark it.

Skill authors get an independent third-party correctness audit, a public scoreboard slot, and a remediation task list they can work against. No private benchmarks, no marketing-grade pass rates.