Every ClawBio skill is tested by an independent benchmark suite that checks scientific correctness, edge case behaviour, and reporting honesty. The full scorecard, the bench commit, and every open remediation task are public. Numbers below are regenerated from clawbio_bench against the latest ClawBio commit, not from a static report.
| Skill | Pass / Total | Rate | Worst Findings | Status |
|---|---|---|---|---|
| claw-metagenomics | 7 / 7 | 100.0% | none | Clear |
| equity-scorer | 15 / 15 | 100.0% | none | Clear |
| nutrigx-advisor | 10 / 10 | 100.0% | none | Clear |
| bio-orchestrator | 53 / 54 | 98.1% | unroutable_crash | Clear |
| pharmgx-reporter | 43 / 44 | 97.7% | incorrect_indeterminate | Clear |
| fine-mapping | 19 / 20 | 95.0% | susie_inf_est_tausq_ignored | Clear |
| clinical-variant-reporter | 4 / 5 | 80.0% | gene_disease_context_missing | Clear |
| cvr-acmg-correctness | 9 / 13 | 69.2% | none | Watch |
| gwas-prs | 5 / 8 | 62.5% | missing_output (3) | Watch |
| cvr-variant-identity | 3 / 6 | 50.0% | transcript_selection_error (3) | Watch |
Status legend: **Clear** = pass rate at or above 75%; **Watch** = active remediation; **P1** = medium-priority fix; **Infra** = harness setup error (under investigation, not a science regression). Bench source and harness implementations: biostochastics/clawbio_bench.
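Since the numbers above are regenerated from raw bench results rather than copied from a static report, the scorecard derivation can be sketched. This is a simplified illustration under assumed names: the `score` function, the result format, and the Clear/Watch-only status logic are not the real clawbio_bench API (the actual legend also includes P1 and Infra states assigned by hand).

```python
# Hypothetical sketch: recompute pass rates and status from bench results.
# The result schema and function name are assumptions, not clawbio_bench's
# real interface; only the 75% "Clear" threshold comes from the legend.
CLEAR_THRESHOLD = 0.75

def score(results):
    """results: {skill: (passed, total)} -> {skill: (rate_pct, status)}."""
    scorecard = {}
    for skill, (passed, total) in results.items():
        rate = passed / total
        status = "Clear" if rate >= CLEAR_THRESHOLD else "Watch"
        scorecard[skill] = (round(rate * 100, 1), status)
    return scorecard

card = score({"gwas-prs": (5, 8), "equity-scorer": (15, 15)})
print(card["gwas-prs"])  # 62.5% falls below the 75% bar, so "Watch"
```

The point of deriving status from the same numbers every run is that the table cannot drift from the results it claims to summarize.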
"59 skills, 1,401 tests" is a claim any project can make. Scientific correctness, edge case stability, and reporting honesty are different measurements, and they are the ones that decide whether a skill should run on a clinician's data.
The benchmark lives in a separate repository and is written and run by a third party, so ClawBio cannot quietly tune the rubric.
Safety: does the skill reject unsafe input? Correctness: does it produce the right number? Honesty: does the report match what the code actually computed?
Every failing test maps to a tagged finding (fst_mislabeled, heim_unbounded, pathology_flagged). Every finding maps to a remediation task with an assignee.
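The finding-to-task mapping described above can be sketched as a small data model. The class and field names here are illustrative assumptions, not the structures clawbio_bench actually uses; only the tag names come from this document.

```python
# Hypothetical sketch of the failing test -> tagged finding -> remediation
# task chain. Dataclass shapes are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class Finding:
    tag: str                 # e.g. "transcript_selection_error"
    skill: str
    failing_tests: list = field(default_factory=list)

@dataclass
class RemediationTask:
    finding: Finding
    assignee: str
    open: bool = True        # closed once the finding is resolved

finding = Finding("missing_output", "gwas-prs", ["test_prs_report_written"])
task = RemediationTask(finding, assignee="maintainer-a")
assert task.open and task.finding.tag == "missing_output"
```

The invariant worth noting is that no failing test can exist without a tag, and no tag without an owner, so nothing fails silently.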
The pr-audit skill runs the bench on touched skills before merge. New regressions block the PR. Resolved findings get highlighted as progress.
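The pre-merge gate can be sketched as a comparison against a stored baseline: run the bench only on skills the PR touched, and block if any skill passes fewer tests than before. The function names and baseline format are assumptions, not the real pr-audit implementation.

```python
# Hypothetical sketch of the pr-audit gate: block a PR on new regressions
# in touched skills. `run_bench` stands in for the real harness entry point.
def gate(touched_skills, baseline, run_bench):
    """Return (ok, regressions); run_bench(skill) -> passed test count."""
    regressions = {}
    for skill in touched_skills:
        passed = run_bench(skill)
        if passed < baseline.get(skill, 0):   # fewer passes than baseline
            regressions[skill] = (baseline[skill], passed)
    return (not regressions, regressions)

baseline = {"fine-mapping": 19, "gwas-prs": 5}
ok, regs = gate(["fine-mapping"], baseline, run_bench=lambda s: 18)
print(ok, regs)  # a drop from 19 to 18 passes blocks the merge
```

Scoping the run to touched skills keeps the gate fast enough to sit in the merge path, while the baseline comparison distinguishes new regressions from pre-existing Watch items.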
Skill authors get an independent third-party correctness audit, a public scoreboard slot, and a remediation task list they can work against. No private benchmarks, no marketing-grade pass rates.