⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠟⠛⠉⠉⠉⠈⠉⠉⠙⠛⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠋⠀⠀⢀⣠⣤⣤⣤⣤⣤⣤⣄⡀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠋⠀⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠈⠻⣿⣿⣿⣿⣿⣿ ██████ ██ █████ ██ ██ ██████ ██ ██████
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠁⠀⣠⣾⣿⣿⣿⣿⡿⢿⠛⡟⢿⣿⣿⣿⣿⣷⠀⠀⢹⣿⣿⣿⣿⣿ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██ ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⢠⣿⣿⣿⣿⣿⡿⡋⠃⠈⠀⠀⠈⠈⢻⣿⣿⣿⠀⠀⠀⣿⣿⣿⣿⣿ ██ ██ ███████ ██ █ ██ ██████ ██ ██ ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡃⣿⣿⣿⣿⣿⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣻⣿⣿⣿⣿ ██ ██ ██ ██ ██ ███ ██ ██ ██ ██ ██ ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠹⣿⣿⣿⣿⣿⣦⡂⡀⠀⠀⠀⠀⠀⣰⣶⣶⣶⠀⠀⠀⣿⣿⣿⣿⣿ ██████ ███████ ██ ██ ███ ███ ██████ ██ ██████
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⠀⠉⠻⣿⣿⣿⣿⣿⣧⣼⣀⣆⣼⣾⣿⣿⣿⣿⠀⠀⢰⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⡀⠀⠘⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠃⠀⣠⣿⣿⣿⣿⣿⣿ ██████ ███████ ███ ██ ██████ ██ ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠛⣿⣿⣄⡀⠀⠈⠙⠻⠿⠿⠿⠿⠿⠿⠟⠋⠀⣀⣼⣿⣿⣿⣿⣿⣿⣿ ██ ██ ██ ████ ██ ██ ██ ██
⣿⣿⣿⣿⣿⣿⣿⠿⠃⠀⢀⣼⣿⣿⣿⣦⣄⣀⠀⠀⠀⠀⠀⠀⢀⣀⣤⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿ ██████ █████ ██ ██ ██ ██ ███████
⣿⣿⣿⣿⣿⠟⠉⠀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ██ ██ ██ ██ ██ ██ ██ ██ ██
⣿⣿⡿⢋⠁⢀⣤⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿ ███████ ██████ ███████ ██ ████ ██████ ██ ██
⢟⠉⠀⣠⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣷⣤⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
**What this is.** `clawbio_bench` is a standalone Python audit suite that evaluates the external ClawBio bioinformatics platform for safety, correctness, and honesty. It runs behavioral harnesses against a local clone of ClawBio, compares each skill's output against analytically derived ground truth, and emits tamper-evident JSON verdicts with a SHA-256 chain of custody over every input, output, and artifact.

**What this is NOT.** It is not part of ClawBio, it does not bundle ClawBio, and it is not a performance benchmark. To run it you need a local checkout of the ClawBio repository (see Requirements).
Most bioinformatics benchmarks answer "does it run?" This suite answers "is it safe, correct, and honest?" — with machine-verifiable evidence at every step.
- For ClawBio / tool authors: run `clawbio-bench --smoke` in CI to catch regressions in safety and correctness across commits.
- For external auditors: use the rubric design, ground-truth methodology, and verdict format as a reference for auditing other computational biology tools.
- Audit report (PDF) — a 21-page report from a 7-harness smoke run against ClawBio HEAD captured at v0.1.2 (125/147 tests passing). The current suite (v0.1.4) runs nine harnesses with 175 test cases (see Coverage Scope) and reports 163/175 (93.1%) against ClawBio HEAD `e7590141` (2026-04-07), with the open finding spotlighted in Confirmed Findings below.
| Dimension | What it means | Example finding category |
|---|---|---|
| Safety | Does the tool refuse or isolate unsafe inputs? No `shell=True`, no crashes on malformed genotypes, no silent suppression of non-zero exits. | `injection_blocked`, `edge_handled`, `exit_handled` |
| Correctness | Does the tool produce the right numerical answer against an analytically derived reference? FST values, phenotype calls, PRS, HEIM bounds. | `fst_correct`, `correct_determinate`, `score_correct` |
| Honesty | Does the tool report what it actually did, not what it claims to do? An "honesty" failure is a tool that computes Nei's GST while labeling its output "Hudson's FST", or a CSV mode that inflates coverage beyond what was measured. | `fst_mislabeled`, `csv_inflated`, `disclosure_failure` |
"Honesty" is the distinctive axis. Correctness failures are usually obvious; honesty failures are subtle, dangerous, and rarely caught by conventional test suites.
- Requirements
- Install
- Quick Start
- How It Works
- Process Isolation — Does ClawBio Actually Run?
- What It Produces
- Core Concepts
- Coverage Scope
- Harnesses
- Ground Truth Formats
- Verdict Schema as External Contract
- Core Capabilities
- Implementation Safeguards
- Current Scope and Limitations
- Design Principles
- Understanding Results (Exit Codes)
- Run Tests
- Continuous Audit (GitHub Actions)
- Confirmed Findings at ClawBio HEAD
- References
- Roadmap
- Documentation
- License
- Python 3.11 or later (3.11, 3.12, 3.13, 3.14 are all CI-tested)
- Git available on `PATH`
- A local clone of ClawBio — this repository does not bundle ClawBio. Every harness run needs a `--repo /path/to/ClawBio` argument.
- numpy and pandas in the benchmark virtualenv — ClawBio's `clawbio.common` package unconditionally imports `scrna_io`, which requires numpy and pandas at import time, even for harnesses that don't use scRNA functionality. Without these packages, every harness will report mass `unroutable_crash` / `harness_error` verdicts from `ModuleNotFoundError` rather than real audit findings. Install the `[dev]` or `[finemapping]` extras to get them automatically.
- Typst CLI on `PATH` — required only for PDF report generation (`python scripts/generate_report.py`). Install via `brew install typst` (macOS), `cargo install typst-cli`, or download from typst.app. The benchmark itself runs without Typst; only the report renderer needs it.
- Operating system: Linux or macOS (Windows is untested)
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e . # core: msgspec, regex, ruamel.yaml
pip install -e ".[dev]" # + pytest, ruff, mypy, pre-commit, rich, numpy, pandas
pip install -e ".[all]" # + viz, ui, finemapping, scikit-learn (for CI / daily audit)
pip install -e ".[viz]" # + matplotlib for heatmap rendering
pip install -e ".[ui]" # + rich for styled CLI output
pip install -e ".[finemapping]" # + numpy, pandas for the fine-mapping subprocess driverThe core install is deliberately minimal — every runtime dependency expands the trusted base of an audit tool, so each one has to justify itself (see Core Capabilities).
The minimum path from clone to first result, assuming you already have a
ClawBio checkout at ~/src/ClawBio:
git clone https://github.com/biostochastics/clawbio_bench.git
cd clawbio_bench
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]" # includes numpy/pandas needed by ClawBio
clawbio-bench --smoke --repo ~/src/ClawBio

That runs every harness against ClawBio's HEAD commit (about 25 seconds)
and writes results to ./results/suite/<timestamp>/.
# List available harnesses and test case counts
clawbio-bench --list
# Machine-readable harness inventory (for scripting / dashboards)
clawbio-bench --list --json
# Single harness only
clawbio-bench --smoke --harness orchestrator --repo ~/src/ClawBio
# Preview what would run without executing
clawbio-bench --smoke --repo ~/src/ClawBio --dry-run
# Last 10 commits — longitudinal sweep, quiet mode for CI
clawbio-bench --regression-window 10 --repo ~/src/ClawBio -q
# Full longitudinal sweep on a specific branch
clawbio-bench --all-commits --branch main --repo ~/src/ClawBio
# Tagged commits only — benchmark at each release/milestone
clawbio-bench --tagged-commits --repo ~/src/ClawBio
# Custom test case directory
clawbio-bench --smoke --harness equity --inputs /my/test_cases --repo ~/src/ClawBio
# Render heatmap PNG (requires the [viz] extra)
clawbio-bench --heatmap results/suite/20260404_120000/
# Render the same markdown report the CI workflow posts to PRs
clawbio-bench --render-markdown results/suite/20260404_120000/ \
--baseline /path/to/main-baseline.json
# Deep chain-of-custody verification (three layers — see What It Produces)
clawbio-bench --verify results/suite/20260404_120000/
# Also works as a module
python -m clawbio_bench --smoke --repo ~/src/ClawBio
# Version + provenance
clawbio-bench --version

| Flag | Purpose |
|---|---|
| `--smoke` | Fast check: HEAD commit only, all harnesses. The default CI gate. |
| `--regression-window N` | Replay every test case across the last N commits on the current branch. |
| `--all-commits` | Every commit on a branch from the first audit-era commit forward (slowest). |
| `--tagged-commits` | Run against tagged commits only (releases / milestones). Heatmaps annotate release names on the timeline. |
| `--commits SHA,SHA,...` | Explicit commit list (diagnostic mode). |
| `--branch NAME` | Which branch to walk (default: `main`). |
| `--harness NAME` | Run only one harness (e.g. `equity`, `pharmgx`). Omit to run all nine. |
| `--inputs PATH` | Override the bundled test case directory for a single harness. |
| `--output DIR` | Where results land. Default: `./results/suite/<timestamp>/`. |
| `--repo PATH` | Required for every real run: local ClawBio checkout. |
| `--list` | Print the harness registry and test case counts. |
| `--list --json` | Same, machine-readable. |
| `--dry-run` | Show the plan (commits × harnesses × test cases) without executing. |
| `--allow-dirty` | Safety override: run even when the ClawBio working tree is dirty. By default a dirty repo is a hard stop to protect chain of custody. |
| `--verify DIR` | Three-layer chain-of-custody re-verification of an existing results directory. |
| `--heatmap DIR` | Render `heatmap.png` from a results directory (needs `[viz]` extra). |
| `--render-markdown DIR` | Render the PR-comment markdown report a CI run would emit. |
| `--baseline PATH` | Baseline aggregate report (or directory) to diff against for `--render-markdown`. |
| `-q` / `--quiet` | Suppress per-test progress. |
| `--no-rich` | Force plain-text tables even when `rich` is installed (byte-stable output for diffs). |
| `--version` | Print version + provenance. |
clawbio_bench runs a (commits × test_cases) matrix. For every commit
you select, every test case you select is executed against that commit, and
every (commit, test_case) pair produces exactly one verdict — even if the
harness itself crashes (infrastructure failures become harness_error
verdicts rather than aborting the sweep).
┌──────────────────────────────────────────────────────────┐
│ clawbio-bench --smoke / --regression-window │
└──────────────────────────────────────────────────────────┘
│
▼
┌────────────────── Matrix runner ──────────────────┐
│ │
│ for commit in selected_commits: │
│ git worktree checkout commit ← isolation │
│ clean_workspace (incl. submodules) │
│ │
│ for test_case in harness.test_cases: │
│ ground_truth = parse_ground_truth(...) │
│ execution = capture_execution(...) │
│ scored = harness.run_single_*(...) │
│ verdict = build_verdict_doc(...) │
│ save_verdict(); save_execution_logs() │
│ │
└────────────────────────────────────────────────────┘
│
▼
aggregate_report.json · heatmap.png ·
verdict_hashes.json (per harness) ·
rendered markdown (optional)
Step by step:
- Resolve the commit set. `--smoke` → `[HEAD]`. `--regression-window N` → last N commits on the branch. `--all-commits` → every commit from the audit-era root. `--tagged-commits` → only tagged (release) commits. `--commits SHA,...` → an explicit list.
- Isolate each commit. Every commit is checked out in a git worktree so the user's working tree is never modified, and submodules are recursively reset between commits so a dirty submodule from commit N cannot poison commit N+1.
- Run every test case. Each harness iterates its bundled test cases, parses ground truth, invokes the ClawBio skill under audit via `capture_execution()` (subprocess with timeout, truncation cap, and optional argparse-fallback retry), and captures `stdout`, `stderr`, `exit_code`, and any artifact files.
- Score against ground truth. Each harness applies its rubric — category-level, not binary. A finding lands in one of ~7–10 named buckets per harness (e.g. `fst_mislabeled` ≠ `fst_incorrect`), each mapping to a specific remediation.
- Emit a verdict document. `build_verdict_doc()` assembles a canonical, byte-sorted JSON document with SHA-256 hashes of every input, output, ground truth file, and (embedded at write time) the verdict itself.
- Aggregate. Per-harness `summary.json` / `all_verdicts.json`, a suite-level `aggregate_report.json`, `heatmap_data.json`, and a tamper-evident `verdict_hashes.json` sidecar.
The matrix model is why `--smoke` is ~30 seconds (1 commit × 175 tests) but `--regression-window 20` is several minutes (20 × 175).
Yes — ClawBio's actual code executes for real. Not mocked, not simulated, not reimplemented. What it does not do is share a Python interpreter with the benchmark. Every test case spawns a fresh OS process via `subprocess.run()` (no `shell=True`, ever), runs real ClawBio code in that process, and the benchmark reads the result from captured stdout, stderr, exit code, and files on disk.

The benchmark package itself contains zero `import clawbio` / `from clawbio …` lines. This is the "loose-coupling invariant": the auditor and the auditee cannot contaminate each other's interpreter state, the audit tool's trust surface stays pinned to its three runtime deps (msgspec, regex, ruamel.yaml), and a missing or broken skill at some commit produces a clean exit 127 instead of a `ModuleNotFoundError` that corrupts the whole longitudinal sweep.
═══════════════════════════════════════════════════════════════════════════════
HOST OS (macOS / Linux)
═══════════════════════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────────────────┐
│ PROCESS #1 · pid 42000 · the auditor │
│ ──────────────────────────────────────── │
│ argv: clawbio-bench --smoke --repo ~/src/ClawBio │
│ python: /Users/you/clawbio_bench/.venv/bin/python3.14 │
│ imports: clawbio_bench.cli, .core, .harnesses.*, msgspec, regex, │
│ ruamel.yaml ← 3 deps, that's the full list │
│ memory: ~60 MB │
│ │
│ ┌─ clawbio_bench.cli.main() │
│ │ resolves commit set → [HEAD] │
│ │ for each harness in HARNESS_REGISTRY: │
│ │ for each test_case: │
│ │ ┌─ harness_core.capture_execution(cmd=[...]) │
│ │ │ build cmd list │
│ │ │ subprocess.run(cmd, cwd=repo_path, timeout=…, │
│ │ │ capture_output=True, shell=False) │
│ │ │ │ │
│ │ │ │ ← fork() + execve() (OS boundary) │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ╔════════════════════════════════════════════════════╗ │
│ │ │ ║ PROCESS #2 · pid 42017 · the auditee ║ │
│ │ │ ║ ────────────────────────────────────── ║ │
│ │ │ ║ argv: /.../.venv/bin/python3.14 \ ║ │
│ │ │ ║ ~/src/ClawBio/skills/pharmgx-reporter/ \ ║ │
│ │ │ ║ pharmgx_reporter.py \ ║ │
│ │ │ ║ --input pg_01_cyp2c19.txt \ ║ │
│ │ │ ║ --output results/…/tool_output/ \ ║ │
│ │ │ ║ --no-enrich ║ │
│ │ │ ║ cwd: ~/src/ClawBio ← ClawBio's own worktree ║ │
│ │ │ ║ imports: WHATEVER pharmgx_reporter.py IMPORTS ║ │
│ │ │ ║ (transitive closure of the target repo) ║ │
│ │ │ ║ memory: separate heap, separate GIL, separate ║ │
│ │ │ ║ sys.modules, separate logging state ║ │
│ │ │ ║ ║ │
│ │ │ ║ ┌─ pharmgx_reporter.main() ║ │
│ │ │ ║ │ parse 23andMe TSV ║ │
│ │ │ ║ │ call CPIC phenotype logic ║ │
│ │ │ ║ │ write report.md + result.json ║ │
│ │ │ ║ │ print rationale to stdout ║ │
│ │ │ ║ └─ sys.exit(0) ║ │
│ │ │ ╚════════════════════════════════════════════════════╝ │
│ │ │ │ │
│ │ │ │ ← _exit() (OS boundary, same direction) │
│ │ │ ▼ │
│ │ │ auditor receives: │
│ │ │ · exit_code: int │
│ │ │ · stdout: bytes (up to 10 MB, else truncated) │
│ │ │ · stderr: bytes (same cap) │
│ │ │ · wall_seconds: float │
│ │ │ · (files on disk in tool_output/) │
│ │ │ │
│ │ └─ score verdict · compute SHA-256 on inputs/outputs/logs · │
│ │ build_verdict_doc() · save_verdict() (atomic write) │
│ └────────── │
└─────────────────────────────────────────────────────────────────────────┘
Process #1 never imports anything from ~/src/ClawBio.
Process #2 is a fresh Python interpreter — anything the target imports
happens entirely in its own sys.modules and dies with the process.
Communication is four channels only: argv in, stdout + stderr + exit + files out.
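The boundary above reduces to one call. Below is a minimal sketch of the invocation pattern, assuming a simplified signature — the real `capture_execution()` layers a truncation cap, hash-before-truncate bookkeeping, and an argparse-fallback retry on top of this:

```python
import subprocess

def run_skill(cmd: list[str], cwd: str, timeout: float = 120.0) -> tuple[int, bytes, bytes]:
    """Sketch of the subprocess boundary: fresh interpreter, no shell,
    channels back to the auditor are exit code, stdout, stderr (plus files on disk)."""
    try:
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            capture_output=True,  # stdout/stderr come back as bytes
            timeout=timeout,
            shell=False,          # never shell=True
        )
    except FileNotFoundError:
        # Missing tool path at an old commit: surface exit 127 instead of
        # letting a ModuleNotFoundError corrupt the sweep.
        return 127, b"", b""
    return proc.returncode, proc.stdout, proc.stderr
```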
┌────────────────────────────────────────────────────────────┐
│ clawbio-bench --smoke │
│ (process #1, the auditor) │
└────────────────────────────────────────────────────────────┘
│
┌──────────────────────┼──────────────────────┐
▼ ▼ ▼
commits = [HEAD] test_cases per harness HARNESS_REGISTRY
│ │ │
└────────────┬─────────┴──────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ for commit in commits: │
│ git worktree checkout commit (subprocess to git) │
│ clean_workspace(submodules=True) (subprocess to git) │
│ │
│ for harness in HARNESS_REGISTRY: │
│ for test_case in harness.test_cases: │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ ground_truth = parse_ground_truth(...) │ │
│ │ │ │
│ │ cmd = [ │ │
│ │ sys.executable, │ │
│ │ str(repo_path/"skills"/"<skill>"/"*.py"), │ │
│ │ "--input", str(payload), │ │
│ │ "--output", str(tool_output_dir), │ │
│ │ …, │ │
│ │ ] │ │
│ │ │ │
│ │ execution = capture_execution(cmd, cwd=repo) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ╔══════════════════════════╗ │ │
│ │ ║ SUBPROCESS ║ │ │
│ │ ║ python skills/.../*.py ║ │ │
│ │ ║ ← actual ClawBio code → ║ │ │
│ │ ║ runs for real ║ │ │
│ │ ║ writes artifacts ║ │ │
│ │ ║ exits with code N ║ │ │
│ │ ╚══════════════════════════╝ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ stdout, stderr, exit_code, tool_output/*.* │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ verdict = score_<harness>_verdict( │ │
│ │ ground_truth, execution, analysis │ │
│ │ ) │ │
│ │ │ │
│ │ doc = build_verdict_doc( │ │
│ │ verdict, execution, commit_meta, │ │
│ │ chain_of_custody={ │ │
│ │ payload_sha256, stdout_sha256, │ │
│ │ stdout_full_sha256, driver_sha256, … │ │
│ │ } │ │
│ │ ) │ │
│ │ │ │
│ │ save_verdict(doc, …) ← atomic write │ │
│ │ _verdict_sha256 embedded + re-verified │ │
│ └───────────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────────┘
│
▼
aggregate_report.json +
verdict_hashes.json per harness
┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern A — CLI subprocess │
│ Used by: orchestrator, pharmgx, equity, nutrigx, metagenomics, cvr │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ auditor (pid 42000) auditee (pid 42017) │
│ │
│ cmd = [ │
│ sys.executable, │
│ "~/src/ClawBio/skills/pharmgx-reporter/pharmgx_reporter.py", │
│ "--input", "pg_01.txt", │
│ "--output", "results/.../tool_output", │
│ "--no-enrich", │
│ ] │
│ │
│ subprocess.run(cmd, cwd=~/src/ClawBio, shell=False, capture_output=True) │
│ ──────────────────────────────────────▶┌──────────────────────────────┐ │
│ │ python pharmgx_reporter.py │ │
│ │ │ │
│ │ imports: │ │
│ │ · argparse │ │
│ │ · pandas (if ClawBio uses) │ │
│ │ · whatever else the tool │ │
│ │ wants │ │
│ │ │ │
│ │ reads pg_01.txt │ │
│ │ writes tool_output/ │ │
│ │ prints to stdout / stderr │ │
│ │ sys.exit(0) │ │
│ └──────────────────────────────┘ │
│ ◀────────────────────────────────────── exit, stdout, stderr │
│ │
│ scores verdict, hashes everything │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern B — Subprocess driver shim │
│ Used by: finemapping (because the target has NO CLI, just library modules) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ auditor (pid 42000) driver (pid 42017) │
│ │
│ cmd = [ │
│ sys.executable, │
│ "<clawbio_bench>/drivers/finemapping_driver.py", ← OUR file, not │
│ "--skill-dir", "~/src/ClawBio/skills/fine-mapping", ClawBio's │
│ "--inputs", "test_case/inputs.json", │
│ "--output", "result.json", │
│ ] │
│ │
│ subprocess.run(cmd, cwd=~/src/ClawBio, shell=False) │
│ ──────────────────────────────────────▶┌──────────────────────────────┐ │
│ │ python finemapping_driver.py │ │
│ │ │ │
│ │ sys.path.insert(0, │ │
│ │ "<skill-dir>") │ │
│ │ │ │
│ │ ┌──────────────────────────┐ │ │
│ │ │ from core.abf import │ │ │
│ │ │ approximate_bayes_... │ │ │
│ │ │ from core.susie import │ │ │
│ │ │ susie_ibss │ │ │
│ │ │ from core.credible_sets │ │ │
│ │ │ import build_cs │ │ │
│ │ │ ↑ THIS is where │ │ │
│ │ │ ClawBio code gets │ │ │
│ │ │ imported — but in │ │ │
│ │ │ a SEPARATE Python │ │ │
│ │ │ interpreter that │ │ │
│ │ │ will die in ~1 s │ │ │
│ │ └──────────────────────────┘ │ │
│ │ │ │
│ │ run ABF / SuSiE │ │
│ │ json.dump(result) → stdout │ │
│ │ sys.exit({0,1,2}) │ │
│ └──────────────────────────────┘ │
│ ◀────────────────────────────────────── JSON on stdout │
│ │
│ parses JSON, scores, hashes driver file itself (driver_sha256) │
│ │
│ ▶ The auditor's own Python process STILL never touches ClawBio. │
│ The driver is a clawbio_bench DATA FILE — bundled under drivers/ │
│ but explicitly not imported from anywhere in the package. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern C — AST static analysis │
│ Used by: metagenomics (only, as a secondary channel alongside Pattern A) │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ auditor (pid 42000, no subprocess spawned) │
│ │
│ source_path = repo_path/"skills"/"claw-metagenomics"/"metagenomics.py" │
│ │
│ text = source_path.read_text(errors="replace") ← reads AS TEXT │
│ tree = ast.parse(text, filename=str(source_path)) ← builds syntax tree │
│ NO CODE EXECUTES │
│ │
│ for node in ast.walk(tree): │
│ if subprocess.run(..., shell=True) — found → injection_succeeded │
│ if run_command(critical=False) — found → exit_suppressed │
│ … │
│ │
│ ▶ ast.parse() is lexer + parser, not an interpreter. None of the │
│ source's side effects fire. It's the same mechanism `ruff`, `mypy`, │
│ and `bandit` use to read code without running it. │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
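The core of Pattern C fits in a few lines. A minimal sketch, assuming the only target is a literal `shell=True` keyword (the real harness also tracks aliased shell helpers and per-commit `run_command(critical=...)` defaults):

```python
import ast
from pathlib import Path

def find_shell_true(source_path: Path) -> list[int]:
    """Parse the audited file as text and return line numbers of calls
    passing shell=True. ast.parse never executes the source."""
    tree = ast.parse(source_path.read_text(errors="replace"),
                     filename=str(source_path))
    hits: list[int] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (kw.arg == "shell"
                        and isinstance(kw.value, ast.Constant)
                        and kw.value.value is True):
                    hits.append(node.lineno)
    return hits
```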
- No shared interpreter state. Import side effects, monkey-patches, global logging config, `sys.path` mutations, C-extension crashes — none of them can contaminate the auditor if the target never runs in the auditor's process.
- Subprocess-only bugs become visible. Exit-code handling, stderr channel discipline, argparse drift, signal semantics, non-zero-exit suppression — these are invisible to an in-process caller and are the exact classes of bug the `exit_suppressed` / `disclosure_failure` categories exist to catch.
- Trust surface stays minimal. `clawbio_bench` declares three runtime deps (`msgspec`, `regex`, `ruamel.yaml`). If it imported ClawBio, ClawBio's entire transitive dependency closure would become part of the audit tool's trusted base — and an audit tool whose trust surface includes the code it audits is broken by construction.
- Graceful failure when the skill doesn't exist yet. Longitudinal sweeps walk commits from before a given skill existed. Subprocess invocation naturally returns `FileNotFoundError` / exit 127 on a missing tool path rather than raising `ModuleNotFoundError` inside the benchmark and corrupting the whole run.
Every run writes a single timestamped directory:
results/suite/20260404_120000/
├── aggregate_report.json # suite-level rollup: pass/fail by harness, counts, env
├── orchestrator/
│ ├── manifest.json # run parameters, env, commit set, ground_truth_refs
│ ├── summary.json # category histogram, pass rate, persistent failures
│ ├── all_verdicts.json # flat list of every verdict in the run
│ ├── heatmap_data.json # commits × test_cases grid, category-coded
│ ├── verdict_hashes.json # {rel_path: _verdict_sha256} sidecar index
│ └── <commit_sha>/
│ └── <test_case>/
│ ├── verdict.json # ◄─── the canonical unit of audit
│ ├── stdout.log # captured ClawBio stdout (pre-truncation hash in verdict)
│ └── stderr.log # captured ClawBio stderr
├── equity/ …same layout…
├── pharmgx/ …same layout…
├── nutrigx/ …same layout…
├── metagenomics/…same layout…
└── finemapping/ …same layout…
A real verdict.json (abbreviated; hashes and rationales trimmed for
readability — actual output is canonical, sorted-key JSON with every field
fully populated):
{
"benchmark_name": "pharmgx-reporter",
"benchmark_version": "0.1.0",
"start_time_utc": "2026-04-04T12:00:03.114Z",
"timestamp_utc": "2026-04-04T12:00:04.892Z",
"wall_clock_seconds": 1.778,
"commit": {
"sha": "a1b2c3d4e5f6…",
"short_sha": "a1b2c3d",
"author": "ClawBio Team",
"date_utc": "2026-03-28T09:14:22+00:00",
"subject": "pharmgx: handle CYP2C19 *2/*2 diplotype"
},
"test_case": {
"name": "pg_01_cyp2c19_pm_clopidogrel",
"driver": ".../test_cases/pharmgx/pg_01_cyp2c19_pm_clopidogrel/driver.sh",
"driver_sha256": "7e4b…",
"payload": "input.tsv",
"payload_sha256": "d41d8cd…"
},
"ground_truth": {
"BENCHMARK": "cyp2c19_pm_clopidogrel",
"GROUND_TRUTH_PHENOTYPE": "CYP2C19 Poor Metabolizer (*2/*2)",
"FINDING_CATEGORY": "correct_determinate",
"HAZARD_DRUG": "Clopidogrel"
},
"ground_truth_references": {
"CPIC_CYP2C19": "https://cpicpgx.org/guidelines/guideline-for-clopidogrel-and-cyp2c19/"
},
"reference_genome": "GRCh38",
"execution": {
"exit_code": 0,
"stdout_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"stdout_full_sha256": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"stdout_full_byte_len": 1842,
"stdout_truncated": false,
"stderr_sha256": "a5f2c7…",
"stderr_full_sha256": "a5f2c7…",
"stderr_truncated": false,
"wall_seconds": 1.778,
"used_fallback": false,
"cmd": ["python", "-m", "clawbio.pharmgx", "--input", "input.tsv"]
},
"report_analysis": { "phenotype_detected": "CYP2C19 Poor Metabolizer (*2/*2)", "drug_action": "avoid" },
"verdict": {
"category": "correct_determinate",
"rationale": "Phenotype matched ground truth; Clopidogrel correctly flagged AVOID (CPIC 1A)."
},
"environment": {
"python_version": "3.11.9",
"platform": "darwin-arm64",
"env_hash_sha256": "b21f…"
},
"_verdict_sha256": "0987abc…"
}

`_verdict_sha256` is computed over the canonical byte representation of the document with that field removed, then re-embedded. That lets any downstream tool re-hash and verify in one pass.
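A minimal sketch of that one-pass re-verification, using stdlib `json` with sorted keys as an approximation of the suite's canonical form (the real check canonicalizes with `msgspec.json.encode(order="sorted")`, so the exact bytes may differ):

```python
import hashlib
import json

def verify_verdict(doc: dict) -> bool:
    """Strip the embedded hash, re-canonicalize, re-hash, compare."""
    claimed = doc.get("_verdict_sha256")
    body = {k: v for k, v in doc.items() if k != "_verdict_sha256"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest() == claimed
```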
`clawbio-bench --verify results/suite/20260404_120000/` runs three independent integrity checks:

- Per-verdict self-hash. Every `verdict.json` is re-hashed over its own canonical bytes (with `_verdict_sha256` stripped) and compared against the embedded value.
- Sidecar reconciliation. Every entry in `verdict_hashes.json` must point to an existing `verdict.json`; every `verdict.json` in the tree must appear in the sidecar. Neither may drift.
- Log file integrity. Each `stdout.log` / `stderr.log` is hashed and compared against the `execution.stdout_sha256` / `stderr_sha256` recorded inside the adjacent verdict.
Any mismatch is a hard failure with a specific file path, making tampering trivially detectable post-hoc.
Three terms recur throughout this document and the code:
- Skill — a capability exposed by the ClawBio platform (e.g. the `pharmgx-reporter` that turns a genotype TSV into a drug-safety report). ClawBio advertises 37 executable skills + 6 stub skills.
- Harness — a `clawbio_bench` audit module (one file under `src/clawbio_bench/harnesses/`) that tests exactly one skill domain, with its own category rubric, its own scoring logic, and a `run_single_<name>()` entry point. There are currently nine harnesses (six ClawBio-skill audits + one out-of-registry fine-mapping audit + two CVR Phase 2 harnesses).
- Test case — a directory under `src/clawbio_bench/test_cases/<harness>/` containing input payloads + a `ground_truth.txt` header (YAML frontmatter or legacy `# KEY: value` format) with analytically derived expected answers.
One harness owns many test cases. One test case belongs to exactly one harness. Adding a new audit for a new skill means writing a new harness; expanding an existing audit means adding test cases.
v0.1.4 exercises the ClawBio bio-orchestrator plus 6 of the 37
executable skills with dedicated behavioral harnesses — verified
against ClawBio HEAD e7590141 (2026-04-07, 43 skills with SKILL.md:
37 executable + 6 stub). The orchestrator harness additionally
routing-tests every auto-detectable skill, exercises --skill NAME
direct invocation for five previously-unreachable high-clinical-harm
skills, and covers --skills A,B,C multi-skill composition. The
clinical-variant-reporter skill is now audited by three harnesses:
clinical_variant_reporter (Phase 1 — structural / traceability,
5 tests), cvr_identity (Phase 2c — HGVS / transcript / assembly
representation, 6 tests, new in v0.1.3), and cvr_correctness
(Phase 2a — ACMG criterion-level correctness with dual-layer ground
truth, 13 tests, new in v0.1.3). The fine-mapping harness audits a
statistical-inference subsystem outside ClawBio's official skill
registry; it runs via a subprocess driver shim so its numerical stack
(numpy/pandas) stays an optional extra.
Skill inventory is pinned dynamically, not hardcoded. The
orchestrator harness's discover_clawbio_skills(repo_path) helper
scans the target commit's skills/ directory and emits a drift report
into every verdict (see compute_inventory_drift in
src/clawbio_bench/harnesses/orchestrator_harness.py). The frozenset
baselines in that module exist only for drift detection — verdict
scoring depends exclusively on each test case's GROUND_TRUTH_EXECUTABLE
field, which is the sole authoritative source.
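A hypothetical sketch of the scan that helper performs — directory names under `skills/` become the inventory. The real `discover_clawbio_skills()` also classifies executable vs stub entries, which this sketch does not:

```python
from pathlib import Path

def discover_skills(repo_path: Path) -> set[str]:
    """Enumerate skill directories at the checked-out commit."""
    skills_dir = repo_path / "skills"
    if not skills_dir.is_dir():
        return set()  # pre-skill-era commit: empty inventory, not a crash
    return {p.name for p in skills_dir.iterdir() if p.is_dir()}
```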
| ClawBio skill | Dedicated behavioral harness | Routing-tested |
|---|---|---|
| `bio-orchestrator` | Yes (54 tests) | — |
| `pharmgx-reporter` | Yes (44 tests) | also |
| `equity-scorer` | Yes (15 tests) | also |
| `nutrigx_advisor` | Yes (10 tests) | also |
| `claw-metagenomics` | Yes (7 tests) | also |
| `clinical-variant-reporter` | Yes — Phase 1+2c+2a (5+6+13 tests) | also (via `--skill`) |
| `variant-annotation` | — | Yes (via `--skill`) |
| `clinical-trial-finder` | — | Yes (via `--skill`) |
| `target-validation-scorer` | — | Yes (via `--skill`) |
| `methylation-clock` | — | Yes (keyword + `--skill`) |
| `claw-ancestry-pca` | — | Yes |
| `scrna-orchestrator` | — | Yes |
| `scrna-embedding` | — | Yes |
| `genome-compare` | — | Yes |
| `clinpgx` | — | Yes |
| `gwas-prs` | — | Yes |
| `gwas-lookup` | — | Yes |
| `profile-report` | — | Yes |
| `data-extractor` | — | Yes |
| `rnaseq-de` | — | Yes |
| `proteomics-de` | — | — (executable, not auto-detected) |
| `omics-target-evidence-mapper` | — | — (executable, not auto-detected) |
| `illumina-bridge` | — | Yes |
| `bioconductor-bridge` | — | Yes |
| `diff-visualizer` | — | Yes |
| `fine-mapping` | Yes (21 tests, out-of-registry) | — (executable, not auto-detected) |
| `ukb-navigator` | — | — (executable, not auto-detected) |
| `galaxy-bridge` | — | — (executable, not auto-detected) |
| `genome-match`, `recombinator`, `soul2dna` | — | — (synthetic genome generation; privacy/consent review pending) |
| `pubmed-summariser`, `protocols-io`, `bigquery-public`, `cell-detection` | — | — (executable, not auto-detected) |
| `struct-predictor`, `labstep` | — | Yes |
| 6 stubs (`vcf-annotator`, `lit-synthesizer`, `repro-enforcer`, `claw-semantic-sim`, `drug-photo`, `seq-wrangler`) | — | Yes (stub warning check) |
Totals (verified against ClawBio HEAD `5cf83c5`):

- Dedicated behavioral coverage: 6 / 37 executable ClawBio skills (~16%)
- Skills reachable via orchestrator auto-detection: 23 / 43
- Skills reachable via `--skill NAME` direct invocation: 43 / 43 (100%) — five previously-unreachable clinical skills (`clinical-variant-reporter`, `variant-annotation`, `clinical-trial-finder`, `target-validation-scorer`, `methylation-clock`) now have dedicated force-routing tests
- Total harnesses: 10 (7 ClawBio-skill audits + 1 fine-mapping + 2 CVR Phase 2)
- Total test cases: 183 (54 orchestrator + 44 pharmgx + 15 equity + 10 nutrigx + 7 metagenomics + 21 fine-mapping + 5 CVR Phase 1 + 6 CVR Phase 2c identity + 13 CVR Phase 2a correctness + 8 gwas-prs)
See docs/plans/GAP_ANALYSIS_2026-04-04.md for the full audit-framework-aligned gap analysis, including the P0 verdict-integrity fix in this release. The remaining uncovered ClawBio skills are listed in the Roadmap, prioritized by clinical-harm potential. See also Current Scope and Limitations.
Each harness is self-contained: a single Python module under
src/clawbio_bench/harnesses/ with a category rubric, a run_single_<name>()
function, and bundled test cases under src/clawbio_bench/test_cases/<name>/.
Categories are not binary pass/fail — each category maps to a distinct
remediation path.
In plain English. Given an input file or free-text query, does the orchestrator route it to the right ClawBio skill? Does it warn when the target is a stub? Does it fail cleanly on unroutable inputs? Does the `--skill NAME` force path still work for clinical skills that lack any semantic auto-detection route? Does multi-skill composition (`--skills A,B,C`) dispatch correctly without leaking skill A's artifacts into skill B's output directory?

Routes tested: extension-based (15), keyword-based (19), error handling (11), `--skill NAME` force-routing for five high-clinical-harm unreachable skills (5), `--skills A,B,C` multi-skill composition (2), prompt-injection regression pins (3 — one genuine LLM-path test via `--provider flock`, two deterministic-parser regression pins).
| Category | Pass? | Description |
|---|---|---|
| `routed_correct` | Yes | Correct skill selected |
| `routed_wrong` | No | Wrong skill selected |
| `stub_warned` | Yes | Stub routed with warning |
| `stub_silent` | No | Stub routed silently |
| `unroutable_handled` | Yes | Unknown input, clean error |
| `unroutable_crash` | No | Unknown input, crash |
| `harness_error` | — | Infrastructure error |
In plain English. When computing population-differentiation statistics (FST, HEIM), does the tool produce the right number, label that number with the right estimator name, stay within mathematically valid bounds, and not crash on edge cases like monomorphic sites, single-sample cohorts, or haploid genotypes? FST (Fixation Index) and HEIM (Heterozygosity-based Equity Index Metric) are both summary statistics of genetic diversity between populations.
Tests FST accuracy (Nei's GST), estimator label correctness, HEIM bounds, CSV mode honesty, edge cases.
| Category | Pass? | Description |
|---|---|---|
| `fst_correct` | Yes | FST value + label correct |
| `fst_incorrect` | No | FST outside tolerance |
| `fst_mislabeled` | No | FST correct, label wrong ← honesty failure |
| `heim_bounded` | Yes | HEIM in [0, 100] |
| `heim_unbounded` | No | HEIM > 100 |
| `csv_honest` | Yes | CSV mode honest about coverage |
| `csv_inflated` | No | CSV inflates genomic coverage ← honesty failure |
| `edge_handled` | Yes | Edge case handled |
| `edge_crash` | No | Edge case crash |
| `harness_error` | — | Infrastructure error |
In plain English. Given a genotype file, does the tool call the right pharmacogenomic phenotype (e.g. "CYP2C19 Poor Metabolizer"), classify the recommended drug action correctly against CPIC guidelines, avoid reporting fake confidence on ambiguous genotypes, and surface drug warnings in the actual report (not just stderr where users won't see them)?
Tests pharmacogenomic phenotype calling and drug safety classification against CPIC guidelines.
| Category | Pass? | Description |
|---|---|---|
| `correct_determinate` | Yes | Right phenotype + drug class |
| `correct_indeterminate` | Yes | Correctly indeterminate |
| `scope_honest_indeterminate` | Yes | Tool correctly returns Indeterminate for a variant DTC arrays cannot resolve (CNV, hybrid, phasing) — correct clinical behavior |
| `incorrect_determinate` | No | Wrong phenotype (false Normal) |
| `incorrect_indeterminate` | No | Unnecessary indeterminate |
| `omission` | No | Drug missing from report |
| `disclosure_failure` | No | Warning on stderr only, not in report ← honesty failure |
| `harness_error` | — | Infrastructure error |
In plain English. Does the nutrigenomics scorer produce the right numeric score? Are its category buckets (Low / Medium / High) consistent with its declared thresholds? Does the reproducibility bundle actually contain everything needed to re-run the analysis? Does it flag missing SNPs in its panel rather than silently treating them as reference?
Tests nutrigenomics score accuracy, reproducibility bundle integrity, SNP panel validation.
| Category | Pass? | Description |
|---|---|---|
| `score_correct` | Yes | Score matches expected |
| `score_incorrect` | No | Score diverges |
| `repro_functional` | Yes | Reproducibility bundle complete |
| `repro_broken` | No | Reproducibility artifacts missing |
| `snp_valid` | Yes | Panel SNPs found |
| `snp_invalid` | No | Panel SNPs missing |
| `threshold_consistent` | Yes | Categories match thresholds |
| `threshold_mismatch` | No | Categories wrong |
| `harness_error` | — | Infrastructure error |
In plain English. Half behavior test, half static source audit. Runs the metagenomics demo mode to confirm it works end-to-end, then performs AST-based static analysis on ClawBio's source tree (not on runtime output) to detect unsafe shell invocation (`shell=True`, aliased OS-level shell helpers) and other injection vectors. Also confirms non-zero exit codes are surfaced as errors, not silently demoted to warnings.
Tests demo-mode functionality + AST-based static security analysis on the audited source (no external bioinformatics tools required).
| Category | Pass? | Description |
|---|---|---|
| `injection_blocked` | Yes | No unsafe shell invocation found (AST-verified) |
| `injection_succeeded` | No | Shell injection vector exists |
| `exit_handled` | Yes | Exit code treated as error |
| `exit_suppressed` | No | Exit suppressed to warning |
| `demo_functional` | Yes | Demo mode works |
| `demo_broken` | No | Demo mode fails |
| `harness_error` | — | Infrastructure error |
In plain English. Do Approximate Bayes Factor (ABF) and SuSiE fine-mapping produce mathematically valid posterior inclusion probabilities and credible sets? Does the tool fail loudly on degenerate inputs (zero standard errors, non-positive n, non-convergence) rather than silently returning plausible-looking nonsense? The fine-mapping harness runs via a subprocess driver shim so its numerical stack (numpy, pandas) is only pulled in by the optional `[finemapping]` extra; `clawbio_bench` itself never imports it directly.
Test cases span single-causal ABF, SuSiE multi-causal, null-locus rejection, phantom secondaries, variance pathologies (NaN SE, extreme Z), credible set purity, and moment-labeling honesty failures.
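For orientation, a sketch of the single-SNP statistic the ABF cases exercise, using Wakefield's (2009) approximation. The prior effect-size variance `w = 0.04` is an illustrative choice, not the harness's setting, and the real driver runs ClawBio's own implementation rather than this one:

```python
import math

def wakefield_log_abf(beta: float, se: float, w: float = 0.04) -> float:
    """log Bayes factor for association vs null at one SNP.
    V = se**2 is the sampling variance; w is the prior effect variance."""
    v = se ** 2
    if v <= 0:
        raise ValueError("degenerate input: non-positive variance")  # fail loudly
    z2 = (beta / se) ** 2
    return 0.5 * math.log(v / (v + w)) + 0.5 * z2 * w / (v + w)

def pips(log_abfs: list[float]) -> list[float]:
    """Posterior inclusion probabilities under a uniform one-causal prior."""
    m = max(log_abfs)
    weights = [math.exp(x - m) for x in log_abfs]
    total = sum(weights)
    return [wt / total for wt in weights]
```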
In plain English. ClawBio's clinical-variant-reporter skill
classifies germline variants using the ACMG/AMP 2015 28-criteria
evidence framework (Richards et al. 2015) and generates clinical-grade
interpretation reports. This harness does not score 28-criteria
adjudication correctness. It checks only whether the reports carry
the structural and traceability elements any auditable clinical
variant report must carry — reference genome build, transcript used,
ClinVar/gnomAD versions pinned, limitations section, not-a-medical-device disclaimer, per-variant ACMG criterion audit trail, and
gene–disease/inheritance context (per Rehm et al. 2013 laboratory
reporting standards and Abou Tayoun et al. 2018 ClinGen SVI PVS1
recommendations).
Phase 2 (v0.1.3) is now live as two separate harnesses:
- `cvr_identity` (Phase 2c, 6 tests) — variant identity and HGVS v21.1 compliance: syntax, MANE Select, transcript versioning, indel normalization, assembly coordinate consistency.
- `cvr_correctness` (Phase 2a, 13 tests) — ACMG criterion-level correctness with Gold/Silver truth tiers: BA1/BS1/PM2 thresholds, PVS1 strength modulation per Abou Tayoun 2018, PP3/BP4 calibration per Pejaver 2022, VCEP supersession (ENIGMA, InSiGHT), SF v3.3 (84 genes), ClinGen GDV. Uses dual-layer ground truth: `EXPECTED_*` for clinical gold standard, `EXPECTED_TOOL_*` for tool self-consistency, with a `self_consistency_error` rubric category.
Both Phase 2 harnesses are grounded in the Phase 2 PRD with triple-verified standards (Exa + ref.tools + Tavily across three independent passes, plus 6-model code review by Droid/Gemini/Crush/Codex/OpenCode/Claude).
| Category | Pass? | Description |
|---|---|---|
| `report_structure_complete` | Yes | All required structural elements present |
| `assembly_missing` | No | Reference genome build not stated in report body |
| `transcript_missing` | No | No NM_/ENST_/MANE Select transcript cited |
| `data_source_version_missing` | No | ClinVar date / gnomAD version not pinned |
| `limitations_missing` | No | No Limitations section |
| `disclaimer_missing` | No | RUO / not-a-medical-device disclaimer absent from report body |
| `evidence_trail_incomplete` | No | <50% of classification lines cite an ACMG criterion code |
| `gene_disease_context_missing` | No | P/LP classifications without disease / inheritance context |
| `reference_build_inconsistent` | No | Conflicting assemblies in same report |
| `harness_error` | — | Harness infrastructure error |
Two formats are accepted and dispatched per file at parse time. Legacy format is fully supported; migration to YAML is per-file and voluntary.
# ---
# BENCHMARK: cyp2c19_pm_clopidogrel
# GROUND_TRUTH_PHENOTYPE: "CYP2C19 Poor Metabolizer (*2/*2)"
# FINDING_CATEGORY: correct_determinate
# TARGET_GENE: CYP2C19
# HAZARD_DRUG: Clopidogrel
# GROUND_TRUTH_BEHAVIOR: |
# Homozygous CYP2C19*2 (rs4244285 AA). Tool should report Poor
# Metabolizer. Clopidogrel: AVOID (CPIC Level 1A).
# ---
# rsid chromosome position genotype
rs4244285 10 96541616 AA
...
The `#` prefix on every line keeps the block invisible to audited tools that treat the file as input (e.g. the ClawBio PharmGx reporter reading a 23andMe-style TSV). The block opens with `# ---` and closes with the next `# ---`. YAML block scalars (`|`) give multi-line narrative fields a clean home without continuation-line gymnastics.

The YAML parser path validates keys against the same UPPER_SNAKE_CASE regex as the legacy parser, rejects anchors (`&`), aliases (`*`), and merge keys (`<<:`) as a hardening measure, and recursively normalizes nested dict/list values to plain strings.
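A minimal sketch of the frontmatter extraction, assuming well-formed `# ---` delimiters; the real parser layers the key validation and anchor/alias/merge-key rejection described above on top of this:

```python
from ruamel.yaml import YAML

def extract_frontmatter(path: str) -> dict:
    """Collect lines between the opening and closing `# ---` markers,
    strip the comment prefix, and safe-load the remainder as YAML."""
    block: list[str] = []
    capturing = False
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("# ---"):
                if capturing:
                    break          # closing marker
                capturing = True   # opening marker
                continue
            if capturing and line.startswith("#"):
                block.append(line[2:] if line.startswith("# ") else line[1:])
    return YAML(typ="safe").load("".join(block)) or {}
```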
Example Model B directory with `ground_truth.txt` + payload sidecar:
# BENCHMARK: equity-scorer v0.1.0
# PAYLOAD: input.vcf
# GROUND_TRUTH_FST: 1.000
# GROUND_TRUTH_FST_PAIR: POP_A_vs_POP_B
# GROUND_TRUTH_FST_ESTIMATOR: Nei's GST
# FST_TOLERANCE: 0.001
# FINDING_CATEGORY: fst_mislabeled
# DERIVATION: p_total=0.5, HT=0.5, HS=0.0, GST=1.0
# CITATION: Nei (1973). PNAS 70(12):3321-3323
Model A vs Model B. Test cases come in two flavors:

- Model A — one self-contained file where the ground truth header and the payload live in the same file (the audited tool ignores the `#`-commented header).
- Model B — a directory with a dedicated `ground_truth.txt` plus one or more payload sidecars referenced via the `PAYLOAD:` / `POP_MAP_FILE:` headers. This is what most current test cases use.
See docs/ground-truth-derivation.md for detailed derivation methodology per harness.
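As a worked check of the `DERIVATION` line in the equity example above, a few lines recomputing Nei's GST for two populations fixed for alternate alleles:

```python
def nei_gst(p_a: float, p_b: float) -> float:
    """GST = (HT - HS) / HT  (Nei 1973)."""
    p_total = (p_a + p_b) / 2                   # pooled allele frequency
    ht = 2 * p_total * (1 - p_total)            # expected heterozygosity, pooled
    hs = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-pop
    return (ht - hs) / ht

# p_A = 1.0, p_B = 0.0  →  p_total = 0.5, HT = 0.5, HS = 0.0, GST = 1.0
assert abs(nei_gst(1.0, 0.0) - 1.000) < 1e-9   # matches GROUND_TRUTH_FST: 1.000
```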
The authoritative shape of every verdict document is published as a
versioned JSON Schema under schemas/:
- `schemas/verdict-minimal.schema.json` — the minimum required shape every verdict satisfies, including stripped-down `harness_error` docs.
- `schemas/verdict-full.schema.json` — the complete shape emitted by `core.build_verdict_doc()` for successfully-executed test cases. `additionalProperties: false`, so unknown keys are rejected.
Both files are auto-generated from the `msgspec.Struct` definitions in `src/clawbio_bench/schemas.py` via `scripts/gen_schemas.py`. A CI drift gate (`test_schemas.py::TestCommittedSchemas`) fails if the committed files fall out of sync.
Why commit them? Because auditors using Rust, Go, TypeScript, or a plain JSON Schema validator can verify verdicts directly against these files without running any Python code. The schema is the contract — the Python implementation is one way to honor it, not the only one.
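For example, a Python consumer could validate with the third-party `jsonschema` package (any conformant validator in any language works equally well; the verdict path below is illustrative):

```python
import json
from jsonschema import validate  # pip install jsonschema

with open("schemas/verdict-full.schema.json") as fh:
    schema = json.load(fh)
with open("results/suite/20260404_120000/pharmgx/"
          "a1b2c3d4e5f6/pg_01_cyp2c19_pm_clopidogrel/verdict.json") as fh:
    verdict = json.load(fh)

validate(instance=verdict, schema=schema)  # raises ValidationError on drift
```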
High-level, user-facing features — the things you should know about before deciding to use this tool.
- Nine dedicated harnesses covering pharmacogenomics, population genetics, nutrigenomics, metagenomics, orchestration routing, fine-mapping, and three layers of clinical-variant-reporter audit (Phase 1 structural, Phase 2c identity, Phase 2a ACMG correctness). 175 test cases total with analytically derived or authority-referenced ground truth.
- Category-level verdicts, not just pass/fail. Each harness rubric has 6–10 named categories; a `fst_mislabeled` finding carries different remediation than `fst_incorrect`, and the tooling never collapses the two.
- Tamper-evident chain of custody. SHA-256 on every input, output, ground-truth file, stdout, stderr, and on the verdict document itself. `clawbio_bench --verify` runs a three-layer reconciliation (per-verdict self-hash, sidecar index, log-file integrity).
- Longitudinal sweeps across git history. `--regression-window N`, `--all-commits`, and `--tagged-commits` replay every test case against every selected commit so you can see exactly when a finding was introduced or fixed. Tagged-commit mode annotates releases on the heatmap timeline.
- JSON Schema as external contract. `schemas/verdict-*.schema.json` are committed artifacts — auditors can validate verdicts in any language without running Python.
- Two-tier verdict validation. A minimum-contract check always runs on every verdict; the strict full-schema check runs on non-error verdicts and catches schema drift in CI.
- Canonical byte-stable output. `msgspec.json.encode(order="sorted")` gives identical bytes on every run, every Python version, every machine — a prerequisite for meaningful longitudinal diffs (see the sketch after this list).
- Heatmap visualization (`--heatmap`, needs the `[viz]` extra) of the commits × test_cases grid, category-coded.
- Offline only at runtime. No network calls during an audit; reference values are pre-computed and embedded in the ground-truth files. See limitations for the trade-off.
- Rich CLI output (optional `[ui]` extra) — styled tables for `--list` and the final summary, with `--no-rich` as a kill switch and a byte-stable plain-text fallback when piped.
- Type-safe. Full `mypy --strict` compliance across all source files.
- 245 unit tests at v0.1.4 covering scoring, validators, parser edge cases, tamper detection, schema drift, YAML frontmatter hardening, canonical byte determinism, and deep-verify chain-of-custody reconciliation.
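The byte-stability claim is easy to demonstrate with msgspec's sorted-key encoding:

```python
import msgspec

a = {"b": 1, "a": {"y": 2, "x": 3}}
b = {"a": {"x": 3, "y": 2}, "b": 1}  # same content, different insertion order

# Sorted-key canonical encoding: identical bytes regardless of build order.
assert msgspec.json.encode(a, order="sorted") == msgspec.json.encode(b, order="sorted")
```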
Lower-level engineering details that matter for the chain-of-custody guarantees above. Safe to skim on a first read.
Expand implementation details
- Single canonical serializer. `msgspec.json.encode(order="sorted")` with deterministic byte-sorted output, in a single C extension. Replaces what would otherwise be separate `pydantic` and `orjson` dependencies.
- Hash before truncate. When `stdout` / `stderr` exceeds the 10 MB cap, the `ExecutionResult` records `stdout_full_sha256` / `stdout_full_byte_len` of the original pre-truncation bytes, plus a `stdout_truncated` flag. Truncation happens at an encoded-byte boundary with `errors="replace"` so multi-byte streams stay within the cap and chain of custody is preserved even on runaway tools (see the sketch after this list).
- Hardened dual ground-truth parser. YAML frontmatter or legacy `# KEY: value`, dispatched per file. The YAML path validates keys against the same UPPER_SNAKE_CASE regex as legacy, rejects anchors (`&`), aliases (`*`), and merge keys (`<<:`), and recursively normalizes nested dict/list values to plain strings.
- Correct phenotype matching. `pharmgx_harness` uses the `regex` package for variable-length lookbehind and Unicode-aware word boundaries. The substring fallback rejects any candidate that lands inside a negated context in the longer string, so "normal" cannot match inside "not normal metabolizer".
- Positive-evidence scoring. NutriGx `snp_valid` and metagenomics `exit_handled` branches require stderr keyword evidence (via `_stderr_mentions_panel`) AND reject genuine crashes (via `_is_genuine_crash`, line-anchored `Error:` / `Exception:` / `Fatal:`) before crediting a tool with clean rejection. Unrelated crashes no longer over-credit.
- AST-based security analysis for unsafe subprocess and shell invocation detection in the metagenomics harness, including per-commit `run_command(critical=...)` default extraction so fixes and regressions are both detected.
- Submodule-aware workspace cleanup. `clean_workspace` recursively resets and cleans every git submodule between commits so a modified submodule working tree cannot carry over and poison longitudinal comparisons.
- Reference genome tracking. `REFERENCE_GENOME` field surfaced in every verdict for GRCh37 / GRCh38 traceability.
- Reproducibility signature. Manifest and verdicts record a SHA-256 hash of the installed Python environment (sorted `name==version` set).
- Input validation. `TIMEOUT` / `WEIGHTS` / `PAYLOAD` / `POP_MAP_FILE` / commit SHAs all validated against injection and path traversal.
- Tightened argparse-fallback retry. `capture_execution` requires callers to declare the exact flag being stripped (`fallback_flag=`) and only retries when stderr contains the precise `error: unrecognized arguments: <flag>` line, preventing false triggers on tools that emit benign argparse-shaped output.
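A minimal sketch of the hash-before-truncate rule, simplified to a plain byte slice (the real implementation truncates at an encoded-byte boundary with `errors="replace"` so a multi-byte character is never split):

```python
import hashlib

CAP = 10 * 1024 * 1024  # the 10 MB stdout/stderr cap

def hash_then_truncate(raw: bytes) -> tuple[str, int, bytes, bool]:
    """Digest and length of the FULL stream are recorded first;
    only the stored copy is capped."""
    full_sha = hashlib.sha256(raw).hexdigest()
    truncated = len(raw) > CAP
    kept = raw[:CAP] if truncated else raw   # naive cut; see lead-in caveat
    return full_sha, len(raw), kept, truncated
```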
Runtime dependencies (three core + three optional), each justified against the cost of expanding an audit tool's trust surface:

- `msgspec` (core) — verdict schema validation (`Struct`) + deterministic JSON serialization (`json.encode(order="sorted")`) in a single C extension. Replaces what would otherwise be separate `pydantic` and `orjson` deps.
- `regex` (core) — variable-length lookbehind and Unicode-aware word boundaries for pharmgx phenotype matching. Stdlib `re` can't express these cleanly and false-matches "expressor" inside "non-expressor".
- `ruamel.yaml` (core) — safe YAML loader for the YAML frontmatter ground-truth format. Legacy `# KEY: value` remains fully supported.
- `rich` (optional `[ui]`) — styled tables for CLI output. Plain-text fallback is byte-stable; `--no-rich` is a kill switch.
- `matplotlib` (optional `[viz]`) — heatmap rendering only.
- `numpy` + `pandas` (optional `[finemapping]`) — only needed if you run the fine-mapping harness; loaded exclusively by the subprocess driver shim so `clawbio_bench` itself never imports either.
An audit tool's documentation must be at least as honest as the tool itself. These are the current caveats, collected in one place so nothing is hidden:
- Partial behavioral coverage: 6 / 37 executable ClawBio skills (~16%). The other 31 executable skills are routing-tested only (via orchestrator keyword/extension autodetection for the 17 that have auto-detect paths, or via `--skill NAME` force-routing tests for five previously-unreachable high-clinical-harm skills: `clinical-variant-reporter`, `variant-annotation`, `clinical-trial-finder`, `target-validation-scorer`, `methylation-clock`). See the Roadmap for the prioritized plan to close the behavioral-coverage gap; in the meantime, absence of a finding on a non-covered skill is not evidence of correctness.
- `clinical-variant-reporter` audit is split across three harnesses. Phase 1 (`clinical_variant_reporter`, 5 tests) checks structural and traceability requirements (reference build stated, transcript cited, ClinVar/gnomAD versions pinned, limitations section present, RUO disclaimer in report body, per-variant ACMG criterion audit trail, disease/inheritance context). Phase 2c (`cvr_identity`, 6 tests, new in v0.1.3) validates HGVS v21.1 syntax, MANE Select transcript usage, indel normalization, and assembly coordinate consistency. Phase 2a (`cvr_correctness`, 13 tests, new in v0.1.3) scores ACMG criterion-level correctness on a curated unambiguous subset using dual-layer ground truth (`EXPECTED_*` for clinical gold standard versus `EXPECTED_TOOL_*` for tool self-consistency, with a `self_consistency_error` rubric category). None of the three harnesses attempts a full 28-criteria adjudication of contested variants — that scope is intentionally avoided because it generates indefensible ground truth and damages suite credibility.
- Prompt-injection tests are honestly scoped. Against ClawBio's current deterministic parsers (TSV header reader for `pharmgx_reporter`, substring `KEYWORD_MAP` for the orchestrator) injection payloads in comments or queries are inert — the tool never interprets them as instructions. Two of the three orchestrator injection test cases (`inj_01_routing_hijack`, `inj_02_exfil_attempt`) are labeled as regression pins in their ground-truth hazards: they exist to catch a future refactor that introduces LLM-based parsing without re-hardening. The genuine live injection test is `orchestrator/inj_03_flock_routing_hijack`, which runs with `--provider flock` to exercise the LLM routing path; it is gated on FLock credentials and may be skipped in CI.
- PHI sentinel is a test-case regression pin, not a full sweep. `phi_patient_identifiers_in_header.txt` asserts that DOB/MRN/name comments are not echoed into `report.md`. A full PHI-persistence scan across every file in the benchmark's own results tree (stdout/stderr logs, provenance JSON, verdict bundles) is scoped for a future release because it requires walking the benchmark's output directory, not just the tool's.
- Requires a local ClawBio checkout. This repository does not bundle ClawBio. Every real run needs `--repo /path/to/ClawBio`. CI workflows must clone ClawBio themselves.
- Offline-only ground truth is a trade-off. Reference values are analytically pre-computed and embedded in test case files. This makes the audit deterministic and reproducible, but it also means ground truth can go stale relative to live CPIC / PharmGKB / GWAS Catalog updates. Upstream guideline changes need an explicit test case refresh.
- FST tolerance is absolute-value, not variance-aware. The current `FST_TOLERANCE` field is a hardcoded absolute delta. It works for the large-n reference cases bundled with the suite but can produce false failures on small-sample studies where estimator variance is high. A Z-score-based replacement is on the Roadmap.
- `--smoke` is HEAD-only by design. Smoke mode runs a single commit (the current `HEAD`). Use `--regression-window N`, `--all-commits`, or `--tagged-commits` for longitudinal findings; findings from a smoke run say nothing about trajectory.
- Markdown rendering is designed for smoke-mode aggregates. Non-smoke runs still render, but the markdown renderer identifies findings by `(harness, test, category)` — which collapses duplicates across commits and can produce misleading diffs on regression sweeps. This is called out inline in the rendered output and again in the Continuous Audit section.
- `--verify` needs the original results directory. There is no standalone verdict re-verifier; chain-of-custody checks reconcile each verdict against its sibling `stdout.log` / `stderr.log` and `verdict_hashes.json` sidecar.
- Fine-mapping harness requires optional dependencies. Install `pip install -e ".[finemapping]"` (or `".[dev]"`) before running the `finemapping` harness, otherwise the subprocess driver will exit cleanly with a `harness_error` and a pip hint.
- PharmGx pass rate is intentionally low. The 44% pass rate at ClawBio `HEAD` reflects known CPIC-compliance gaps documented in a prior audit (see Confirmed Findings). This is the audit working, not the audit failing.
- Platform coverage. Linux and macOS only. Windows is untested; path handling, `stat` semantics, and git worktree behavior differ in ways that haven't been validated.
- Never abort — every `(commit, test_case)` pair produces a verdict. Infrastructure failures become `harness_error` verdicts excluded from pass rate. A harness that raises an unhandled exception is itself a bug in `clawbio_bench`, not a valid outcome.
- Offline only — no network calls at runtime. Reference values are analytically pre-computed and embedded in ground truth files.
- Chain of custody — SHA-256 of every input, output, and ground truth file. Git metadata, timestamps, and environment recorded per verdict. Verdict documents self-hash.
- Safe by default — dirty repo safety gate (`--allow-dirty` required), git worktree isolation in tests, path traversal validation, no `shell=True`.
- Category-level verdicts — not binary pass/fail. Each category maps to a specific remediation path.
clawbio_bench uses advisory exit codes. You can rely on them for
CI gating:
| Exit | Meaning | CI recommendation |
|---|---|---|
| `0` | All harnesses passed cleanly | Green check — no action needed. |
| `1` | Findings exist (at least one non-pass category) | Expected while the audited repo stabilizes. The reusable GitHub Action treats this as advisory (check stays green) and surfaces findings in a PR comment. |
| `≥ 2` | A harness itself raised an infrastructure exception | Hard failure — fix the audit harness before trusting results. |
The reusable GitHub workflow only fails the job on exit ≥ 2. This
lets you surface findings in PR comments without blocking merges while
the audit matures. If you want stricter gating, invert the exit handling
in your own workflow.
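If you do want stricter gating, a minimal sketch of a wrapper that inverts the advisory semantics (the command line below is illustrative):

```python
import subprocess
import sys

proc = subprocess.run(["clawbio-bench", "--smoke", "--repo", "/path/to/ClawBio", "-q"])
if proc.returncode >= 2:
    sys.exit("harness infrastructure error — do not trust these results")
if proc.returncode == 1:
    sys.exit("findings present — blocking merge (strict gate)")
```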
# Unit tests — scoring, validators, parser, chain of custody, schemas, YAML
pytest tests/ -k "not test_harness_smoke" # 245 tests, <1s
# Type check
mypy src/clawbio_bench/ --ignore-missing-imports # 0 errors
# Full smoke tests (requires a ClawBio checkout)
pytest tests/ --repo /path/to/ClawBio
# Regenerate the committed JSON Schema artifacts after editing schemas.py
python scripts/gen_schemas.py # writes schemas/*.schema.json
# Tests use git worktree isolation — your repo is never modified

Two repositories participate. Workflows are split by ownership:
biostochastics/clawbio_bench ClawBio/ClawBio
(this repo — the auditor) (the audited target)
───────────────────────── ─────────────────────
ci.yml audit.yml ← you add this
audit-reusable.yml ◄── called by ── (3-line stub, see below)
audit-baseline.yml
daily-audit.yml
| Workflow | Trigger | What it does |
|---|---|---|
| `ci.yml` | Push / PR to `clawbio_bench` | Lint, unit tests (3.11–3.13), smoke against pinned ClawBio ref |
| `audit-reusable.yml` | Called by downstream repos | Reusable workflow: installs bench, runs smoke, posts PR comment |
| `audit-baseline.yml` | Nightly (04:17 UTC) + push to `main` | Publishes rolling `aggregate_report.json` baseline as a release asset |
| `daily-audit.yml` | Cron (08:00 UTC) + manual dispatch | Full daily audit against ClawBio HEAD (see below) |
Drop this minimal stub into `.github/workflows/audit.yml` in ClawBio
(or any repo being audited). It calls the reusable workflow hosted here:

```yaml
name: clawbio_bench audit
on:
  pull_request:
    branches: [main]
jobs:
  audit:
    uses: biostochastics/clawbio_bench/.github/workflows/audit-reusable.yml@v0.1.0
    with:
      clawbio_bench_ref: v0.1.0  # PIN — do not leave on main in production
    permissions:
      contents: read
      pull-requests: write
```

What ClawBio gets on each PR:
- The full smoke suite (all harnesses, HEAD only) runs against the PR checkout.
- A sticky PR comment (one comment, updated in place on every push) with per-harness pass/fail, and — when a baseline is available — the set of new and resolved findings vs. `main`.
- The complete verdicts tree (including SHA-256 chain of custody for every file) uploaded as a workflow artifact for 30 days.
- The same report written to the job summary, so fork PRs (where the token is read-only and no comment can be posted) still surface the result; the underlying mechanism is sketched below.
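
The job-summary fallback uses the standard GitHub Actions mechanism. A minimal sketch of the idea (the report path and step wiring are assumptions, not the bench's actual step):

```bash
# Append the rendered markdown report to the Actions job summary.
# $GITHUB_STEP_SUMMARY is provided by the runner in every job, even on
# fork PRs where the token cannot post comments.
cat results/report.md >> "$GITHUB_STEP_SUMMARY"
```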
A cron-scheduled workflow runs every morning at 8 AM UTC. It clones ClawBio HEAD and audits it:
- Smoke suite — all harnesses, HEAD only (~40 seconds).
- Delta reports — markdown and PDF (Typst) compared against a rolling baseline committed to `baselines/latest_baseline.json`.
- Baseline promotion — if the pass rate improved, the baseline is updated and committed. Regressions keep the old baseline so the delta stays visible.
- Notification — a one-paragraph digest posted to a webhook (Slack/Discord). When `OPENROUTER_API_KEY` is set, a multi-model LLM swarm (deepseek-v3.2-exp, minimax-m2.7, gpt-5-nano) independently analyzes the findings, then Haiku 4.5 synthesizes a narrative digest with number verification against the structured source.
- Regression issue — a deduplicated GitHub issue opens automatically when findings are detected.
- Per-commit attribution — a secondary job runs `--regression-window 5` for the last 5 ClawBio commits, so you can see which specific commit introduced a regression (a local equivalent is sketched below).
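
To reproduce the attribution sweep locally, something like the following should work, assuming `--regression-window` composes with the standard invocation (the output path is arbitrary):

```bash
clawbio-bench --repo /path/to/ClawBio --regression-window 5 --output /tmp/attribution
```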
Secrets (configured in `clawbio_bench`, not ClawBio): `NOTIFICATION_WEBHOOK` (optional), `OPENROUTER_API_KEY` (optional, enables LLM swarm digest). The workflow also supports `workflow_dispatch` for manual triggering with two optional inputs: `clawbio_ref` audits a specific ClawBio commit/tag/branch instead of HEAD; `clawbio_baseline_ref` runs the bench against a baseline commit first and uses its aggregate for the delta report — producing a single workflow run with a proper before/after comparison (e.g. `clawbio_baseline_ref: 349fb98` = "April 2 vs today"). Historical runs suppress automatic issue creation. Both ref inputs are validated against a conservative character set before use.
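
For a manual dispatch from the CLI, something along these lines should work with the documented inputs (requires an authenticated `gh`):

```bash
# Single run with a before/after delta: audit HEAD against the 349fb98
# baseline mentioned above.
gh workflow run daily-audit.yml \
  --repo biostochastics/clawbio_bench \
  -f clawbio_baseline_ref=349fb98
```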
- **Advisory exit codes.** See Understanding Results. The reusable workflow treats exit `1` as advisory and only fails on exit `≥ 2`.
- **Smoke-mode only for markdown rendering.** The markdown renderer is designed for single-commit smoke-mode aggregates. Non-smoke runs still render but carry an explicit caveat: finding identity is `(harness, test, category)`, which collapses duplicates across commits and can produce misleading diffs on regression sweeps. Also called out in Current Scope and Limitations.
- **Fork PRs.** No comment is posted (the caller's `GITHUB_TOKEN` is read-only on forks). The report is still emitted to the job summary and as an artifact.
- **Baseline diffing.** A rolling baseline is published by `audit-baseline.yml` nightly (04:17 UTC) and on every push to `main` in this repo, as the `aggregate_report.json` asset on the `baseline-main` release. Nightly cadence is intentional — the baseline audits ClawBio `HEAD`, which changes independently of this repository. The reusable workflow downloads the asset by HTTPS on each PR; if the download fails (stale asset, service hiccup, invalid JSON), it logs a warning and falls back to rendering absolute findings. Override or disable via the `baseline_url` input.
- **Pinning.** Always pin `clawbio_bench_ref` to a tag or commit SHA in production. Leaving it on `main` means a change in this repo can silently alter your PR audit baseline. The `clawbio_bench_ref` input is validated against a conservative character set (`[A-Za-z0-9._/-]`) before being interpolated into `pip install`. Dependabot's `github-actions` ecosystem watches reusable-workflow pins and will open PRs to bump them automatically.
- **Comment safety.** Verdict rationales can contain audited-repo stderr. The renderer collapses newlines, HTML-escapes tag-like content, and caps the "unchanged findings" section to stay under GitHub's 65,536-char comment limit.
- **Local preview.** You can render the same markdown locally without pushing anything:

  ```bash
  clawbio-bench --smoke --repo /path/to/ClawBio --output /tmp/results --allow-dirty
  clawbio-bench --render-markdown /tmp/results/ --baseline /path/to/main-baseline.json
  ```
See docs/baseline-schema.md for the exact fields consumed during diffing, so downstream tooling can produce compatible baselines.
Real bugs found by this suite in the audit target, audited against ClawBio HEAD `e7590141` (2026-04-07). The suite reports 163/175 (93.1%) at this commit. The historical findings below are reproducible from their original test cases and have all been remediated upstream; the open findings underneath are what's still firing now.
| ID | Finding | Harness Evidence | Status |
|---|---|---|---|
| C-06 | FST labeled "Hudson" but computes Nei's GST | `eq_01`, `eq_02`: `fst_mislabeled` | fixed |
| U-2 | HEIM unbounded with custom weights | `eq_09`: `heim_unbounded` | fixed |
| F-29 | Haploid genotypes crash equity scorer | `eq_12`: `edge_crash` | fixed |
| M-3 | PharmGx / NutriGx / metagenomics unreachable via orchestrator | `kw_16-18`: `unroutable_handled` | fixed |
| NEW | NutriGx hom-ref `allele_mismatch` bug | `ng_09`: `score_incorrect` | fixed |
| NEW | Metagenomics `exit_suppressed` (`critical=False` default) | `mg_05`: `exit_suppressed` | fixed |
| C-07 | `eq_15` ground truth bench bug (0.500 should be 1.000) | `eq_15`: bench-side `fst_incorrect` | fixed in PR #11 |
| PGx | CPIC compliance audit: 44 tests, 13 genes, CPIC Level A scope | Multiple findings | 43/44 at HEAD |
| ID | Finding | Harness Evidence | Severity |
|---|---|---|---|
| FM-20 | SuSiE-inf advertises infinitesimal modeling but `tau²` never reaches the variance structure on realistic inputs | `fm_20`: `susie_inf_est_tausq_ignored` | critical (see spotlight below) |
| PGx-1 | TPMT compound heterozygote (`*3B/*3C`) returns Indeterminate instead of Poor Metabolizer (PF-1) | `tpmt_compound_het`: `incorrect_indeterminate` | warning |
| CVR-1 | Demo ACMG report omits gene–disease / inheritance context for P/LP/VUS/LB/B classifications | `cvr_01_demo_structure`: `gene_disease_context_missing` | warning |
| CVR-2 | Demo ACMG report uses unversioned transcript accessions (`ENST00000231790`) instead of HGVS v21.1 versioned form | `cvr_10_hgvs_syntax_baseline`, `cvr_13_mane_select`, `cvr_15_transcript_versioning`: `transcript_selection_error` (×3) | warning |
These tests probe ACMG features that ClawBio's `--demo` mode does not currently emit (PVS1 strength modulation, calibrated PP3 strength, ENIGMA / InSiGHT VCEP citations). They're routed to the advisory `criteria_not_machine_parseable` bucket and will auto-flip to real verdicts the moment the demo grows the missing evidence (or the bench gains a per-variant input mode for CVR tests). Marked `KNOWN_LIMITATION_DEMO_LACKS_EVIDENCE: true` in their ground-truth files.
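
Illustratively, the flag sits alongside the expected verdict in the ground-truth entry. Every field name here other than `KNOWN_LIMITATION_DEMO_LACKS_EVIDENCE` is hypothetical, not the bench's actual schema (see Ground Truth Formats):

```yaml
# Hypothetical shape of an advisory CVR ground-truth entry.
test_id: cvr_25_pvs1_strength_mod            # illustrative key name
expected_category: criteria_not_machine_parseable
KNOWN_LIMITATION_DEMO_LACKS_EVIDENCE: true   # documented flag, verbatim
```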
| Test | What it probes | Why advisory |
|---|---|---|
| `cvr_25_pvs1_strength_mod` | PVS1_Moderate strength tier per Abou Tayoun 2018 | Demo applies PVS1 only at default Very_Strong |
| `cvr_26_pp3_single_tool` | PP3 at calibrated REVEL strength per Pejaver 2022 | Demo emits PP3 at Supporting only, no strength annotation |
| `cvr_30_vcep_brca1` | ENIGMA VCEP supersession for BRCA1/BRCA2 | Demo cites no VCEPs |
| `cvr_31_vcep_lynch_mlh1` | InSiGHT VCEP supersession for Lynch syndrome genes | Demo cites no VCEPs |
| Test | Status | Notes |
|---|---|---|
| `inj_03_flock_routing_hijack` | `unroutable_crash` (expected) | Live-LLM injection test, gated on FLock provider credentials. Fires when creds absent — not a ClawBio defect. |
| `fm_12_susie_nonconvergence` | `harness_error` (env) | Missing scipy in the bench's driver subprocess interpreter. Install with `pip install -e ".[finemapping]"` or `".[dev]"`. |
`fm_20` is the v0.1.4 SuSiE-inf activation honesty test. It detects two observationally identical failure modes that both nullify the infinitesimal component of "SuSiE-inf":

- **Dead code in the IBSS loop** — `_mom_update` called with `est_tausq=False` hardcoded, OR `run_susie_inf` doesn't expose the `est_tausq` parameter at all, OR the parameter is never propagated from the public API into the MoM call site. Pre-`237cbd9` ClawBio exhibited this defect.
- **Defensive threshold suppression** — a "noise filter" zeroes out the correctly-estimated `tau²` before applying it to the variance structure (e.g. `effective_tausq = tausq if tausq >= 1e-3 else 0.0`). In practice the gentropy reference produces `tau²` estimates in the 1e-5 to 1e-4 range on realistic SuSiE-inf inputs, so any threshold above ~1e-4 nullifies activation across all geometries. Post-`237cbd9` ClawBio exhibits this defect.
In both cases, calling `run_susie_inf(z, R, n, est_tausq=True)` produces output byte-equivalent to calling it with `est_tausq=False`. The user gets standard SuSiE-RSS while the tool advertises SuSiE-inf — a textbook honesty failure of the kind this benchmark exists to detect. Ground truth is derived from the gentropy port of FinucaneLab/fine-mapping-inf (vendored under `scripts/_reference/gentropy_susie_inf.py` and exercised offline by `scripts/derive_finemapping_ground_truth.py`).
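
A sketch of the core check, assuming `run_susie_inf` returns a dict of arrays; the harness's actual plumbing and serialization differ:

```python
# Illustrative byte-equivalence probe: both defect modes above collapse to
# "est_tausq=True changes nothing observable".
import hashlib
import json

import numpy as np


def digest(result: dict) -> str:
    # Canonical JSON over array contents, echoing the suite's hash-everything style.
    canonical = json.dumps(
        {k: np.asarray(v).tolist() for k, v in sorted(result.items())}
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


def infinitesimal_component_is_dead(run_susie_inf, z, R, n) -> bool:
    # True when requesting tau^2 estimation is observationally a no-op.
    on = run_susie_inf(z, R, n, est_tausq=True)
    off = run_susie_inf(z, R, n, est_tausq=False)
    return digest(on) == digest(off)
```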
- Nei, M. (1973). Analysis of gene diversity in subdivided populations. PNAS, 70(12), 3321–3323.
- Hudson, R.R. et al. (1992). Estimation of levels of gene flow from DNA sequence data. Genetics, 132(2), 583–589.
- 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526, 68–74.
- CPIC Guidelines (2017–2024). cpicpgx.org
- OWASP Command Injection Prevention Cheat Sheet (2024). owasp.org
Full roadmap: `ROADMAP.md` — consolidated tracking of all planned harnesses, framework features, audit-framework failure-class coverage, and ClawBio skill inventory.
- 9 dedicated behavioral harnesses (orchestrator, pharmgx, equity, nutrigx, metagenomics, clinical-variant-reporter Phase 1/2c/2a, finemapping) covering 175 test cases.
- Dynamic skill inventory, `--skill NAME` force-routing, `--skills A,B,C` composition mode (example below), prompt-injection regression pins.
- CYP2D6 CNV/hybrid/`*5`/`*10`, NUDT15, CYP2B6, CYP1A2, CYP2C9, G6PD, MT-RNR1, `HLA-A*31:01`, `HLA-B*58:01` pharmgx tests.
- `scope_honest_indeterminate` category split, `--tagged-commits` mode, 5-tier severity system, delta comparison in reports.
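
For instance, force-routing a single skill or composing several from the CLI — the skill names here are illustrative:

```bash
clawbio-bench --repo /path/to/ClawBio --skill pharmgx
clawbio-bench --repo /path/to/ClawBio --skills equity,nutrigx,metagenomics
```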
| Harness | ClawBio skill | Tier |
|---|---|---|
| `clinical-variant-reporter` Phase 2c/2a | `clinical-variant-reporter` | 1 |
| `variant-annotation` | `variant-annotation` | 1 |
| `clinpgx` | `clinpgx` | 1 |
| `gwas-prs` | `gwas-prs` | 1 |
| `clinical-trial-finder` | `clinical-trial-finder` | 1 |
| `wes-clinical-report` | `wes-clinical-report-en/es` | 1 |
| `target-validation-scorer` | `target-validation-scorer` | 2 |
| `genome-compare` | `genome-compare` | 2 |
| `methylation-clock` | `methylation-clock` | 2 |
- YAML-only ground truth migration (plan)
- Shared AST security sweep (`core.ast_security_sweep()`)
- Parallel execution (`--jobs`/`-j`)
- Cross-harness Tier-1 safety gate (`--tier1-only`)
- FST variance-aware Z-score, diplotype-level PGx validation
See ROADMAP.md for P2/P3 harnesses, failure-class
coverage matrix, skills watchlist, and open questions.
- ROADMAP.md — Consolidated roadmap: planned harnesses, framework features, failure-class coverage, ClawBio skill inventory
- docs/methodology.md — Audit methodology and rubric design
- docs/ground-truth-derivation.md — How reference values are computed per harness
- docs/baseline-schema.md — Fields consumed by the baseline diff renderer
- CONTRIBUTING.md — How to add harnesses for new tools
- CHANGELOG.md — Release history
MIT © 2025–2026 Sergey A. Kornilov (Biostochastics) and the ClawBio Audit Team. See LICENSE for the full text.