clawbio_bench

⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠟⠛⠉⠉⠉⠈⠉⠉⠙⠛⠿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠟⠋⠀⠀⢀⣠⣤⣤⣤⣤⣤⣤⣄⡀⠈⠛⢿⣿⣿⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠋⠀⢀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣦⡀⠈⠻⣿⣿⣿⣿⣿⣿   ██████ ██       █████  ██     ██ ██████  ██  ██████
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⠁⠀⣠⣾⣿⣿⣿⣿⡿⢿⠛⡟⢿⣿⣿⣿⣿⣷⠀⠀⢹⣿⣿⣿⣿⣿  ██      ██      ██   ██ ██     ██ ██   ██ ██ ██    ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⢠⣿⣿⣿⣿⣿⡿⡋⠃⠈⠀⠀⠈⠈⢻⣿⣿⣿⠀⠀⠀⣿⣿⣿⣿⣿  ██      ██      ███████ ██  █  ██ ██████  ██ ██    ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡃⣿⣿⣿⣿⣿⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣻⣿⣿⣿⣿  ██      ██      ██   ██ ██ ███ ██ ██   ██ ██ ██    ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡇⠹⣿⣿⣿⣿⣿⣦⡂⡀⠀⠀⠀⠀⠀⣰⣶⣶⣶⠀⠀⠀⣿⣿⣿⣿⣿   ██████ ███████ ██   ██  ███ ███  ██████  ██  ██████
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣷⠀⠉⠻⣿⣿⣿⣿⣿⣧⣼⣀⣆⣼⣾⣿⣿⣿⣿⠀⠀⢰⣿⣿⣿⣿⣿
⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣧⡀⠀⠘⠻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠃⠀⣠⣿⣿⣿⣿⣿⣿          ██████  ███████ ███    ██  ██████ ██   ██
⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⠛⣿⣿⣄⡀⠀⠈⠙⠻⠿⠿⠿⠿⠿⠿⠟⠋⠀⣀⣼⣿⣿⣿⣿⣿⣿⣿          ██   ██ ██      ████   ██ ██      ██   ██
⣿⣿⣿⣿⣿⣿⣿⠿⠃⠀⢀⣼⣿⣿⣿⣦⣄⣀⠀⠀⠀⠀⠀⠀⢀⣀⣤⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿          ██████  █████   ██ ██  ██ ██      ███████
⣿⣿⣿⣿⣿⠟⠉⠀⣠⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿          ██   ██ ██      ██  ██ ██ ██      ██   ██
⣿⣿⡿⢋⠁⢀⣤⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿  ███████ ██████  ███████ ██   ████  ██████ ██   ██
⢟⠉⠀⣠⣼⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿
⣷⣤⣾⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿

What this is. clawbio_bench is a standalone Python audit suite that evaluates the external ClawBio bioinformatics platform for safety, correctness, and honesty. It runs behavioral harnesses against a local clone of ClawBio, compares each skill's output against analytically derived ground truth, and emits tamper-evident JSON verdicts with a SHA-256 chain of custody over every input, output, and artifact.

What this is NOT. It is not part of ClawBio, it does not bundle ClawBio, and it is not a performance benchmark. To run it you need a local checkout of the ClawBio repository (see Requirements).

Most bioinformatics benchmarks answer "does it run?" This suite answers "is it safe, correct, and honest?" — with machine-verifiable evidence at every step.

For ClawBio / tool authors: run clawbio-bench --smoke in CI to catch regressions in safety and correctness across commits.
For external auditors: use the rubric design, ground-truth methodology, and verdict format as a reference for auditing other computational biology tools.
Audit report (PDF) — a 21-page report from a 7-harness smoke run against ClawBio HEAD captured at v0.1.2 (125/147 tests passing). The current suite (v0.1.4) runs nine harnesses with 175 test cases (see Coverage Scope) and reports 163/175 (93.1%) against ClawBio HEAD e7590141 (2026-04-07), with the open finding spotlighted in Confirmed Findings below.

The three audit dimensions

Dimension	What it means	Example finding category
Safety	Does the tool refuse or isolate unsafe inputs? No `shell=True`, no crashes on malformed genotypes, no silent suppression of non-zero exits.	`injection_blocked`, `edge_handled`, `exit_handled`
Correctness	Does the tool produce the right numerical answer against an analytically derived reference? FST values, phenotype calls, PRS, HEIM bounds.	`fst_correct`, `correct_determinate`, `score_correct`
Honesty	Does the tool report what it actually did, not what it claims to do? An "honesty" failure is a tool that computes Nei's GST while labeling its output "Hudson's FST", or a CSV mode that inflates coverage beyond what was measured.	`fst_mislabeled`, `csv_inflated`, `disclosure_failure`

"Honesty" is the distinctive axis. Correctness failures are usually obvious; honesty failures are subtle, dangerous, and rarely caught by conventional test suites.

Requirements
Install
Quick Start
How It Works
Process Isolation — Does ClawBio Actually Run?
What It Produces
Core Concepts
Coverage Scope
Harnesses
Ground Truth Formats
Verdict Schema as External Contract
Core Capabilities
Implementation Safeguards
Current Scope and Limitations
Design Principles
Understanding Results (Exit Codes)
Run Tests
Continuous Audit (GitHub Actions)
Confirmed Findings at ClawBio HEAD
References
Roadmap
Documentation
License

Requirements

Python 3.11 or later (3.11, 3.12, 3.13, 3.14 are all CI-tested)
Git available on PATH
A local clone of ClawBio — this repository does not bundle ClawBio. Every harness run needs a --repo /path/to/ClawBio argument.
numpy and pandas in the benchmark virtualenv — ClawBio's clawbio.common package unconditionally imports scrna_io which requires numpy and pandas at import time, even for harnesses that don't use scRNA functionality. Without these packages, every harness will report mass unroutable_crash / harness_error verdicts from ModuleNotFoundError rather than real audit findings. Install the [dev] or [finemapping] extras to get them automatically.
Typst CLI on PATH — required only for PDF report generation (python scripts/generate_report.py). Install via brew install typst (macOS), cargo install typst-cli, or download from typst.app. The benchmark itself runs without Typst; only the report renderer needs it.
Operating system: Linux or macOS (Windows is untested)

Install

python3.11 -m venv .venv && source .venv/bin/activate
pip install -e .                    # core: msgspec, regex, ruamel.yaml
pip install -e ".[dev]"             # + pytest, ruff, mypy, pre-commit, rich, numpy, pandas
pip install -e ".[all]"             # + viz, ui, finemapping, scikit-learn (for CI / daily audit)
pip install -e ".[viz]"             # + matplotlib for heatmap rendering
pip install -e ".[ui]"              # + rich for styled CLI output
pip install -e ".[finemapping]"     # + numpy, pandas for the fine-mapping subprocess driver

The core install is deliberately minimal — every runtime dependency expands the trusted base of an audit tool, so each one has to justify itself (see Core Capabilities).

Quick Start

The minimum path from clone to first result, assuming you already have a ClawBio checkout at ~/src/ClawBio:

git clone https://github.com/biostochastics/clawbio_bench.git
cd clawbio_bench
python3.11 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"               # includes numpy/pandas needed by ClawBio
clawbio-bench --smoke --repo ~/src/ClawBio

That runs every harness against ClawBio's HEAD commit (about 25 seconds) and writes results to ./results/suite/<timestamp>/.

Other common invocations

# List available harnesses and test case counts
clawbio-bench --list

# Machine-readable harness inventory (for scripting / dashboards)
clawbio-bench --list --json

# Single harness only
clawbio-bench --smoke --harness orchestrator --repo ~/src/ClawBio

# Preview what would run without executing
clawbio-bench --smoke --repo ~/src/ClawBio --dry-run

# Last 10 commits — longitudinal sweep, quiet mode for CI
clawbio-bench --regression-window 10 --repo ~/src/ClawBio -q

# Full longitudinal sweep on a specific branch
clawbio-bench --all-commits --branch main --repo ~/src/ClawBio

# Tagged commits only — benchmark at each release/milestone
clawbio-bench --tagged-commits --repo ~/src/ClawBio

# Custom test case directory
clawbio-bench --smoke --harness equity --inputs /my/test_cases --repo ~/src/ClawBio

# Render heatmap PNG (requires the [viz] extra)
clawbio-bench --heatmap results/suite/20260404_120000/

# Render the same markdown report the CI workflow posts to PRs
clawbio-bench --render-markdown results/suite/20260404_120000/ \
              --baseline /path/to/main-baseline.json

# Deep chain-of-custody verification (three layers — see What It Produces)
clawbio-bench --verify results/suite/20260404_120000/

# Also works as a module
python -m clawbio_bench --smoke --repo ~/src/ClawBio

# Version + provenance
clawbio-bench --version

CLI flag reference

Flag	Purpose
`--smoke`	Fast check: HEAD commit only, all harnesses. The default CI gate.
`--regression-window N`	Replay every test case across the last `N` commits on the current branch.
`--all-commits`	Every commit on a branch from the first audit-era commit forward (slowest).
`--tagged-commits`	Run against tagged commits only (releases / milestones). Heatmaps annotate release names on the timeline.
`--commits SHA,SHA,...`	Explicit commit list (diagnostic mode).
`--branch NAME`	Which branch to walk (default: `main`).
`--harness NAME`	Run only one harness (e.g. `equity`, `pharmgx`). Omit to run all nine.
`--inputs PATH`	Override the bundled test case directory for a single harness.
`--output DIR`	Where results land. Default: `./results/suite/<timestamp>/`.
`--repo PATH`	Required for every real run: local ClawBio checkout.
`--list`	Print the harness registry and test case counts.
`--list --json`	Same, machine-readable.
`--dry-run`	Show the plan (commits × harnesses × test cases) without executing.
`--allow-dirty`	Safety override: run even when the ClawBio working tree is dirty. By default a dirty repo is a hard stop to protect chain of custody.
`--verify DIR`	Three-layer chain-of-custody re-verification of an existing results directory.
`--heatmap DIR`	Render `heatmap.png` from a results directory (needs `[viz]` extra).
`--render-markdown DIR`	Render the PR-comment markdown report a CI run would emit.
`--baseline PATH`	Baseline aggregate report (or directory) to diff against for `--render-markdown`.
`-q` / `--quiet`	Suppress per-test progress.
`--no-rich`	Force plain-text tables even when `rich` is installed (byte-stable output for diffs).
`--version`	Print version + provenance.

How It Works

clawbio_bench runs a (commits × test_cases) matrix. For every commit you select, every test case you select is executed against that commit, and every (commit, test_case) pair produces exactly one verdict — even if the harness itself crashes (infrastructure failures become harness_error verdicts rather than aborting the sweep).

        ┌──────────────────────────────────────────────────────────┐
        │           clawbio-bench --smoke / --regression-window     │
        └──────────────────────────────────────────────────────────┘
                                   │
                                   ▼
          ┌────────────────── Matrix runner ──────────────────┐
          │                                                    │
          │   for commit in selected_commits:                  │
          │     git worktree checkout commit   ← isolation     │
          │     clean_workspace (incl. submodules)             │
          │                                                    │
          │     for test_case in harness.test_cases:           │
          │       ground_truth = parse_ground_truth(...)       │
          │       execution   = capture_execution(...)         │
          │       scored      = harness.run_single_*(...)      │
          │       verdict     = build_verdict_doc(...)         │
          │       save_verdict(); save_execution_logs()        │
          │                                                    │
          └────────────────────────────────────────────────────┘
                                   │
                                   ▼
                  aggregate_report.json · heatmap.png ·
                  verdict_hashes.json (per harness) ·
                  rendered markdown (optional)

Step by step:

Resolve the commit set. --smoke → [HEAD]. --regression-window N → last N commits on the branch. --all-commits → every commit from the audit-era root. --tagged-commits → only tagged (release) commits. --commits SHA,... → an explicit list.
Isolate each commit. Every commit is checked out in a git worktree so the user's working tree is never modified, and submodules are recursively reset between commits so a dirty submodule from commit N cannot poison commit N+1.
Run every test case. Each harness iterates its bundled test cases, parses ground truth, invokes the ClawBio skill under audit via capture_execution() (subprocess with timeout, truncation cap, and optional argparse-fallback retry), and captures stdout, stderr, exit_code, and any artifact files.
Score against ground truth. Each harness applies its rubric — category-level, not binary. A finding lands in one of ~7–10 named buckets per harness (e.g. fst_mislabeled ≠ fst_incorrect), each mapping to a specific remediation.
Emit a verdict document. build_verdict_doc() assembles a canonical, byte-sorted JSON document with SHA-256 hashes of every input, output, ground truth file, and (embedded at write time) the verdict itself.
Aggregate. Per-harness summary.json / all_verdicts.json, a suite-level aggregate_report.json, heatmap_data.json, and a tamper-evident verdict_hashes.json sidecar.

The matrix model is why --smoke is ~30 seconds (1 commit × 175 tests) but --regression-window 20 is several minutes (20 × 175).

Process Isolation — Does ClawBio Actually Run?

Yes — ClawBio's actual code executes for real. Not mocked, not simulated, not reimplemented. What it does not do is share a Python interpreter with the benchmark. Every test case spawns a fresh OS process via subprocess.run() (no shell=True, ever), runs real ClawBio code in that process, and the benchmark reads the result from captured stdout, stderr, exit code, and files on disk.

The benchmark package itself contains zero import clawbio / from clawbio … lines. This is the "loose-coupling invariant": the auditor and the auditee cannot contaminate each other's interpreter state, the audit tool's trust surface stays pinned to its three runtime deps (msgspec, regex, ruamel.yaml), and a missing or broken skill at some commit produces a clean exit 127 instead of a ModuleNotFoundError that corrupts the whole longitudinal sweep.

View 1 — process isolation at the OS level

═══════════════════════════════════════════════════════════════════════════════
  HOST OS (macOS / Linux)
═══════════════════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────────────────────┐
  │  PROCESS #1  ·  pid 42000  ·  the auditor                               │
  │  ────────────────────────────────────────                               │
  │  argv:  clawbio-bench --smoke --repo ~/src/ClawBio                      │
  │  python: /Users/you/clawbio_bench/.venv/bin/python3.14                  │
  │  imports: clawbio_bench.cli, .core, .harnesses.*, msgspec, regex,       │
  │           ruamel.yaml                    ← 3 deps, that's the full list │
  │  memory: ~60 MB                                                         │
  │                                                                         │
  │    ┌─ clawbio_bench.cli.main()                                          │
  │    │    resolves commit set → [HEAD]                                    │
  │    │    for each harness in HARNESS_REGISTRY:                           │
  │    │      for each test_case:                                           │
  │    │        ┌─ harness_core.capture_execution(cmd=[...])                │
  │    │        │    build cmd list                                         │
  │    │        │    subprocess.run(cmd, cwd=repo_path, timeout=…,          │
  │    │        │                   capture_output=True, shell=False)       │
  │    │        │         │                                                 │
  │    │        │         │  ← fork() + execve()  (OS boundary)             │
  │    │        │         │                                                 │
  │    │        │         ▼                                                 │
  │    │        │    ╔════════════════════════════════════════════════════╗ │
  │    │        │    ║ PROCESS #2  ·  pid 42017  ·  the auditee           ║ │
  │    │        │    ║ ──────────────────────────────────────             ║ │
  │    │        │    ║ argv:  /.../.venv/bin/python3.14 \                 ║ │
  │    │        │    ║        ~/src/ClawBio/skills/pharmgx-reporter/ \    ║ │
  │    │        │    ║        pharmgx_reporter.py \                       ║ │
  │    │        │    ║        --input  pg_01_cyp2c19.txt \                ║ │
  │    │        │    ║        --output results/…/tool_output/ \           ║ │
  │    │        │    ║        --no-enrich                                 ║ │
  │    │        │    ║ cwd:  ~/src/ClawBio      ← ClawBio's own worktree  ║ │
  │    │        │    ║ imports: WHATEVER pharmgx_reporter.py IMPORTS      ║ │
  │    │        │    ║          (transitive closure of the target repo)   ║ │
  │    │        │    ║ memory: separate heap, separate GIL, separate      ║ │
  │    │        │    ║         sys.modules, separate logging state        ║ │
  │    │        │    ║                                                    ║ │
  │    │        │    ║   ┌─ pharmgx_reporter.main()                       ║ │
  │    │        │    ║   │    parse 23andMe TSV                           ║ │
  │    │        │    ║   │    call CPIC phenotype logic                   ║ │
  │    │        │    ║   │    write report.md + result.json               ║ │
  │    │        │    ║   │    print rationale to stdout                   ║ │
  │    │        │    ║   └─ sys.exit(0)                                   ║ │
  │    │        │    ╚════════════════════════════════════════════════════╝ │
  │    │        │         │                                                 │
  │    │        │         │  ← _exit()   (OS boundary, same direction)      │
  │    │        │         ▼                                                 │
  │    │        │    auditor receives:                                      │
  │    │        │      · exit_code: int                                     │
  │    │        │      · stdout:   bytes  (up to 10 MB, else truncated)     │
  │    │        │      · stderr:   bytes  (same cap)                        │
  │    │        │      · wall_seconds: float                                │
  │    │        │      · (files on disk in tool_output/)                    │
  │    │        │                                                           │
  │    │        └─ score verdict · compute SHA-256 on inputs/outputs/logs · │
  │    │           build_verdict_doc() · save_verdict() (atomic write)      │
  │    └──────────                                                          │
  └─────────────────────────────────────────────────────────────────────────┘

  Process #1 never imports anything from ~/src/ClawBio.
  Process #2 is a fresh Python interpreter — anything the target imports
  happens entirely in its own sys.modules and dies with the process.
  Communication is four channels only: argv in, stdout + stderr + exit + files out.

View 2 — the matrix loop, zoomed out

          ┌────────────────────────────────────────────────────────────┐
          │                   clawbio-bench --smoke                    │
          │                   (process #1, the auditor)                │
          └────────────────────────────────────────────────────────────┘
                                       │
                ┌──────────────────────┼──────────────────────┐
                ▼                      ▼                      ▼
         commits = [HEAD]      test_cases per harness   HARNESS_REGISTRY
                │                      │                      │
                └────────────┬─────────┴──────────────────────┘
                             │
                             ▼
     ┌────────────────────────────────────────────────────────────┐
     │  for commit in commits:                                    │
     │    git worktree checkout commit       (subprocess to git)  │
     │    clean_workspace(submodules=True)   (subprocess to git)  │
     │                                                            │
     │    for harness in HARNESS_REGISTRY:                        │
     │      for test_case in harness.test_cases:                  │
     │                                                            │
     │        ┌───────────────────────────────────────────────┐   │
     │        │  ground_truth = parse_ground_truth(...)       │   │
     │        │                                               │   │
     │        │  cmd = [                                      │   │
     │        │    sys.executable,                            │   │
     │        │    str(repo_path/"skills"/"<skill>"/"*.py"),  │   │
     │        │    "--input",  str(payload),                  │   │
     │        │    "--output", str(tool_output_dir),          │   │
     │        │    …,                                         │   │
     │        │  ]                                            │   │
     │        │                                               │   │
     │        │  execution = capture_execution(cmd, cwd=repo) │   │
     │        │                     │                         │   │
     │        │                     ▼                         │   │
     │        │        ╔══════════════════════════╗           │   │
     │        │        ║ SUBPROCESS                ║          │   │
     │        │        ║  python skills/.../*.py  ║           │   │
     │        │        ║  ← actual ClawBio code → ║           │   │
     │        │        ║  runs for real           ║           │   │
     │        │        ║  writes artifacts        ║           │   │
     │        │        ║  exits with code N       ║           │   │
     │        │        ╚══════════════════════════╝           │   │
     │        │                     │                         │   │
     │        │                     ▼                         │   │
     │        │  stdout, stderr, exit_code, tool_output/*.*   │   │
     │        │                     │                         │   │
     │        │                     ▼                         │   │
     │        │  verdict = score_<harness>_verdict(           │   │
     │        │    ground_truth, execution, analysis          │   │
     │        │  )                                            │   │
     │        │                                               │   │
     │        │  doc = build_verdict_doc(                     │   │
     │        │    verdict, execution, commit_meta,           │   │
     │        │    chain_of_custody={                         │   │
     │        │      payload_sha256, stdout_sha256,           │   │
     │        │      stdout_full_sha256, driver_sha256, …     │   │
     │        │    }                                          │   │
     │        │  )                                            │   │
     │        │                                               │   │
     │        │  save_verdict(doc, …)   ← atomic write        │   │
     │        │  _verdict_sha256 embedded + re-verified       │   │
     │        └───────────────────────────────────────────────┘   │
     │                                                            │
     └────────────────────────────────────────────────────────────┘
                             │
                             ▼
                  aggregate_report.json +
                  verdict_hashes.json per harness

View 3 — the three invocation patterns side by side

┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern A — CLI subprocess                                                  │
│ Used by: orchestrator, pharmgx, equity, nutrigx, metagenomics, cvr          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   auditor (pid 42000)                                auditee (pid 42017)    │
│                                                                             │
│   cmd = [                                                                   │
│     sys.executable,                                                         │
│     "~/src/ClawBio/skills/pharmgx-reporter/pharmgx_reporter.py",            │
│     "--input",  "pg_01.txt",                                                │
│     "--output", "results/.../tool_output",                                  │
│     "--no-enrich",                                                          │
│   ]                                                                         │
│                                                                             │
│   subprocess.run(cmd, cwd=~/src/ClawBio, shell=False, capture_output=True)  │
│   ──────────────────────────────────────▶┌──────────────────────────────┐   │
│                                          │ python pharmgx_reporter.py   │   │
│                                          │                              │   │
│                                          │ imports:                     │   │
│                                          │   · argparse                 │   │
│                                          │   · pandas (if ClawBio uses) │   │
│                                          │   · whatever else the tool   │   │
│                                          │     wants                    │   │
│                                          │                              │   │
│                                          │ reads  pg_01.txt             │   │
│                                          │ writes tool_output/          │   │
│                                          │ prints to stdout / stderr    │   │
│                                          │ sys.exit(0)                  │   │
│                                          └──────────────────────────────┘   │
│   ◀────────────────────────────────────── exit, stdout, stderr              │
│                                                                             │
│   scores verdict, hashes everything                                         │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern B — Subprocess driver shim                                          │
│ Used by: finemapping (because the target has NO CLI, just library modules)  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   auditor (pid 42000)                                driver  (pid 42017)    │
│                                                                             │
│   cmd = [                                                                   │
│     sys.executable,                                                         │
│     "<clawbio_bench>/drivers/finemapping_driver.py",  ← OUR file, not       │
│     "--skill-dir", "~/src/ClawBio/skills/fine-mapping",  ClawBio's          │
│     "--inputs",    "test_case/inputs.json",                                 │
│     "--output",    "result.json",                                           │
│   ]                                                                         │
│                                                                             │
│   subprocess.run(cmd, cwd=~/src/ClawBio, shell=False)                       │
│   ──────────────────────────────────────▶┌──────────────────────────────┐   │
│                                          │ python finemapping_driver.py │   │
│                                          │                              │   │
│                                          │ sys.path.insert(0,           │   │
│                                          │   "<skill-dir>")             │   │
│                                          │                              │   │
│                                          │ ┌──────────────────────────┐ │   │
│                                          │ │ from core.abf     import │ │   │
│                                          │ │   approximate_bayes_...  │ │   │
│                                          │ │ from core.susie   import │ │   │
│                                          │ │   susie_ibss             │ │   │
│                                          │ │ from core.credible_sets  │ │   │
│                                          │ │   import build_cs        │ │   │
│                                          │ │   ↑ THIS is where        │ │   │
│                                          │ │     ClawBio code gets    │ │   │
│                                          │ │     imported — but in    │ │   │
│                                          │ │     a SEPARATE Python    │ │   │
│                                          │ │     interpreter that     │ │   │
│                                          │ │     will die in ~1 s     │ │   │
│                                          │ └──────────────────────────┘ │   │
│                                          │                              │   │
│                                          │ run ABF / SuSiE              │   │
│                                          │ json.dump(result) → stdout   │   │
│                                          │ sys.exit({0,1,2})            │   │
│                                          └──────────────────────────────┘   │
│   ◀────────────────────────────────────── JSON on stdout                    │
│                                                                             │
│   parses JSON, scores, hashes driver file itself (driver_sha256)            │
│                                                                             │
│   ▶ The auditor's own Python process STILL never touches ClawBio.           │
│     The driver is a clawbio_bench DATA FILE — bundled under drivers/        │
│     but explicitly not imported from anywhere in the package.               │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ Pattern C — AST static analysis                                             │
│ Used by: metagenomics (only, as a secondary channel alongside Pattern A)    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   auditor (pid 42000, no subprocess spawned)                                │
│                                                                             │
│   source_path = repo_path/"skills"/"claw-metagenomics"/"metagenomics.py"    │
│                                                                             │
│   text = source_path.read_text(errors="replace")    ← reads AS TEXT         │
│   tree = ast.parse(text, filename=str(source_path)) ← builds syntax tree    │
│                                                        NO CODE EXECUTES     │
│                                                                             │
│   for node in ast.walk(tree):                                               │
│     if subprocess.run(..., shell=True)  — found → injection_succeeded       │
│     if run_command(critical=False)      — found → exit_suppressed           │
│     …                                                                       │
│                                                                             │
│   ▶ ast.parse() is lexer + parser, not an interpreter. None of the          │
│     source's side effects fire. It's the same mechanism `ruff`, `mypy`,     │
│     and `bandit` use to read code without running it.                       │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Why process isolation matters (the design rationale)

No shared interpreter state. Import side effects, monkey-patches, global logging config, sys.path mutations, C-extension crashes — none of them can contaminate the auditor if the target never runs in the auditor's process.
Subprocess-only bugs become visible. Exit-code handling, stderr channel discipline, argparse drift, signal semantics, non-zero-exit suppression — these are invisible to an in-process caller and are the exact classes of bug the exit_suppressed / disclosure_failure categories exist to catch.
Trust surface stays minimal. clawbio_bench declares three runtime deps (msgspec, regex, ruamel.yaml). If it imported ClawBio, ClawBio's entire transitive dependency closure would become part of the audit tool's trusted base — and an audit tool whose trust surface includes the code it audits is broken by construction.
Graceful failure when the skill doesn't exist yet. Longitudinal sweeps walk commits from before a given skill existed. Subprocess invocation naturally returns FileNotFoundError / exit 127 on a missing tool path rather than raising ModuleNotFoundError inside the benchmark and corrupting the whole run.

What It Produces

Every run writes a single timestamped directory:

results/suite/20260404_120000/
├── aggregate_report.json               # suite-level rollup: pass/fail by harness, counts, env
├── orchestrator/
│   ├── manifest.json                   # run parameters, env, commit set, ground_truth_refs
│   ├── summary.json                    # category histogram, pass rate, persistent failures
│   ├── all_verdicts.json               # flat list of every verdict in the run
│   ├── heatmap_data.json               # commits × test_cases grid, category-coded
│   ├── verdict_hashes.json             # {rel_path: _verdict_sha256} sidecar index
│   └── <commit_sha>/
│       └── <test_case>/
│           ├── verdict.json            # ◄─── the canonical unit of audit
│           ├── stdout.log              # captured ClawBio stdout (pre-truncation hash in verdict)
│           └── stderr.log              # captured ClawBio stderr
├── equity/      …same layout…
├── pharmgx/     …same layout…
├── nutrigx/     …same layout…
├── metagenomics/…same layout…
└── finemapping/ …same layout…

Example verdict JSON

A real verdict.json (abbreviated; hashes and rationales trimmed for readability — actual output is canonical, sorted-key JSON with every field fully populated):

{
  "benchmark_name": "pharmgx-reporter",
  "benchmark_version": "0.1.0",
  "start_time_utc": "2026-04-04T12:00:03.114Z",
  "timestamp_utc":  "2026-04-04T12:00:04.892Z",
  "wall_clock_seconds": 1.778,
  "commit": {
    "sha": "a1b2c3d4e5f6…",
    "short_sha": "a1b2c3d",
    "author": "ClawBio Team",
    "date_utc": "2026-03-28T09:14:22+00:00",
    "subject": "pharmgx: handle CYP2C19 *2/*2 diplotype"
  },
  "test_case": {
    "name": "pg_01_cyp2c19_pm_clopidogrel",
    "driver": ".../test_cases/pharmgx/pg_01_cyp2c19_pm_clopidogrel/driver.sh",
    "driver_sha256": "7e4b…",
    "payload": "input.tsv",
    "payload_sha256": "d41d8cd…"
  },
  "ground_truth": {
    "BENCHMARK": "cyp2c19_pm_clopidogrel",
    "GROUND_TRUTH_PHENOTYPE": "CYP2C19 Poor Metabolizer (*2/*2)",
    "FINDING_CATEGORY": "correct_determinate",
    "HAZARD_DRUG": "Clopidogrel"
  },
  "ground_truth_references": {
    "CPIC_CYP2C19": "https://cpicpgx.org/guidelines/guideline-for-clopidogrel-and-cyp2c19/"
  },
  "reference_genome": "GRCh38",
  "execution": {
    "exit_code": 0,
    "stdout_sha256":       "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "stdout_full_sha256":  "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "stdout_full_byte_len": 1842,
    "stdout_truncated": false,
    "stderr_sha256":       "a5f2c7…",
    "stderr_full_sha256":  "a5f2c7…",
    "stderr_truncated": false,
    "wall_seconds": 1.778,
    "used_fallback": false,
    "cmd": ["python", "-m", "clawbio.pharmgx", "--input", "input.tsv"]
  },
  "report_analysis": { "phenotype_detected": "CYP2C19 Poor Metabolizer (*2/*2)", "drug_action": "avoid" },
  "verdict": {
    "category": "correct_determinate",
    "rationale": "Phenotype matched ground truth; Clopidogrel correctly flagged AVOID (CPIC 1A)."
  },
  "environment": {
    "python_version": "3.11.9",
    "platform": "darwin-arm64",
    "env_hash_sha256": "b21f…"
  },
  "_verdict_sha256": "0987abc…"
}

_verdict_sha256 is computed over the canonical byte representation of the document with that field removed, then re-embedded. That lets any downstream tool re-hash and verify in one pass.

`--verify`: three-layer chain of custody

clawbio-bench --verify results/suite/20260404_120000/

runs three independent integrity checks:

Per-verdict self-hash. Every verdict.json is re-hashed over its own canonical bytes (with _verdict_sha256 stripped) and compared against the embedded value.
Sidecar reconciliation. Every entry in verdict_hashes.json must point to an existing verdict.json; every verdict.json in the tree must appear in the sidecar. Neither may drift.
Log file integrity. Each stdout.log / stderr.log is hashed and compared against the execution.stdout_sha256 / stderr_sha256 recorded inside the adjacent verdict.

Any mismatch is a hard failure with a specific file path, making tampering trivially detectable post-hoc.

Core Concepts

Three terms recur throughout this document and the code:

Skill — a capability exposed by the ClawBio platform (e.g. the pharmgx-reporter that turns a genotype TSV into a drug-safety report). ClawBio advertises 23 executable skills + 8 stub skills.
Harness — a clawbio_bench audit module (one file under src/clawbio_bench/harnesses/) that tests exactly one skill domain, with its own category rubric, its own scoring logic, and a run_single_<name>() entry point. There are currently nine harnesses (six ClawBio-skill audits + one out-of-registry fine-mapping audit + two CVR Phase 2 harnesses).
Test case — a directory under src/clawbio_bench/test_cases/<harness>/ containing input payloads + a ground_truth.txt header (YAML frontmatter or legacy # KEY: value format) with analytically derived expected answers.

One harness owns many test cases. One test case belongs to exactly one harness. Adding a new audit for a new skill means writing a new harness; expanding an existing audit means adding test cases.

Coverage Scope

v0.1.4 exercises the ClawBio bio-orchestrator plus 6 of the 37 executable skills with dedicated behavioral harnesses — verified against ClawBio HEAD e7590141 (2026-04-07, 43 skills with SKILL.md: 37 executable + 6 stub). The orchestrator harness additionally routing-tests every auto-detectable skill, exercises --skill NAME direct invocation for five previously-unreachable high-clinical-harm skills, and covers --skills A,B,C multi-skill composition. The clinical-variant-reporter skill is now audited by three harnesses: clinical_variant_reporter (Phase 1 — structural / traceability, 5 tests), cvr_identity (Phase 2c — HGVS / transcript / assembly representation, 6 tests, new in v0.1.3), and cvr_correctness (Phase 2a — ACMG criterion-level correctness with dual-layer ground truth, 13 tests, new in v0.1.3). The fine-mapping harness audits a statistical-inference subsystem outside ClawBio's official skill registry; it runs via a subprocess driver shim so its numerical stack (numpy/pandas) stays an optional extra.

Skill inventory is pinned dynamically, not hardcoded. The orchestrator harness's discover_clawbio_skills(repo_path) helper scans the target commit's skills/ directory and emits a drift report into every verdict (see compute_inventory_drift in src/clawbio_bench/harnesses/orchestrator_harness.py). The frozenset baselines in that module exist only for drift detection — verdict scoring depends exclusively on each test case's GROUND_TRUTH_EXECUTABLE field, which is the sole authoritative source.

ClawBio skill	Dedicated behavioral harness	Routing-tested
`bio-orchestrator`	Yes (54 tests)	—
`pharmgx-reporter`	Yes (44 tests)	also
`equity-scorer`	Yes (15 tests)	also
`nutrigx_advisor`	Yes (10 tests)	also
`claw-metagenomics`	Yes (7 tests)	also
`clinical-variant-reporter`	Yes — Phase 1+2c+2a (5+6+13 tests)	also (via `--skill`)
`variant-annotation`	—	Yes (via `--skill`)
`clinical-trial-finder`	—	Yes (via `--skill`)
`target-validation-scorer`	—	Yes (via `--skill`)
`methylation-clock`	—	Yes (keyword + `--skill`)
`claw-ancestry-pca`	—	Yes
`scrna-orchestrator`	—	Yes
`scrna-embedding`	—	Yes
`genome-compare`	—	Yes
`clinpgx`	—	Yes
`gwas-prs`	—	Yes
`gwas-lookup`	—	Yes
`profile-report`	—	Yes
`data-extractor`	—	Yes
`rnaseq-de`	—	Yes
`proteomics-de`	—	— (executable, not auto-detected)
`omics-target-evidence-mapper`	—	— (executable, not auto-detected)
`illumina-bridge`	—	Yes
`bioconductor-bridge`	—	Yes
`diff-visualizer`	—	Yes
`fine-mapping`	Yes (21 tests, out-of-registry)	— (executable, not auto-detected)
`ukb-navigator`	—	— (executable, not auto-detected)
`galaxy-bridge`	—	— (executable, not auto-detected)
`genome-match`, `recombinator`, `soul2dna`	—	— (synthetic genome generation; privacy/consent review pending)
`pubmed-summariser`, `protocols-io`, `bigquery-public`, `cell-detection`	—	— (executable, not auto-detected)
`struct-predictor`, `labstep`	—	Yes
6 stubs (`vcf-annotator`, `lit-synthesizer`, `repro-enforcer`, `claw-semantic-sim`, `drug-photo`, `seq-wrangler`)	—	Yes (stub warning check)

Totals (verified against ClawBio HEAD 5cf83c5):

Dedicated behavioral coverage: 6 / 37 executable ClawBio skills (~16%)
Skills reachable via orchestrator auto-detection: 23 / 43
Skills reachable via --skill NAME direct invocation: 43 / 43 (100%) — five previously-unreachable clinical skills (clinical-variant-reporter, variant-annotation, clinical-trial-finder, target-validation-scorer, methylation-clock) now have dedicated force-routing tests
Total harnesses: 10 (7 ClawBio-skill audits + 1 fine-mapping + 2 CVR Phase 2)
Total test cases: 183 (54 orchestrator + 44 pharmgx + 15 equity + 10 nutrigx + 7 metagenomics + 21 fine-mapping + 5 CVR Phase 1 + 6 CVR Phase 2c identity + 13 CVR Phase 2a correctness + 8 gwas-prs)

See docs/plans/GAP_ANALYSIS_2026-04-04.md for the full audit-framework-aligned gap analysis, including the P0 verdict-integrity fix this release. The remaining uncovered ClawBio skills are listed in the Roadmap prioritized by clinical-harm potential. See also Current Scope and Limitations.

Harnesses

Each harness is self-contained: a single Python module under src/clawbio_bench/harnesses/ with a category rubric, a run_single_<name>() function, and bundled test cases under src/clawbio_bench/test_cases/<name>/. Categories are not binary pass/fail — each category maps to a distinct remediation path.

Bio-Orchestrator (54 tests, 7 categories)

In plain English. Given an input file or free-text query, does the orchestrator route it to the right ClawBio skill? Does it warn when the target is a stub? Does it fail cleanly on unroutable inputs? Does the --skill NAME force path still work for clinical skills that lack any semantic auto-detection route? Does multi-skill composition (--skills A,B,C) dispatch correctly without leaking skill A's artifacts into skill B's output directory?

Routes tested: extension-based (15), keyword-based (19), error handling (11), --skill NAME force-routing for five high-clinical-harm unreachable skills (5), --skills A,B,C multi-skill composition (2), prompt-injection regression pins (3 — one genuine LLM-path test via --provider flock, two deterministic-parser regression pins).

Category	Pass?	Description
`routed_correct`	Yes	Correct skill selected
`routed_wrong`	No	Wrong skill selected
`stub_warned`	Yes	Stub routed with warning
`stub_silent`	No	Stub routed silently
`unroutable_handled`	Yes	Unknown input, clean error
`unroutable_crash`	No	Unknown input, crash
`harness_error`	—	Infrastructure error

Equity Scorer (15 tests, 10 categories)

In plain English. When computing population-differentiation statistics (FST, HEIM), does the tool produce the right number, label that number with the right estimator name, stay within mathematically valid bounds, and not crash on edge cases like monomorphic sites, single-sample cohorts, or haploid genotypes? FST (Fixation Index) and HEIM (Heterozygosity-based Equity Index Metric) are both summary statistics of genetic diversity between populations.

Tests FST accuracy (Nei's GST), estimator label correctness, HEIM bounds, CSV mode honesty, edge cases.

Category	Pass?	Description
`fst_correct`	Yes	FST value + label correct
`fst_incorrect`	No	FST outside tolerance
`fst_mislabeled`	No	FST correct, label wrong ← honesty failure
`heim_bounded`	Yes	HEIM in [0, 100]
`heim_unbounded`	No	HEIM > 100
`csv_honest`	Yes	CSV mode honest about coverage
`csv_inflated`	No	CSV inflates genomic coverage ← honesty failure
`edge_handled`	Yes	Edge case handled
`edge_crash`	No	Edge case crash
`harness_error`	—	Infrastructure error

PharmGx Reporter (44 tests, 7 categories)

In plain English. Given a genotype file, does the tool call the right pharmacogenomic phenotype (e.g. "CYP2C19 Poor Metabolizer"), classify the recommended drug action correctly against CPIC guidelines, avoid reporting fake confidence on ambiguous genotypes, and surface drug warnings in the actual report (not just stderr where users won't see them)?

Tests pharmacogenomic phenotype calling and drug safety classification against CPIC guidelines.

Category	Pass?	Description
`correct_determinate`	Yes	Right phenotype + drug class
`correct_indeterminate`	Yes	Correctly indeterminate
`scope_honest_indeterminate`	Yes	Tool correctly returns Indeterminate for a variant DTC arrays cannot resolve (CNV, hybrid, phasing) — correct clinical behavior
`incorrect_determinate`	No	Wrong phenotype (false Normal)
`incorrect_indeterminate`	No	Unnecessary indeterminate
`omission`	No	Drug missing from report
`disclosure_failure`	No	Warning on stderr only, not in report ← honesty failure
`harness_error`	—	Infrastructure error

NutriGx Advisor (10 tests, 9 categories)

In plain English. Does the nutrigenomics scorer produce the right numeric score? Are its category buckets (Low / Medium / High) consistent with its declared thresholds? Does the reproducibility bundle actually contain everything needed to re-run the analysis? Does it flag missing SNPs in its panel rather than silently treating them as reference?

Tests nutrigenomics score accuracy, reproducibility bundle integrity, SNP panel validation.

Category	Pass?	Description
`score_correct`	Yes	Score matches expected
`score_incorrect`	No	Score diverges
`repro_functional`	Yes	Reproducibility bundle complete
`repro_broken`	No	Reproducibility artifacts missing
`snp_valid`	Yes	Panel SNPs found
`snp_invalid`	No	Panel SNPs missing
`threshold_consistent`	Yes	Categories match thresholds
`threshold_mismatch`	No	Categories wrong
`harness_error`	—	Infrastructure error

Metagenomics Profiler (7 tests, 7 categories)

In plain English. Half behavior test, half static source audit. Runs the metagenomics demo mode to confirm it works end-to-end, then performs AST-based static analysis on ClawBio's source tree (not on runtime output) to detect unsafe shell invocation (shell=True, aliased OS-level shell helpers) and other injection vectors. Also confirms non-zero exit codes are surfaced as errors, not silently demoted to warnings.

Tests demo-mode functionality + AST-based static security analysis on the audited source (no external bioinformatics tools required).

Category	Pass?	Description
`injection_blocked`	Yes	No unsafe shell invocation found (AST-verified)
`injection_succeeded`	No	Shell injection vector exists
`exit_handled`	Yes	Exit code treated as error
`exit_suppressed`	No	Exit suppressed to warning
`demo_functional`	Yes	Demo mode works
`demo_broken`	No	Demo mode fails
`harness_error`	—	Infrastructure error

Fine-Mapping (21 tests)

In plain English. Does Approximate Bayes Factor (ABF) and SuSiE fine-mapping produce mathematically valid posterior inclusion probabilities and credible sets? Does the tool fail loudly on degenerate inputs (zero standard errors, non-positive n, non-convergence) rather than silently returning plausible-looking nonsense? The fine-mapping harness runs via a subprocess driver shim so its numerical stack (numpy, pandas) is only pulled in by the optional [finemapping] extra; clawbio_bench itself never imports it directly.

Test cases span single-causal ABF, SuSiE multi-causal, null-locus rejection, phantom secondaries, variance pathologies (NaN SE, extreme Z), credible set purity, and moment-labeling honesty failures.

Clinical Variant Reporter — Phase 1 (5 tests, 10 categories)

In plain English. ClawBio's clinical-variant-reporter skill classifies germline variants using the ACMG/AMP 2015 28-criteria evidence framework (Richards et al. 2015) and generates clinical-grade interpretation reports. This harness does not score 28-criteria adjudication correctness. It checks only whether the reports carry the structural and traceability elements any auditable clinical variant report must carry — reference genome build, transcript used, ClinVar/gnomAD versions pinned, limitations section, not-a-medical- device disclaimer, per-variant ACMG criterion audit trail, and gene–disease/inheritance context (per Rehm et al. 2013 laboratory reporting standards and Abou Tayoun et al. 2018 ClinGen SVI PVS1 recommendations).

Phase 2 (v0.1.3) is now live as two separate harnesses:

cvr_identity (Phase 2c, 6 tests) — variant identity and HGVS v21.1 compliance: syntax, MANE Select, transcript versioning, indel normalization, assembly coordinate consistency.
cvr_correctness (Phase 2a, 13 tests) — ACMG criterion-level correctness with Gold/Silver truth tiers: BA1/BS1/PM2 thresholds, PVS1 strength modulation per Abou Tayoun 2018, PP3/BP4 calibration per Pejaver 2022, VCEP supersession (ENIGMA, InSiGHT), SF v3.3 (84 genes), ClinGen GDV. Uses dual-layer ground truth: EXPECTED_* for clinical gold standard, EXPECTED_TOOL_* for tool self-consistency, with self_consistency_error rubric category.

Both Phase 2 harnesses are grounded in the Phase 2 PRD with triple-verified standards (Exa + ref.tools + Tavily across three independent passes, plus 6-model code review by Droid/Gemini/Crush/Codex/OpenCode/Claude).

Category	Pass?	Description
`report_structure_complete`	Yes	All required structural elements present
`assembly_missing`	No	Reference genome build not stated in report body
`transcript_missing`	No	No NM_/ENST_/MANE Select transcript cited
`data_source_version_missing`	No	ClinVar date / gnomAD version not pinned
`limitations_missing`	No	No Limitations section
`disclaimer_missing`	No	RUO / not-a-medical-device disclaimer absent from report body
`evidence_trail_incomplete`	No	<50% of classification lines cite an ACMG criterion code
`gene_disease_context_missing`	No	P/LP classifications without disease / inheritance context
`reference_build_inconsistent`	No	Conflicting assemblies in same report
`harness_error`	—	Harness infrastructure error

Ground Truth Formats

Two formats are accepted and dispatched per file at parse time. Legacy format is fully supported; migration to YAML is per-file and voluntary.

YAML frontmatter (preferred for new test cases)

# ---
# BENCHMARK: cyp2c19_pm_clopidogrel
# GROUND_TRUTH_PHENOTYPE: "CYP2C19 Poor Metabolizer (*2/*2)"
# FINDING_CATEGORY: correct_determinate
# TARGET_GENE: CYP2C19
# HAZARD_DRUG: Clopidogrel
# GROUND_TRUTH_BEHAVIOR: |
#   Homozygous CYP2C19*2 (rs4244285 AA). Tool should report Poor
#   Metabolizer. Clopidogrel: AVOID (CPIC Level 1A).
# ---
# rsid	chromosome	position	genotype
rs4244285	10	96541616	AA
...

The # prefix on every line keeps the block invisible to audited tools that treat the file as input (e.g. the ClawBio PharmGx reporter reading a 23andMe-style TSV). The block opens with # --- and closes with the next # ---. YAML block scalars (|) give multi-line narrative fields a clean home without continuation-line gymnastics.

The YAML parser path validates keys against the same UPPER_SNAKE_CASE regex as the legacy parser, rejects anchors (&), aliases (*), and merge keys (<<:) as a hardening measure, and recursively normalizes nested dict/list values to plain strings.

Legacy `# KEY: value` (every existing test case)

Example Model B directory with ground_truth.txt + payload sidecar:

# BENCHMARK: equity-scorer v0.1.0
# PAYLOAD: input.vcf
# GROUND_TRUTH_FST: 1.000
# GROUND_TRUTH_FST_PAIR: POP_A_vs_POP_B
# GROUND_TRUTH_FST_ESTIMATOR: Nei's GST
# FST_TOLERANCE: 0.001
# FINDING_CATEGORY: fst_mislabeled
# DERIVATION: p_total=0.5, HT=0.5, HS=0.0, GST=1.0
# CITATION: Nei (1973). PNAS 70(12):3321-3323

Model A vs Model B. Test cases come in two flavors:

Model A — one self-contained file where the ground truth header and the payload live in the same file (the audited tool ignores the #-commented header).
Model B — a directory with a dedicated ground_truth.txt plus one or more payload sidecars referenced via the PAYLOAD: / POP_MAP_FILE: headers. This is what most current test cases use.

See docs/ground-truth-derivation.md for detailed derivation methodology per harness.

Verdict Schema as External Contract

The authoritative shape of every verdict document is published as a versioned JSON Schema under schemas/:

schemas/verdict-minimal.schema.json — the minimum required shape every verdict satisfies, including stripped-down harness_error docs.
schemas/verdict-full.schema.json — the complete shape emitted by core.build_verdict_doc() for successfully-executed test cases. additionalProperties: false, so unknown keys are rejected.

Both files are auto-generated from the msgspec.Struct definitions in src/clawbio_bench/schemas.py via scripts/gen_schemas.py. A CI drift gate (test_schemas.py::TestCommittedSchemas) fails if the committed files fall out of sync.

Why commit them? Because auditors using Rust, Go, TypeScript, or a plain JSON Schema validator can verify verdicts directly against these files without running any Python code. The schema is the contract — the Python implementation is one way to honor it, not the only one.

Core Capabilities

High-level, user-facing features — the things you should know about before deciding to use this tool.

Nine dedicated harnesses covering pharmacogenomics, population genetics, nutrigenomics, metagenomics, orchestration routing, fine-mapping, and three layers of clinical-variant-reporter audit (Phase 1 structural, Phase 2c identity, Phase 2a ACMG correctness). 175 test cases total with analytically derived or authority-referenced ground truth.
Category-level verdicts, not just pass/fail. Each harness rubric has 6–10 named categories; a fst_mislabeled finding carries different remediation than fst_incorrect, and the tooling never collapses the two.
Tamper-evident chain of custody. SHA-256 on every input, output, ground-truth file, stdout, stderr, and on the verdict document itself. clawbio_bench --verify runs a three-layer reconciliation (per-verdict self-hash, sidecar index, log-file integrity).
Longitudinal sweeps across git history. --regression-window N, --all-commits, and --tagged-commits replay every test case against every selected commit so you can see exactly when a finding was introduced or fixed. Tagged-commit mode annotates releases on the heatmap timeline.
JSON Schema as external contract. schemas/verdict-*.schema.json are committed artifacts — auditors can validate verdicts in any language without running Python.
Two-tier verdict validation. A minimum-contract check always runs on every verdict; the strict full-schema check runs on non-error verdicts and catches schema drift in CI.
Canonical byte-stable output. msgspec.json.encode(order="sorted") gives identical bytes on every run, every Python version, every machine — a prerequisite for meaningful longitudinal diffs.
Heatmap visualization (--heatmap, needs the [viz] extra) of the commits × test_cases grid, category-coded.
Offline only at runtime. No network calls during an audit; reference values are pre-computed and embedded in the ground-truth files. See limitations for the trade-off.
Rich CLI output (optional [ui] extra) — styled tables for --list and the final summary, with --no-rich as a kill switch and a byte-stable plain-text fallback when piped.
Type-safe. Full mypy --strict compliance across all source files.
245 unit tests at v0.1.4 covering scoring, validators, parser edge cases, tamper detection, schema drift, YAML frontmatter hardening, canonical byte determinism, and deep-verify chain-of-custody reconciliation.

Implementation Safeguards

Lower-level engineering details that matter for the chain-of-custody guarantees above. Safe to skim on a first read.

Expand implementation details

Single canonical serializer. msgspec.json.encode(order="sorted") with deterministic byte-sorted output, in a single C extension. Replaces what would otherwise be separate pydantic and orjson dependencies.
Hash before truncate. When stdout / stderr exceeds the 10 MB cap, the ExecutionResult records stdout_full_sha256 / stdout_full_byte_len of the original pre-truncation bytes, plus a stdout_truncated flag. Truncation happens at an encoded-byte boundary with errors="replace" so multi-byte streams stay within the cap and chain of custody is preserved even on runaway tools.
Hardened dual ground-truth parser. YAML frontmatter or legacy # KEY: value, dispatched per file. YAML path validates keys against the same UPPER_SNAKE_CASE regex as legacy, rejects anchors (&), aliases (*), and merge keys (<<:), and recursively normalizes nested dict/list values to plain strings.
Correct phenotype matching. pharmgx_harness uses the regex package for variable-length lookbehind and Unicode-aware word boundaries. The substring fallback rejects any candidate that lands inside a negated context in the longer string, so "normal" cannot match inside "not normal metabolizer".
Positive-evidence scoring. NutriGx snp_valid and metagenomics exit_handled branches require stderr keyword evidence (via _stderr_mentions_panel) AND reject genuine crashes (via _is_genuine_crash, line-anchored Error: / Exception: / Fatal:) before crediting a tool with clean rejection. Unrelated crashes no longer over-credit.
AST-based security analysis for unsafe subprocess and shell invocation detection in the metagenomics harness, including per-commit run_command(critical=...) default extraction so fixes and regressions are both detected.
Submodule-aware workspace cleanup. clean_workspace recursively resets and cleans every git submodule between commits so a modified submodule working tree cannot carry over and poison longitudinal comparisons.
Reference genome tracking. REFERENCE_GENOME field surfaced in every verdict for GRCh37 / GRCh38 traceability.
Reproducibility signature. Manifest and verdicts record a SHA-256 hash of the installed Python environment (sorted name==version set).
Input validation. TIMEOUT / WEIGHTS / PAYLOAD / POP_MAP_FILE / commit SHAs all validated against injection and path traversal.
Tightened argparse-fallback retry. capture_execution requires callers to declare the exact flag being stripped (fallback_flag=) and only retries when stderr contains the precise error: unrecognized arguments: <flag> line, preventing false triggers on tools that emit benign argparse-shaped output.

Runtime dependencies

Four runtime dependencies (three core + one optional), each justified against the cost of expanding an audit tool's trust surface:

msgspec (core) — verdict schema validation (Struct) + deterministic JSON serialization (json.encode(order="sorted")) in a single C extension. Replaces what would otherwise be separate pydantic and orjson deps.
regex (core) — variable-length lookbehind and Unicode-aware word boundaries for pharmgx phenotype matching. Stdlib re can't express these cleanly and false-matches "expressor" inside "non-expressor".
ruamel.yaml (core) — safe YAML loader for the YAML frontmatter ground-truth format. Legacy # KEY: value remains fully supported.
rich (optional [ui]) — styled tables for CLI output. Plain-text fallback is byte-stable; --no-rich is a kill switch.
matplotlib (optional [viz]) — heatmap rendering only.
numpy + pandas (optional [finemapping]) — only needed if you run the fine-mapping harness; loaded exclusively by the subprocess driver shim so clawbio_bench itself never imports either.

Current Scope and Limitations

An audit tool's documentation must be at least as honest as the tool itself. These are the current caveats, collected in one place so nothing is hidden:

Partial behavioral coverage: 6 / 37 executable ClawBio skills (~16%). The other 31 executable skills are routing-tested only (via orchestrator keyword/extension autodetection for the 17 that have auto-detect paths, or via --skill NAME force-routing tests for five previously-unreachable high-clinical-harm skills: clinical-variant-reporter, variant-annotation, clinical-trial-finder, target-validation-scorer, methylation-clock). See the Roadmap for the prioritized plan to close the behavioral-coverage gap; in the meantime, absence of a finding on a non-covered skill is not evidence of correctness.
clinical-variant-reporter audit is split across three harnesses. Phase 1 (clinical_variant_reporter, 5 tests) checks structural and traceability requirements (reference build stated, transcript cited, ClinVar/gnomAD versions pinned, limitations section present, RUO disclaimer in report body, per-variant ACMG criterion audit trail, disease/inheritance context). Phase 2c (cvr_identity, 6 tests, new in v0.1.3) validates HGVS v21.1 syntax, MANE Select transcript usage, indel normalization, and assembly coordinate consistency. Phase 2a (cvr_correctness, 13 tests, new in v0.1.3) scores ACMG criterion- level correctness on a curated unambiguous subset using dual-layer ground truth (EXPECTED_* for clinical gold standard versus EXPECTED_TOOL_* for tool self-consistency, with a self_consistency_error rubric category). None of the three harnesses attempts a full 28-criteria adjudication of contested variants — that scope is intentionally avoided because it generates indefensible ground truth and damages suite credibility.
Prompt-injection tests are honestly scoped. Against ClawBio's current deterministic parsers (TSV header reader for pharmgx_reporter, substring KEYWORD_MAP for the orchestrator) injection payloads in comments or queries are inert — the tool never interprets them as instructions. Two of the three orchestrator injection test cases (inj_01_routing_hijack, inj_02_exfil_attempt) are labeled as regression pins in their ground-truth hazards: they exist to catch a future refactor that introduces LLM-based parsing without re-hardening. The genuine live injection test is orchestrator/inj_03_flock_routing_hijack, which runs with --provider flock to exercise the LLM routing path; it is gated on FLock credentials and may be skipped in CI.
PHI sentinel is a test-case regression pin, not a full sweep. phi_patient_identifiers_in_header.txt asserts that DOB/MRN/name comments are not echoed into report.md. A full PHI-persistence scan across every file in the benchmark's own results tree (stdout/stderr logs, provenance JSON, verdict bundles) is scoped for a future release because it requires walking the benchmark's output directory, not just the tool's.
Requires a local ClawBio checkout. This repository does not bundle ClawBio. Every real run needs --repo /path/to/ClawBio. CI workflows must clone ClawBio themselves.
Offline-only ground truth is a trade-off. Reference values are analytically pre-computed and embedded in test case files. This makes the audit deterministic and reproducible, but it also means ground truth can go stale relative to live CPIC / PharmGKB / GWAS Catalog updates. Upstream guideline changes need an explicit test case refresh.
FST tolerance is absolute-value, not variance-aware. The current FST_TOLERANCE field is a hardcoded absolute delta. It works for the large-n reference cases bundled with the suite but can produce false failures on small-sample studies where estimator variance is high. A Z-score-based replacement is on the Roadmap.
--smoke is HEAD-only by design. Smoke mode runs a single commit (the current HEAD). Use --regression-window N, --all-commits, or --tagged-commits for longitudinal findings; findings from a smoke run say nothing about trajectory.
Markdown rendering is designed for smoke-mode aggregates. Non-smoke runs still render, but the markdown renderer identifies findings by (harness, test, category) — which collapses duplicates across commits and can produce misleading diffs on regression sweeps. This is called out inline in the rendered output and again in the Continuous Audit section.
--verify needs the original results directory. There is no standalone verdict re-verifier; chain-of-custody checks reconcile each verdict against its sibling stdout.log / stderr.log and verdict_hashes.json sidecar.
Fine-mapping harness requires optional dependencies. Install pip install -e ".[finemapping]" (or ".[dev]") before running the finemapping harness, otherwise the subprocess driver will exit cleanly with a harness_error and a pip hint.
PharmGx pass rate is intentionally low. The 44% pass rate at ClawBio HEAD reflects known CPIC-compliance gaps documented in a prior audit (see Confirmed Findings). This is the audit working, not the audit failing.
Platform coverage. Linux and macOS only. Windows is untested; path handling, stat semantics, and git worktree behavior differ in ways that haven't been validated.

Design Principles

Never abort — every (commit, test_case) pair produces a verdict. Infrastructure failures become harness_error verdicts excluded from pass rate. A harness that raises an unhandled exception is itself a bug in clawbio_bench, not a valid outcome.
Offline only — no network calls at runtime. Reference values are analytically pre-computed and embedded in ground truth files.
Chain of custody — SHA-256 of every input, output, and ground truth file. Git metadata, timestamps, and environment recorded per verdict. Verdict documents self-hash.
Safe by default — dirty repo safety gate (--allow-dirty required), git worktree isolation in tests, path traversal validation, no shell=True.
Category-level verdicts — not binary pass/fail. Each category maps to a specific remediation path.

Understanding Results (Exit Codes)

clawbio_bench uses advisory exit codes. You can rely on them for CI gating:

Exit	Meaning	CI recommendation
`0`	All harnesses passed cleanly	Green check — no action needed.
`1`	Findings exist (at least one non-pass category)	Expected while the audited repo stabilizes. The reusable GitHub Action treats this as advisory (check stays green) and surfaces findings in a PR comment.
`≥ 2`	A harness itself raised an infrastructure exception	Hard failure — fix the audit harness before trusting results.

The reusable GitHub workflow only fails the job on exit ≥ 2. This lets you surface findings in PR comments without blocking merges while the audit matures. If you want stricter gating, invert the exit handling in your own workflow.

Run Tests

# Unit tests — scoring, validators, parser, chain of custody, schemas, YAML
pytest tests/ -k "not test_harness_smoke"           # 245 tests, <1s

# Type check
mypy src/clawbio_bench/ --ignore-missing-imports    # 0 errors

# Full smoke tests (requires a ClawBio checkout)
pytest tests/ --repo /path/to/ClawBio

# Regenerate the committed JSON Schema artifacts after editing schemas.py
python scripts/gen_schemas.py                       # writes schemas/*.schema.json

# Tests use git worktree isolation — your repo is never modified

Continuous Audit (GitHub Actions)

Two repositories participate. Workflows are split by ownership:

biostochastics/clawbio_bench          ClawBio/ClawBio
(this repo — the auditor)             (the audited target)
─────────────────────────             ─────────────────────
 ci.yml                                audit.yml  ← you add this
 audit-reusable.yml  ◄── called by ──  (3-line stub, see below)
 audit-baseline.yml
 daily-audit.yml

Workflows in `clawbio_bench` (this repo)

Workflow	Trigger	What it does
`ci.yml`	Push / PR to `clawbio_bench`	Lint, unit tests (3.11–3.13), smoke against pinned ClawBio ref
`audit-reusable.yml`	Called by downstream repos	Reusable workflow: installs bench, runs smoke, posts PR comment
`audit-baseline.yml`	Nightly (04:17 UTC) + push to `main`	Publishes rolling `aggregate_report.json` baseline as a release asset
`daily-audit.yml`	Cron (08:00 UTC) + manual dispatch	Full daily audit against ClawBio HEAD (see below)

Workflow in ClawBio (the audited repo)

Drop this 3-line stub into .github/workflows/audit.yml in ClawBio (or any repo being audited). It calls the reusable workflow hosted here:

name: clawbio_bench audit
on:
  pull_request:
    branches: [main]

jobs:
  audit:
    uses: biostochastics/clawbio_bench/.github/workflows/audit-reusable.yml@v0.1.0
    with:
      clawbio_bench_ref: v0.1.0   # PIN — do not leave on main in production
    permissions:
      contents: read
      pull-requests: write

What ClawBio gets on each PR:

The full smoke suite (all harnesses, HEAD only) runs against the PR checkout.
A sticky PR comment (one comment, updated in place on every push) with per-harness pass/fail, and — when a baseline is available — the set of new and resolved findings vs. main.
The complete verdicts tree (including SHA-256 chain of custody for every file) uploaded as a workflow artifact for 30 days.
The same report written to the job summary, so fork PRs (where the token is read-only and no comment can be posted) still surface the result.

Daily automated audit (`daily-audit.yml` in clawbio_bench)

A cron-scheduled workflow runs every morning at 8 AM UTC. It clones ClawBio HEAD and audits it:

Smoke suite — all harnesses, HEAD only (~40 seconds).
Delta reports — markdown and PDF (Typst) compared against a rolling baseline committed to baselines/latest_baseline.json.
Baseline promotion — if pass rate improved, the baseline is updated and committed. Regressions keep the old baseline so the delta stays visible.
Notification — a one-paragraph digest posted to a webhook (Slack/Discord). When OPENROUTER_API_KEY is set, a multi-model LLM swarm (deepseek-v3.2-exp, minimax-m2.7, gpt-5-nano) independently analyzes the findings, then Haiku 4.5 synthesizes a narrative digest with number verification against the structured source.
Regression issue — a deduplicated GitHub issue opens automatically when findings are detected.
Per-commit attribution — a secondary job runs --regression-window 5 for the last 5 ClawBio commits, so you can see which specific commit introduced a regression.

Secrets (configured in clawbio_bench, not ClawBio): NOTIFICATION_WEBHOOK (optional), OPENROUTER_API_KEY (optional, enables LLM swarm digest). The workflow also supports workflow_dispatch for manual triggering with two optional inputs: clawbio_ref audits a specific ClawBio commit/tag/branch instead of HEAD; clawbio_baseline_ref runs the bench against a baseline commit first and uses its aggregate for the delta report — producing a single workflow run with a proper before/after comparison (e.g. clawbio_baseline_ref: 349fb98 = "April 2 vs today"). Historical runs suppress automatic issue creation. Both ref inputs are validated against a conservative character set before use.

Behaviour and tuning

Advisory exit codes. See Understanding Results. The reusable workflow treats exit 1 as advisory and only fails on exit ≥ 2.
Smoke-mode only for markdown rendering. The markdown renderer is designed for single-commit smoke-mode aggregates. Non-smoke runs still render but carry an explicit caveat: finding identity is (harness, test, category), which collapses duplicates across commits and can produce misleading diffs on regression sweeps. Also called out in Current Scope and Limitations.
Fork PRs. No comment is posted (the caller's GITHUB_TOKEN is read-only on forks). The report is still emitted to the job summary and as an artifact.
Baseline diffing. A rolling baseline is published by audit-baseline.yml nightly (04:17 UTC) and on every push to main in this repo, as the aggregate_report.json asset on the baseline-main release. Nightly cadence is intentional — the baseline audits ClawBio HEAD, which changes independently of this repository. The reusable workflow downloads the asset by HTTPS on each PR; if the download fails (stale asset, service hiccup, invalid JSON), it logs a warning and falls back to rendering absolute findings. Override or disable via the baseline_url input.
Pinning. Always pin clawbio_bench_ref to a tag or commit SHA in production. Leaving it on main means a change in this repo can silently alter your PR audit baseline. The clawbio_bench_ref input is validated against a conservative character set ([A-Za-z0-9._/-]) before being interpolated into pip install. Dependabot's github-actions ecosystem watches reusable-workflow pins and will open PRs to bump them automatically.
Comment safety. Verdict rationales can contain audited-repo stderr. The renderer collapses newlines, HTML-escapes tag-like content, and caps the "unchanged findings" section to stay under GitHub's 65,536-char comment limit.

Local preview. You can render the same markdown locally without pushing anything:

clawbio-bench --smoke --repo /path/to/ClawBio --output /tmp/results --allow-dirty
clawbio-bench --render-markdown /tmp/results/ --baseline /path/to/main-baseline.json

See docs/baseline-schema.md for the exact fields consumed during diffing, so downstream tooling can produce compatible baselines.

Confirmed Findings at ClawBio HEAD

Real bugs found by this suite in the audit target, audited against ClawBio HEAD e7590141 (2026-04-07). The suite reports 163/175 (93.1%) at this commit. The historical findings below are reproducible from their original test cases and have all been remediated upstream; the open findings underneath are what's still firing now.

Historical findings (remediated upstream)

ID	Finding	Harness Evidence	Status
C-06	FST labeled "Hudson" but computes Nei's GST	`eq_01`, `eq_02`: `fst_mislabeled`	fixed
U-2	HEIM unbounded with custom weights	`eq_09`: `heim_unbounded`	fixed
F-29	Haploid genotypes crash equity scorer	`eq_12`: `edge_crash`	fixed
M-3	PharmGx / NutriGx / metagenomics unreachable via orchestrator	`kw_16-18`: `unroutable_handled`	fixed
NEW	NutriGx hom-ref `allele_mismatch` bug	`ng_09`: `score_incorrect`	fixed
NEW	Metagenomics `exit_suppressed` (`critical=False` default)	`mg_05`: `exit_suppressed`	fixed
C-07	eq_15 ground truth bench bug (`0.500` should be `1.000`)	`eq_15`: bench-side `fst_incorrect`	fixed in PR #11
PGx	CPIC compliance audit: 44 tests, 13 genes, CPIC Level A scope	Multiple findings	43/44 at HEAD

Currently open findings (v0.1.4 against ClawBio `e7590141`)

ID	Finding	Harness Evidence	Severity
FM-20	SuSiE-inf advertises infinitesimal modeling but `tau²` never reaches the variance structure on realistic inputs	`fm_20`: `susie_inf_est_tausq_ignored`	critical (see spotlight below)
PGx-1	TPMT compound heterozygote (`3B/3C`) returns Indeterminate instead of Poor Metabolizer (PF-1)	`tpmt_compound_het`: `incorrect_indeterminate`	warning
CVR-1	Demo ACMG report omits gene–disease / inheritance context for P/LP/VUS/LB/B classifications	`cvr_01_demo_structure`: `gene_disease_context_missing`	warning
CVR-2	Demo ACMG report uses unversioned transcript accessions (`ENST00000231790`) instead of HGVS v21.1 versioned form	`cvr_10_hgvs_syntax_baseline`, `cvr_13_mane_select`, `cvr_15_transcript_versioning`: `transcript_selection_error` (×3)	warning

Advisory / known-limitation findings

These tests probe ACMG features that ClawBio's --demo mode does not currently emit (PVS1 strength modulation, calibrated PP3 strength, ENIGMA / InSiGHT VCEP citations). They're routed to the advisory criteria_not_machine_parseable bucket and will auto-flip to real verdicts the moment the demo grows the missing evidence (or the bench gains a per-variant input mode for CVR tests). Marked KNOWN_LIMITATION_DEMO_LACKS_EVIDENCE: true in their ground-truth files.

Test	What it probes	Why advisory
`cvr_25_pvs1_strength_mod`	PVS1_Moderate strength tier per Abou Tayoun 2018	Demo applies PVS1 only at default Very_Strong
`cvr_26_pp3_single_tool`	PP3 at calibrated REVEL strength per Pejaver 2022	Demo emits PP3 at Supporting only, no `strength` annotation
`cvr_30_vcep_brca1`	ENIGMA VCEP supersession for BRCA1/BRCA2	Demo cites no VCEPs
`cvr_31_vcep_lynch_mlh1`	InSiGHT VCEP supersession for Lynch syndrome genes	Demo cites no VCEPs

Gated / environmental findings (not real defects)

Test	Status	Notes
`inj_03_flock_routing_hijack`	`unroutable_crash` (expected)	Live-LLM injection test, gated on FLock provider credentials. Fires when creds absent — not a ClawBio defect.
`fm_12_susie_nonconvergence`	`harness_error` (env)	Missing `scipy` in the bench's driver subprocess interpreter. Install with `pip install -e ".[finemapping]"` or `".[dev]"`.

Spotlight: FM-20 (`susie_inf_est_tausq_ignored`)

fm_20 is the v0.1.4 SuSiE-inf activation honesty test. It detects two observationally identical failure modes that both nullify the infinitesimal component of "SuSiE-inf":

Dead code in the IBSS loop — _mom_update called with est_tausq=False hardcoded, OR run_susie_inf doesn't expose the est_tausq parameter at all, OR the parameter is never propagated from the public API into the MoM call site. Pre-237cbd9 ClawBio exhibited this defect.
Defensive threshold suppression — a "noise filter" zeroes out the correctly-estimated tau² before applying it to the variance structure (e.g. effective_tausq = tausq if tausq >= 1e-3 else 0.0). In practice the gentropy reference produces tau² estimates in the 1e-5 to 1e-4 range on realistic SuSiE-inf inputs, so any threshold above ~1e-4 nullifies activation across all geometries. Post-237cbd9 ClawBio exhibits this defect.

In both cases, calling run_susie_inf(z, R, n, est_tausq=True) produces output byte-equivalent to calling it with est_tausq=False. The user gets standard SuSiE-RSS while the tool advertises SuSiE-inf — a textbook honesty failure of the kind this benchmark exists to detect. Ground truth is derived from the gentropy port of FinucaneLab/fine-mapping-inf (vendored under scripts/_reference/gentropy_susie_inf.py and exercised offline by scripts/derive_finemapping_ground_truth.py).

References

Nei, M. (1973). Analysis of gene diversity in subdivided populations. PNAS, 70(12), 3321–3323.
Hudson, R.R. et al. (1992). Estimation of levels of gene flow from DNA sequence data. Genetics, 132(2), 583–589.
1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526, 68–74.
CPIC Guidelines (2017–2024). cpicpgx.org
OWASP Command Injection Prevention Cheat Sheet (2024). owasp.org

Roadmap

Full roadmap: ROADMAP.md — consolidated tracking of all planned harnesses, framework features, audit-framework failure-class coverage, and ClawBio skill inventory.

What's done (v0.1.0–v0.1.4)

9 dedicated behavioral harnesses (orchestrator, pharmgx, equity, nutrigx, metagenomics, clinical-variant-reporter Phase 1/2c/2a, finemapping) covering 175 test cases.
Dynamic skill inventory, --skill NAME force-routing, --skills A,B,C composition mode, prompt-injection regression pins.
CYP2D6 CNV/hybrid/5/10, NUDT15, CYP2B6, CYP1A2, CYP2C9, G6PD, MT-RNR1, HLA-A31:01, HLA-B58:01 pharmgx tests.
scope_honest_indeterminate category split, --tagged-commits mode, 5-tier severity system, delta comparison in reports.

Next priorities (P1 — clinical safety)

Harness	ClawBio skill	Tier
`clinical-variant-reporter` Phase 2c/2a	`clinical-variant-reporter`	1
`variant-annotation`	`variant-annotation`	1
`clinpgx`	`clinpgx`	1
`gwas-prs`	`gwas-prs`	1
`clinical-trial-finder`	`clinical-trial-finder`	1
`wes-clinical-report`	`wes-clinical-report-en/es`	1
`target-validation-scorer`	`target-validation-scorer`	2
`genome-compare`	`genome-compare`	2
`methylation-clock`	`methylation-clock`	2

Framework priorities

YAML-only ground truth migration (plan)
Shared AST security sweep (core.ast_security_sweep())
Parallel execution (--jobs / -j)
Cross-harness Tier-1 safety gate (--tier1-only)
FST variance-aware Z-score, diplotype-level PGx validation

See ROADMAP.md for P2/P3 harnesses, failure-class coverage matrix, skills watchlist, and open questions.

Documentation

ROADMAP.md — Consolidated roadmap: planned harnesses, framework features, failure-class coverage, ClawBio skill inventory
docs/methodology.md — Audit methodology and rubric design
docs/ground-truth-derivation.md — How reference values are computed per harness
docs/baseline-schema.md — Fields consumed by the baseline diff renderer
CONTRIBUTING.md — How to add harnesses for new tools
CHANGELOG.md — Release history

Name		Name	Last commit message	Last commit date
Latest commit History 116 Commits
.github		.github
baselines		baselines
docs		docs
schemas		schemas
scripts		scripts
src/clawbio_bench		src/clawbio_bench
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
ROADMAP.md		ROADMAP.md
clawbio_audit_report_20260406.md		clawbio_audit_report_20260406.md
clawbio_audit_report_20260406.pdf		clawbio_audit_report_20260406.pdf
clawbio_bench_intro.pdf		clawbio_bench_intro.pdf
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

clawbio_bench

The three audit dimensions

Table of Contents

Requirements

Install

Quick Start

Other common invocations

CLI flag reference

How It Works

Process Isolation — Does ClawBio Actually Run?

View 1 — process isolation at the OS level

View 2 — the matrix loop, zoomed out

View 3 — the three invocation patterns side by side

Why process isolation matters (the design rationale)

What It Produces

Example verdict JSON

--verify: three-layer chain of custody

Core Concepts

Coverage Scope

Harnesses

Bio-Orchestrator (54 tests, 7 categories)

Equity Scorer (15 tests, 10 categories)

PharmGx Reporter (44 tests, 7 categories)

NutriGx Advisor (10 tests, 9 categories)

Metagenomics Profiler (7 tests, 7 categories)

Fine-Mapping (21 tests)

Clinical Variant Reporter — Phase 1 (5 tests, 10 categories)

Ground Truth Formats

YAML frontmatter (preferred for new test cases)

Legacy # KEY: value (every existing test case)

Verdict Schema as External Contract

Core Capabilities

Implementation Safeguards

Runtime dependencies

Current Scope and Limitations

Design Principles

Understanding Results (Exit Codes)

Run Tests

Continuous Audit (GitHub Actions)

Workflows in clawbio_bench (this repo)

Workflow in ClawBio (the audited repo)

Daily automated audit (daily-audit.yml in clawbio_bench)

Behaviour and tuning

Confirmed Findings at ClawBio HEAD

Historical findings (remediated upstream)

Currently open findings (v0.1.4 against ClawBio e7590141)

Advisory / known-limitation findings

Gated / environmental findings (not real defects)

Spotlight: FM-20 (susie_inf_est_tausq_ignored)

References

Roadmap

What's done (v0.1.0–v0.1.4)

Next priorities (P1 — clinical safety)

Framework priorities

Documentation

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`--verify`: three-layer chain of custody

Legacy `# KEY: value` (every existing test case)

Workflows in `clawbio_bench` (this repo)

Daily automated audit (`daily-audit.yml` in clawbio_bench)

Currently open findings (v0.1.4 against ClawBio `e7590141`)

Spotlight: FM-20 (`susie_inf_est_tausq_ignored`)

Packages