fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702) by noahgift · Pull Request #1874 · paiml/aprender

noahgift · 2026-05-22T07:47:41Z

Summary

Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.

The bug

When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal, easy to miss in automated JSON parsing.

Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):

```json
{
"passed": 164,
"pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
"mode": "structural"
}
```
exit code: `0`

Fix

`crates/apr-cli/src/commands/eval/inference.rs`:

`run_humaneval` (the bug site, ~line 134): emit structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8).
`run_mbpp` (~line 1513): MBPP already returned Err on failure but didn't emit JSON. Aligned to the same shape per FT-EVAL-FAILURE-003 parity falsifier.

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:

Removed misleading "falling back to structural validation" log line.

Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):

3 equations: pass@k definition, failure-signal coupling, per-problem counter invariant
4 falsifiers: FT-EVAL-FAILURE-001 (broken model→0), 002 (healthy model→real), 003 (humaneval/mbpp parity), 004 (no dataset-side marking)
2 Kani harnesses; qa_gate F-EVAL-FAILURE-001

Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

```json
{
"passed": 0,
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
"mode": "inference_failed",
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)

Cascade context

This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.

Test plan

`cargo build --release --features cuda -p apr-cli` — clean
`cargo fmt -p apr-cli --check` — clean
`pv validate contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` — clean
FT-EVAL-FAILURE-001 verified on gx10 (gibberish checkpoint → `pass@1=0.0`, exit 8)
CI: `ci / gate` + `workspace-test` green
Follow-up: run on a known-healthy 7B Q4K teacher to verify FT-EVAL-FAILURE-002 (real `pass@k` still works)

🤖 Generated with Claude Code

…@1=1.0 on broken models (PMAT-702) Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval` path silently "fell back to structural validation" when inference failed for all samples — marking every problem with a non-empty canonical_solution as pass=1. On a completely broken model (e.g. the 10K Stage D gibberish checkpoint that hid the no-KD training run), this produced: pass_at_k[0].rate = 1.0 (164/164 problems "passed") mode = "structural" exit_code = 0 …with only the `mode` field signaling anything was wrong. JSON-parsing CI gates and eval dashboards that key off `pass@1` saw a fully-passing model when nothing had actually run. ## Fix `crates/apr-cli/src/commands/eval/inference.rs` * `run_humaneval` (the bug site, ~line 134): when inference fails for all samples, emit a structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8). No more dataset-side pass-marking. * `run_mbpp` (~line 1513): MBPP was already returning Err on failure but did not emit JSON. Aligned to the same shape for parity per the contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`, `inference_error` populated, exit code 8. `crates/apr-cli/src/commands/eval/code_eval.rs:425-428` * Removed misleading "falling back to structural validation" log line. The new caller path is the source of truth; the loop just aborts. ## Contract `contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` * 3 equations: `pass_at_k_definition`, `inference_failure_signal`, `per_problem_pass_counter_invariant`. * 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0; healthy model → real pass@k; humaneval/mbpp parity; no dataset-side pass-marking. * 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001. * Validates clean: `pv validate` reports 0 errors, 0 warnings. ## Verification on gx10 `evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`: "passed": 0 "pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...] "mode": "inference_failed" "inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)" …and exit code 8 (CliError::InferenceFailed). Pre-fix the same command emitted pass@1=1.0 with exit code 0. ## Why this matters This was the third defect in the original PMAT-701 finding. Without it, the Phase 4 Stage D no-KD training run would have continued to pass HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) + PR #1871 (spec amendment + dispatch default), the full distill→eval pipeline is now honest end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

scripts/dispatch-distill-phase-5-humaneval.sh runs apr eval --task humaneval on a Stage D output checkpoint. With PMAT-702 (PR #1874) in main, the eval no longer falls back to structural validation with a fake pass@1=1.0 false positive on broken models -- inference failure now returns exit code 8 with mode=inference_failed. Target per SPEC-DISTILL-001 AC-DISTILL-004: pass@1 >= 25% (loose ship threshold; competitive 0.5B 30-40%; upstream 7B teacher 91%). Env vars: CHECKPOINT (required), SAMPLES, DEVICE, TEMPERATURE, TOP_P, HUMANEVAL_JSONL, DRY_RUN. Estimated wall time on GB10: 5-8 h. Falsifier: mode=inference_failed or pass@1=0.0 with non-zero exit means re-train. With #1874 effective this signal is accurate; pre-fix the structural fallback masked it as pass@1=1.0. QA: bashrs lint 0 errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…g turn Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof of the defect. The amendment includes: * Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18 * Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes. * Fix references: PR #1879 (PMAT-704) — cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback. * Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted). * Methodology lesson: cheap-experiment-before-design discipline. * Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879. Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-23T04:37:19Z

Subsumed by #1897 (PMAT-702..705 distill cascade bundle for v0.35.x hiatus-prep). Squash-merge preserves the per-PR commit message — see commit log on #1897.

noahgift enabled auto-merge (squash) May 22, 2026 07:47

Merge branch 'main' into fix/apr-eval-inference-failure-pmat-702

cf0b4ea

Merge branch 'main' into fix/apr-eval-inference-failure-pmat-702

60b9df3

noahgift mentioned this pull request May 22, 2026

chore(distill): Phase 5 HumanEval dispatch wrapper #1886

Closed

Merge branch 'main' into fix/apr-eval-inference-failure-pmat-702

ccb5c04

noahgift mentioned this pull request May 23, 2026

chore: bundle PMAT-702..705 distill cascade (subsumes #1874, #1877, #1879, #1881) #1897

Closed

noahgift closed this May 23, 2026

auto-merge was automatically disabled May 23, 2026 04:37
Pull request was closed

This was referenced May 23, 2026

chore: mega-bundle hiatus close-out (subsumes #1880, #1883, #1886, #1891, #1896, #1897) #1898

Merged

release: v0.35.2 — hiatus close-out drain #1899

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874
noahgift wants to merge 4 commits into
mainfrom
fix/apr-eval-inference-failure-pmat-702

noahgift commented May 22, 2026

Uh oh!

noahgift commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 22, 2026

Summary

The bug

Fix

Contract

Verification on gx10

Cascade context

Test plan

Uh oh!

noahgift commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant