Skip to content

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874

Closed
noahgift wants to merge 4 commits into
mainfrom
fix/apr-eval-inference-failure-pmat-702
Closed

fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874
noahgift wants to merge 4 commits into
mainfrom
fix/apr-eval-inference-failure-pmat-702

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.

The bug

When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal, easy to miss in automated JSON parsing.

Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):

```json
{
"passed": 164,
"pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
"mode": "structural"
}
```
exit code: `0`

Fix

`crates/apr-cli/src/commands/eval/inference.rs`:

  • `run_humaneval` (the bug site, ~line 134): emit structured `mode: "inference_failed"` JSON with `inference_error` populated and pass counters all zero, then return `CliError::InferenceFailed` (exit code 8).
  • `run_mbpp` (~line 1513): MBPP already returned Err on failure but didn't emit JSON. Aligned to the same shape per FT-EVAL-FAILURE-003 parity falsifier.

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:

  • Removed misleading "falling back to structural validation" log line.

Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):

  • 3 equations: pass@k definition, failure-signal coupling, per-problem counter invariant
  • 4 falsifiers: FT-EVAL-FAILURE-001 (broken model→0), 002 (healthy model→real), 003 (humaneval/mbpp parity), 004 (no dataset-side marking)
  • 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001

Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

```json
{
"passed": 0,
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
"mode": "inference_failed",
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)

Cascade context

This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.

Test plan

  • `cargo build --release --features cuda -p apr-cli` — clean
  • `cargo fmt -p apr-cli --check` — clean
  • `pv validate contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` — clean
  • FT-EVAL-FAILURE-001 verified on gx10 (gibberish checkpoint → `pass@1=0.0`, exit 8)
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: run on a known-healthy 7B Q4K teacher to verify FT-EVAL-FAILURE-002 (real `pass@k` still works)

🤖 Generated with Claude Code

…@1=1.0 on broken models (PMAT-702)

Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval`
path silently "fell back to structural validation" when inference failed for
all samples — marking every problem with a non-empty canonical_solution as
pass=1. On a completely broken model (e.g. the 10K Stage D gibberish
checkpoint that hid the no-KD training run), this produced:

  pass_at_k[0].rate = 1.0  (164/164 problems "passed")
  mode = "structural"
  exit_code = 0

…with only the `mode` field signaling anything was wrong. JSON-parsing CI
gates and eval dashboards that key off `pass@1` saw a fully-passing model
when nothing had actually run.

## Fix

`crates/apr-cli/src/commands/eval/inference.rs`

* `run_humaneval` (the bug site, ~line 134): when inference fails for all
  samples, emit a structured `mode: "inference_failed"` JSON with
  `inference_error` populated and pass counters all zero, then return
  `CliError::InferenceFailed` (exit code 8). No more dataset-side
  pass-marking.
* `run_mbpp` (~line 1513): MBPP was already returning Err on failure but
  did not emit JSON. Aligned to the same shape for parity per the
  contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`,
  `inference_error` populated, exit code 8.

`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`

* Removed misleading "falling back to structural validation" log line.
  The new caller path is the source of truth; the loop just aborts.

## Contract

`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml`

* 3 equations: `pass_at_k_definition`, `inference_failure_signal`,
  `per_problem_pass_counter_invariant`.
* 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0;
  healthy model → real pass@k; humaneval/mbpp parity; no dataset-side
  pass-marking.
* 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001.
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:

  "passed": 0
  "pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...]
  "mode": "inference_failed"
  "inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"

…and exit code 8 (CliError::InferenceFailed). Pre-fix the same command
emitted pass@1=1.0 with exit code 0.

## Why this matters

This was the third defect in the original PMAT-701 finding. Without it,
the Phase 4 Stage D no-KD training run would have continued to pass
HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B
landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) +
PR #1871 (spec amendment + dispatch default), the full distill→eval
pipeline is now honest end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 22, 2026 07:47
noahgift added a commit that referenced this pull request May 22, 2026
scripts/dispatch-distill-phase-5-humaneval.sh runs apr eval --task
humaneval on a Stage D output checkpoint. With PMAT-702 (PR #1874) in
main, the eval no longer falls back to structural validation with a
fake pass@1=1.0 false positive on broken models -- inference failure
now returns exit code 8 with mode=inference_failed.

Target per SPEC-DISTILL-001 AC-DISTILL-004: pass@1 >= 25% (loose ship
threshold; competitive 0.5B 30-40%; upstream 7B teacher 91%).

Env vars: CHECKPOINT (required), SAMPLES, DEVICE, TEMPERATURE, TOP_P,
HUMANEVAL_JSONL, DRY_RUN. Estimated wall time on GB10: 5-8 h.

Falsifier: mode=inference_failed or pass@1=0.0 with non-zero exit
means re-train. With #1874 effective this signal is accurate; pre-fix
the structural fallback masked it as pass@1=1.0.

QA: bashrs lint 0 errors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip
  experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift

Copy link
Copy Markdown
Contributor Author

Subsumed by #1897 (PMAT-702..705 distill cascade bundle for v0.35.x hiatus-prep). Squash-merge preserves the per-PR commit message — see commit log on #1897.

@noahgift noahgift closed this May 23, 2026
auto-merge was automatically disabled May 23, 2026 04:37

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant