fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874
Closed
noahgift wants to merge 4 commits into
Closed
fix(eval): apr eval no longer reports fake pass@1=1.0 on broken models (PMAT-702)#1874noahgift wants to merge 4 commits into
noahgift wants to merge 4 commits into
Conversation
…@1=1.0 on broken models (PMAT-702)
Defect 3 from the PMAT-701 5-whys cascade. The legacy `apr eval --task humaneval`
path silently "fell back to structural validation" when inference failed for
all samples — marking every problem with a non-empty canonical_solution as
pass=1. On a completely broken model (e.g. the 10K Stage D gibberish
checkpoint that hid the no-KD training run), this produced:
pass_at_k[0].rate = 1.0 (164/164 problems "passed")
mode = "structural"
exit_code = 0
…with only the `mode` field signaling anything was wrong. JSON-parsing CI
gates and eval dashboards that key off `pass@1` saw a fully-passing model
when nothing had actually run.
## Fix
`crates/apr-cli/src/commands/eval/inference.rs`
* `run_humaneval` (the bug site, ~line 134): when inference fails for all
samples, emit a structured `mode: "inference_failed"` JSON with
`inference_error` populated and pass counters all zero, then return
`CliError::InferenceFailed` (exit code 8). No more dataset-side
pass-marking.
* `run_mbpp` (~line 1513): MBPP was already returning Err on failure but
did not emit JSON. Aligned to the same shape for parity per the
contract's FT-EVAL-FAILURE-003 falsifier — `mode: "inference_failed"`,
`inference_error` populated, exit code 8.
`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`
* Removed misleading "falling back to structural validation" log line.
The new caller path is the source of truth; the loop just aborts.
## Contract
`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml`
* 3 equations: `pass_at_k_definition`, `inference_failure_signal`,
`per_problem_pass_counter_invariant`.
* 4 falsifiers (FT-EVAL-FAILURE-001..004) — broken model → pass@1=0;
healthy model → real pass@k; humaneval/mbpp parity; no dataset-side
pass-marking.
* 2 Kani harnesses; qa_gate F-EVAL-FAILURE-001.
* Validates clean: `pv validate` reports 0 errors, 0 warnings.
## Verification on gx10
`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:
"passed": 0
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, ...]
"mode": "inference_failed"
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
…and exit code 8 (CliError::InferenceFailed). Pre-fix the same command
emitted pass@1=1.0 with exit code 0.
## Why this matters
This was the third defect in the original PMAT-701 finding. Without it,
the Phase 4 Stage D no-KD training run would have continued to pass
HumanEval gates with a 1.0 false positive even after PMAT-701 Bug A+B
landed. Combined with PR #1863 (allocator) + PR #1869 (Q4K teacher) +
PR #1871 (spec amendment + dispatch default), the full distill→eval
pipeline is now honest end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
noahgift
added a commit
that referenced
this pull request
May 22, 2026
scripts/dispatch-distill-phase-5-humaneval.sh runs apr eval --task humaneval on a Stage D output checkpoint. With PMAT-702 (PR #1874) in main, the eval no longer falls back to structural validation with a fake pass@1=1.0 false positive on broken models -- inference failure now returns exit code 8 with mode=inference_failed. Target per SPEC-DISTILL-001 AC-DISTILL-004: pass@1 >= 25% (loose ship threshold; competitive 0.5B 30-40%; upstream 7B teacher 91%). Env vars: CHECKPOINT (required), SAMPLES, DEVICE, TEMPERATURE, TOP_P, HUMANEVAL_JSONL, DRY_RUN. Estimated wall time on GB10: 5-8 h. Falsifier: mode=inference_failed or pass@1=0.0 with non-zero exit means re-train. With #1874 effective this signal is accurate; pre-fix the structural fallback masked it as pass@1=1.0. QA: bashrs lint 0 errors. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 22, 2026
…g turn Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof of the defect. The amendment includes: * Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18 * Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes. * Fix references: PR #1879 (PMAT-704) — cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback. * Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted). * Methodology lesson: cheap-experiment-before-design discipline. * Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879. Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
auto-merge was automatically disabled
May 23, 2026 04:37
Pull request was closed
This was referenced May 23, 2026
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Defect 3 from the PMAT-701 5-whys cascade. `apr eval --task humaneval` was silently reporting `pass@1 = 1.0` (164/164 problems "passed") whenever inference failed — a false positive that masked the Phase 4 Stage D no-KD training run from operators for 2 days.
The bug
When inference failed for ALL samples, the legacy code "fell back to structural validation" and marked every problem with a non-empty `canonical_solution` as `pass=1`. The `mode: "structural"` field was the only signal, easy to miss in automated JSON parsing.
Verified on gx10 against the known-gibberish 10K Stage D checkpoint (PMAT-701 origin):
```json
{
"passed": 164,
"pass_at_k": [{ "k": 1, "rate": 1.0 }, ...],
"mode": "structural"
}
```
exit code: `0`
Fix
`crates/apr-cli/src/commands/eval/inference.rs`:
`crates/apr-cli/src/commands/eval/code_eval.rs:425-428`:
Contract
`contracts/apr-eval-humaneval-inference-failure-handling-v1.yaml` (validates clean via `pv validate`):
Verification on gx10
`evidence/apr-eval-inference-failure-pmat-702/launch-after-fix.json`:
```json
{
"passed": 0,
"pass_at_k": [{ "k": 1, "rate": 0.0 }, { "k": 10, "rate": 0.0 }, { "k": 100, "rate": 0.0 }],
"mode": "inference_failed",
"inference_error": "No tokenizer found (no embedded tokenizer and no sibling tokenizer.json)"
}
```
exit code: `8` (`CliError::InferenceFailed`)
Cascade context
This closes the third defect from the original PMAT-701 finding. Combined with #1863 (allocator), #1869 (Q4K teacher), and #1871 (spec + dispatch default), the full distill → eval pipeline is now honest end-to-end on Grace Blackwell.
Test plan
🤖 Generated with Claude Code