docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn #1880
Closed
noahgift wants to merge 4 commits into
7 tasks
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn. The realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with the GPU at 0% utilization, which is empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: two failures were conflated, and the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes was missed.
* Fix references: PR #1879 (PMAT-704) makes cuBLAS the default and demotes RealizarQ4KTeacher to an opt-in fallback selected with APR_DISTILL_TEACHER_BACKEND=realizar-q4k.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`; `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order of operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
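The backend selection described in the fix references (cuBLAS by default, `realizar-q4k` only on explicit opt-in) could be sketched roughly as below. This is a minimal illustration, not the code from PR #1879: the `TeacherBackend` enum, the function names, and the choice to reject unrecognized values with an error are all assumptions. Only the environment variable name and the `realizar-q4k` value come from the commit message.

```rust
/// Hypothetical backend enum; the real type names in the project may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// Default teacher path after PMAT-704.
    CuBlas,
    /// Opt-in fallback (RealizarQ4KTeacher).
    RealizarQ4k,
}

/// Pure selection logic. `raw` is the value of APR_DISTILL_TEACHER_BACKEND,
/// or `None` when the variable is unset.
pub fn select_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw.map(str::trim) {
        // Unset or empty: fall through to the cuBLAS default.
        None | Some("") => Ok(TeacherBackend::CuBlas),
        // Explicit opt-in named in the commit message.
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4k),
        // Assumption: unknown values are rejected rather than silently ignored.
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other}"
        )),
    }
}

/// Thin wrapper that reads the process environment.
pub fn backend_from_env() -> Result<TeacherBackend, String> {
    let raw = std::env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    select_backend(raw.as_deref())
}
```

Keeping the parsing in a pure function separate from the environment read makes the default-versus-opt-in rule testable without mutating process state.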
e26ac1e to 4560521
auto-merge was automatically disabled
May 23, 2026 07:09
Pull request was closed
noahgift added a commit that referenced this pull request on May 23, 2026
, #1896, #1897) (#1898)

* docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn (#1880)
* chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in (#1883)
* chore(distill): Phase 5 HumanEval dispatch wrapper (#1886)
* chore: bundle PMAT-702..705 distill cascade + clippy fix (#1897)
* fix(cli): point 7B qwen models to single-file GGUF artifacts and align caches (#1891)
* fix(chat): preserve original path in FileNotFound for filesystem paths

PR #1891 wrapped all path_arg through HF alias resolution. For inputs that look like filesystem paths (absolute, or starting with ./ or ../) and don't exist, the alias resolver was rewriting them as hf:// URIs and returning a mangled path in the FileNotFound error.

Fix: short-circuit with the original path_arg in the error BEFORE alias resolution kicks in. Preserves the contract that test_run_file_not_found and test_run_nonexistent_path_without_trace assert.

Closes the workspace-test failure on bundle PR #1898.
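The short-circuit described in the fix(chat) commit could look something like the sketch below. It is an illustration under stated assumptions, not the actual patch: `resolve_model_arg`, `looks_like_filesystem_path`, `ResolveError`, and the injected `resolve_alias` callback are hypothetical names, and the real alias resolver is stood in for by a closure.

```rust
use std::path::Path;

/// Hypothetical error type; only the FileNotFound case matters for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    FileNotFound(String),
}

/// An input "looks like" a filesystem path when it is absolute
/// or starts with ./ or ../ (the cases named in the commit message).
pub fn looks_like_filesystem_path(arg: &str) -> bool {
    Path::new(arg).is_absolute() || arg.starts_with("./") || arg.starts_with("../")
}

/// Resolve a model argument. `resolve_alias` stands in for the HF alias
/// resolver (an assumption; the real resolver's signature is not shown).
pub fn resolve_model_arg(
    path_arg: &str,
    resolve_alias: impl Fn(&str) -> String,
) -> Result<String, ResolveError> {
    // Short-circuit BEFORE alias resolution so the error carries the
    // caller's original path rather than a rewritten hf:// URI.
    if looks_like_filesystem_path(path_arg) && !Path::new(path_arg).exists() {
        return Err(ResolveError::FileNotFound(path_arg.to_string()));
    }
    Ok(resolve_alias(path_arg))
}
```

Checking the path shape first means bare names are still handed to the resolver, while a missing `./model.gguf` is reported exactly as the user typed it.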
Summary
Documents the root cause of the PMAT-704 cascade fix (PR #1879). Adds §87 to `docs/specifications/aprender-train/distillation-epic-spec.md` explaining that PR #1869 (Bug B / `RealizarQ4KTeacher`) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10.
What the §87 amendment covers
Spec versioning
Test plan
🤖 Generated with Claude Code