
docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn#1880

Closed
noahgift wants to merge 4 commits into main from docs/spec-distill-postmortem-pmat-704

Conversation

@noahgift

Contributor

Summary

Documents the root cause of the PMAT-704 cascade fix (PR #1879). Adds §87 to `docs/specifications/aprender-train/distillation-epic-spec.md` explaining that PR #1869 (Bug B / `RealizarQ4KTeacher`) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10.

What the §87 amendment covers

Spec versioning

Test plan

  • Markdown renders cleanly
  • Section ordering matches existing § convention
  • CI: `ci / gate` + `workspace-test` green (docs-only PR; should be fast)

🤖 Generated with Claude Code

…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip
  experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.
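
The backend selection described in the fix-references bullet can be sketched roughly as follows. This is a minimal illustration, not the code from PR #1879: the `TeacherBackend` enum, the `select_backend` and `backend_from_env` functions, and the choice to reject unrecognized values are all assumptions. Only the variable name `APR_DISTILL_TEACHER_BACKEND`, the value `realizar-q4k`, and the cuBLAS default come from the text above.

```rust
use std::env;

/// Hypothetical enum; the real type names in the codebase are not shown in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TeacherBackend {
    /// Default teacher backend after PMAT-704.
    Cublas,
    /// Opt-in fallback (RealizarQ4KTeacher).
    RealizarQ4k,
}

/// Pure selection logic, kept separate from environment access so it is easy to test.
/// Assumption: an unset or empty value means "use the default", and any value other
/// than "realizar-q4k" is rejected rather than silently ignored.
pub fn select_backend(value: Option<&str>) -> Result<TeacherBackend, String> {
    match value.map(str::trim) {
        None | Some("") => Ok(TeacherBackend::Cublas),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4k),
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other:?}"
        )),
    }
}

/// Reads the opt-in variable from the process environment.
pub fn backend_from_env() -> Result<TeacherBackend, String> {
    let raw = env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    select_backend(raw.as_deref())
}
```

Keeping the default on the cuBLAS branch means a misconfigured or missing variable can never route a run onto the CPU-bound path by accident; the fallback has to be requested by name.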

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift force-pushed the docs/spec-distill-postmortem-pmat-704 branch from e26ac1e to 4560521 on May 22, 2026 14:12
@noahgift

Contributor Author

Subsumed by #1898 (mega-bundle hiatus close-out). Squash-merge preserves the per-PR commit message — see #1898 commit log.

@noahgift noahgift closed this May 23, 2026
auto-merge was automatically disabled May 23, 2026 07:09

Pull request was closed

noahgift added a commit that referenced this pull request May 23, 2026
, #1896, #1897) (#1898)

* docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn (#1880)

* chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in (#1883)

* chore(distill): Phase 5 HumanEval dispatch wrapper (#1886)

* chore: bundle PMAT-702..705 distill cascade + clippy fix (#1897)

* fix(cli): point 7B qwen models to single-file GGUF artifacts and align caches (#1891)

* fix(chat): preserve original path in FileNotFound for filesystem paths

PR #1891 wrapped all path_arg through HF alias resolution. For inputs
that look like filesystem paths (absolute or starts with ./, ../) and
don't exist, the alias resolver was rewriting them as hf:// URIs and
returning a mangled path in the FileNotFound error.

Fix: short-circuit with the original path_arg in the error BEFORE alias
resolution kicks in. Preserves the contract that test_run_file_not_found
and test_run_nonexistent_path_without_trace assert.
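
A rough sketch of the short-circuit described above, under stated assumptions: `ResolveError`, `looks_like_fs_path`, `resolve_alias`, and `resolve_model_arg` are hypothetical names, and `resolve_alias` is only a stand-in for the real HF alias resolver, whose behaviour is not shown in this PR. The point illustrated is the ordering, with the existence check on path-like inputs running before any alias rewriting.

```rust
use std::path::Path;

/// Hypothetical error type; only the FileNotFound case matters for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    FileNotFound(String),
}

/// True when the argument looks like a filesystem path rather than a model alias:
/// absolute, or starting with "./" or "../".
pub fn looks_like_fs_path(arg: &str) -> bool {
    Path::new(arg).is_absolute() || arg.starts_with("./") || arg.starts_with("../")
}

/// Stand-in for the real alias resolver (assumption): existing files pass through
/// untouched, everything else is rewritten to an hf:// URI.
fn resolve_alias(arg: &str) -> String {
    if Path::new(arg).exists() {
        arg.to_string()
    } else {
        format!("hf://{arg}")
    }
}

/// Resolves a CLI model argument. Path-like inputs that do not exist fail early,
/// carrying the ORIGINAL argument, so the error never contains a rewritten URI.
pub fn resolve_model_arg(path_arg: &str) -> Result<String, ResolveError> {
    if looks_like_fs_path(path_arg) && !Path::new(path_arg).exists() {
        return Err(ResolveError::FileNotFound(path_arg.to_string()));
    }
    Ok(resolve_alias(path_arg))
}
```

With this ordering, an input such as `./missing/model.gguf` is reported back verbatim, while a bare alias still reaches the resolver.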

Closes the workspace-test failure on bundle PR #1898.