fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS) — revert Bug B's slow path (PMAT-704) #1879
Closed
noahgift wants to merge 3 commits into
…LAS) — revert Bug B's slow path (PMAT-704)

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) routed Q4K teachers to `RealizarQ4KTeacher`, a CPU-heavy forward path (layer-norm + attention + softmax all on CPU; only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization: empirical proof the path is unusable as a real distillation teacher.

Five-whys (evidence/distill-7b-cublas-cudatrainer/findings.json):

* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim was based on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer, never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.

## Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's availability. The DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends. Replaces the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.

`contracts/apr-distill-teacher-backend-selection-v1.yaml`:

* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

Dispatch log shows the new path firing:

    [PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
    [CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
    ✓ Loaded pre-trained weights successfully (APR)
    ✓ 28 transformer blocks uploaded to GPU

GPU utilization observed at 96% during training (was 0% on the RealizarQ4KTeacher path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.

## Cascade context

This is the fifth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

The original Bug B contract (cuda-q4k-frozen-teacher-v1.yaml) is **demoted, not retracted**: its math is correct as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn. The realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization, which is empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704): cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order of operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
auto-merge was automatically disabled May 23, 2026 04:37 (pull request was closed)
Summary
Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) was a wrong turn. It routed Q4K teachers to `RealizarQ4KTeacher`, which runs layer-norm + attention + softmax on CPU (only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof the path is unusable as a real distillation teacher.
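Five-whys
* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim rested on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as an actual OOM-killer event, and never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.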
Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.
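The "fits in 128 GB" claim can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, the parameter count is taken as a nominal 7e9, and the 4.5 bits-per-weight figure for Q4K is an assumption based on the usual 256-weight, 144-byte super-block layout. It counts weights only and ignores activations, the student, and optimizer state.

```rust
/// Bytes needed to store `params` weights at `bits_per_weight` bits each.
fn weight_bytes(params: u64, bits_per_weight: f64) -> f64 {
    params as f64 * bits_per_weight / 8.0
}

/// Convert bytes to decimal gigabytes (1 GB = 1e9 bytes).
fn to_gb(bytes: f64) -> f64 {
    bytes / 1e9
}

/// True when the teacher's F32 weights plus a caller-chosen reserve
/// (student, activations, and so on) fit in `total_gb` of memory.
/// `reserve_gb` is a free parameter of this sketch, not a measured value.
fn f32_teacher_fits(params: u64, total_gb: f64, reserve_gb: f64) -> bool {
    to_gb(weight_bytes(params, 32.0)) + reserve_gb <= total_gb
}
```

With a nominal 7e9 parameters, `to_gb(weight_bytes(7_000_000_000, 32.0))` gives the 28 GB quoted in Why 3, against roughly 3.9 GB at 4.5 bits per weight. On a 128 GB unified-memory device that leaves about 100 GB for everything else.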
Fix
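`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:
* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's availability. The DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends, replacing the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.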
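A minimal sketch of how the `APR_DISTILL_TEACHER_BACKEND` dispatch could be expressed, not the code from this PR. The enum, the error type, and both function names are hypothetical; only the variable name and its three values come from the description. Rejecting unknown values with an error is an assumption, since the PR does not say how they are handled.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical enum naming the two teacher backends described in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TeacherBackend {
    CudaTrainer,
    RealizarQ4K,
}

/// Hypothetical error carrying the unrecognized value (assumed behaviour).
#[derive(Debug, PartialEq, Eq)]
struct UnknownBackend(String);

impl FromStr for TeacherBackend {
    type Err = UnknownBackend;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            // `auto` and `cudatrainer` both resolve to the cuBLAS path.
            "auto" | "cudatrainer" => Ok(Self::CudaTrainer),
            // Opt-in fallback for memory-constrained devices.
            "realizar-q4k" => Ok(Self::RealizarQ4K),
            other => Err(UnknownBackend(other.to_string())),
        }
    }
}

/// Pure selection logic: an unset variable behaves like `auto`.
fn select_backend(env_value: Option<&str>) -> Result<TeacherBackend, UnknownBackend> {
    match env_value {
        None => Ok(TeacherBackend::CudaTrainer),
        Some(v) => v.parse(),
    }
}

/// Thin wrapper that reads the real environment variable.
fn backend_from_env() -> Result<TeacherBackend, UnknownBackend> {
    let raw = env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    select_backend(raw.as_deref())
}
```

Keeping `select_backend` free of environment access makes the first two falsifiers in the contract (default routes to CudaTrainer; env override reaches Realizar) checkable without mutating process state.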
Contract
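`contracts/apr-distill-teacher-backend-selection-v1.yaml` (validates clean: `pv validate` reports 0 errors, 0 warnings):
* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001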
The original Bug B contract (`cuda-q4k-frozen-teacher-v1.yaml`) is demoted, not retracted — its math holds as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.
Verification on gx10
Dispatch log:
```
[PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
[CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
✓ 28 transformer blocks uploaded to GPU
```
GPU utilization observed at 96% during training (was 0% on the Realizar path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.
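The `[PMAT-703]` line in the log truncates teacher logits from 152064 to 151936 entries per position. A backend-agnostic wrapper of that kind might look like the sketch below. The `Teacher` trait, its method signatures, and the row-major logits layout are all hypothetical, and the sketch assumes the teacher's extra vocabulary entries sit at the tail of each row, so keeping a prefix preserves the token ids shared with the student.

```rust
/// Hypothetical teacher interface: `forward` returns row-major logits of
/// shape [tokens.len(), vocab_size()].
trait Teacher {
    fn vocab_size(&self) -> usize;
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Wraps any teacher and keeps only the first `target_vocab` logits per row.
struct TruncatingTeacher<T: Teacher> {
    inner: T,
    target_vocab: usize,
}

impl<T: Teacher> TruncatingTeacher<T> {
    /// Returns `None` when the target is zero or wider than the teacher's
    /// native vocabulary, since truncation cannot widen a row.
    fn new(inner: T, target_vocab: usize) -> Option<Self> {
        if target_vocab == 0 || target_vocab > inner.vocab_size() {
            return None;
        }
        Some(Self { inner, target_vocab })
    }
}

impl<T: Teacher> Teacher for TruncatingTeacher<T> {
    fn vocab_size(&self) -> usize {
        self.target_vocab
    }

    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let native = self.inner.vocab_size();
        let keep = self.target_vocab;
        let full = self.inner.forward(tokens);
        // Slice each native-width row down to the student's vocabulary width.
        full.chunks_exact(native)
            .flat_map(move |row| row[..keep].iter().copied())
            .collect()
    }
}
```

Because the wrapper implements the same trait it wraps, the dispatch code can apply it to either backend without per-backend truncation logic, which is the design point the Fix section makes.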
Cascade context
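Fifth fix in the PMAT-701 family:
- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)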
Test plan
🤖 Generated with Claude Code