fix(distill): default Q4K teacher to CudaTrainerTeacher (cuBLAS) — revert Bug B's slow path (PMAT-704) by noahgift · Pull Request #1879 · paiml/aprender

noahgift · 2026-05-22T10:37:31Z

Summary

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) was a wrong turn. It routed Q4K teachers to `RealizarQ4KTeacher`, which runs layer-norm + attention + softmax on CPU (only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof the path is unusable as a real distillation teacher.

Five-whys

1. 7B 500-step validation hung at step 0 → `RealizarQ4KTeacher` forward is CPU-bound; GPU stayed at 0% the entire run.
2. `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD → only individual Q4K matmuls dispatch to GPU.
3. PR #1869 picked that path → to avoid F32 dequant at upload (claimed 28 GB inflation → Linux OOM-kill).
4. That claim was wrong post-PMAT-701 → single SIGKILL observation with `MANAGED_MEMORY=1` was never verified as OOM-killer via dmesg; the cuBLAS path was never re-tested under PMAT-701 Bug A's autodetect default.
5. Why I committed before verifying → cascade momentum; a one-line dispatch flip would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.
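
The fourth why turns on a check that was never run: whether the SIGKILL actually came from the kernel OOM killer. A minimal sketch of that check follows, assuming the kernel log has already been captured as text (for example from `dmesg`) and that OOM kills show up as lines containing `Out of memory: Killed process` or `oom-kill:`, which is the wording recent Linux kernels use. The function name and the sample log lines are hypothetical and not taken from the gx10 run.

```rust
/// Returns the kernel-log lines that look like OOM-killer events.
/// Assumption: the log text was captured beforehand (e.g. from `dmesg`).
fn oom_kill_lines(kernel_log: &str) -> Vec<&str> {
    kernel_log
        .lines()
        .filter(|line| {
            line.contains("Out of memory: Killed process") || line.contains("oom-kill:")
        })
        .collect()
}

fn main() {
    // Hypothetical sample log, for illustration only.
    let sample = "[   10.0] usb 1-1: new device\n\
                  [   99.5] Out of memory: Killed process 4242 (apr)\n";
    let hits = oom_kill_lines(sample);
    if hits.is_empty() {
        println!("no OOM-killer evidence; the SIGKILL needs another explanation");
    } else {
        for line in hits {
            println!("OOM-killer event: {line}");
        }
    }
}
```

An empty result would not prove the absence of memory pressure, but it would have been enough to stop treating the OOM-kill hypothesis as established.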

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.
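
The sizing behind that conclusion is simple arithmetic. A rough sketch, assuming about 7 billion teacher parameters stored at 4 bytes each after F32 dequantization and decimal gigabytes; the 128 GB figure is the one quoted above, and the student model, activations, and optimizer state are deliberately left out:

```rust
const GB: u64 = 1_000_000_000;

/// Bytes needed to hold `params` weights as F32 (4 bytes each).
fn f32_footprint_bytes(params: u64) -> u64 {
    params * 4
}

fn main() {
    // Assumption: "7B" taken literally as 7e9 parameters.
    let teacher_params: u64 = 7_000_000_000;
    let teacher_bytes = f32_footprint_bytes(teacher_params);
    let unified_memory = 128 * GB;

    println!("teacher F32 footprint: {} GB", teacher_bytes / GB);
    println!(
        "left in 128 GB unified memory: {} GB",
        (unified_memory - teacher_bytes) / GB
    );
}
```

Under these assumptions the dequantized teacher takes 28 GB and leaves 100 GB of unified memory for everything else.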

Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

New env var `APR_DISTILL_TEACHER_BACKEND` ∈ {`auto` (default), `cudatrainer`, `realizar-q4k`}.
Default dispatch: `CudaTrainerTeacher` (cuBLAS, F32 dequant) for all teacher types.
`RealizarQ4KTeacher` retained as opt-in fallback for memory-constrained dGPUs.
Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends (supersedes PR #1877's planned per-backend truncation).
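
The dispatch rule in this list can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `run_cuda_backend`: the `TeacherBackend` enum and `select_backend` are hypothetical names, and rejecting unknown values with an error is an assumption (the PR does not say how unrecognized values are handled).

```rust
use std::env;

/// Teacher backends named in this PR. The enum is a hypothetical stand-in
/// for the real types in apr-cli.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TeacherBackend {
    CudaTrainer, // cuBLAS, F32 dequant at upload (default)
    RealizarQ4K, // Q4K-native, opt-in fallback
}

/// Maps the raw env var value to a backend.
/// Unset and "auto" both select the cuBLAS default.
fn select_backend(raw: Option<&str>) -> Result<TeacherBackend, String> {
    match raw {
        None | Some("auto") | Some("cudatrainer") => Ok(TeacherBackend::CudaTrainer),
        Some("realizar-q4k") => Ok(TeacherBackend::RealizarQ4K),
        // Assumption: unknown values are rejected rather than silently defaulted.
        Some(other) => Err(format!(
            "unknown APR_DISTILL_TEACHER_BACKEND value: {other:?}"
        )),
    }
}

fn main() {
    let raw = env::var("APR_DISTILL_TEACHER_BACKEND").ok();
    match select_backend(raw.as_deref()) {
        Ok(backend) => println!("teacher backend = {backend:?}"),
        Err(msg) => eprintln!("{msg}"),
    }
}
```

Keeping the mapping separate from the `env::var` read makes the first two falsifiers (default routes to CudaTrainer, override reaches Realizar) checkable without touching process state.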

Contract

`contracts/apr-distill-teacher-backend-selection-v1.yaml` (validates clean):

3 equations: backend dispatch, forward latency invariant, Bug B demotion
4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
2 Kani harnesses; qa_gate F-BACKEND-SELECT-001

The original Bug B contract (`cuda-q4k-frozen-teacher-v1.yaml`) is demoted, not retracted — its math holds as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.
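
The fourth falsifier (forward parity within the Q4K noise floor) implies comparing logits from the two backends. One plausible shape for such a check is sketched below. The contract's actual metric and tolerance are not stated here, so the max-absolute-difference metric, the function names, and the tolerance value are all assumptions for illustration.

```rust
/// Largest absolute element-wise difference between two logit vectors.
/// Returns None when the lengths differ, since parity is then undefined.
fn max_abs_diff(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    Some(
        a.iter()
            .zip(b)
            .map(|(x, y)| (x - y).abs())
            .fold(0.0_f32, f32::max),
    )
}

/// True when both vectors have the same length and differ by at most `tol`.
fn within_tolerance(a: &[f32], b: &[f32], tol: f32) -> bool {
    matches!(max_abs_diff(a, b), Some(d) if d <= tol)
}

fn main() {
    // Hypothetical logits from the two backends for the same input.
    let cublas = [1.0_f32, 2.0, 3.0];
    let realizar = [1.0_f32, 2.5, 3.0];
    let tol = 0.5; // placeholder; the real noise floor is not given in this PR
    println!("max diff = {:?}", max_abs_diff(&cublas, &realizar));
    println!("parity holds = {}", within_tolerance(&cublas, &realizar, tol));
}
```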

Verification on gx10

Dispatch log:

```
[PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
[PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
[PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
[CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
✓ 28 transformer blocks uploaded to GPU
```

GPU utilization observed at 96% during training (was 0% on the Realizar path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.
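
The vocab-alignment line in the dispatch log (teacher 152064, student 151936) is what the generic `TruncatingTeacher` wrapper handles. A minimal sketch of the wrapper idea follows. The `Teacher` trait, its `forward` signature, and `DummyTeacher` are hypothetical, and keeping the leading `student_vocab` entries (dropping the tail) is an assumption about how the truncation is done.

```rust
/// Hypothetical minimal teacher interface: one row of logits per call.
trait Teacher {
    fn forward(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Wraps any teacher and keeps only the first `student_vocab` logits.
struct TruncatingTeacher<T: Teacher> {
    inner: T,
    student_vocab: usize,
}

impl<T: Teacher> Teacher for TruncatingTeacher<T> {
    fn forward(&self, tokens: &[u32]) -> Vec<f32> {
        let mut logits = self.inner.forward(tokens);
        // No-op when the teacher vocab is already <= student vocab.
        logits.truncate(self.student_vocab);
        logits
    }
}

/// Stand-in teacher that emits a fixed-width row of zeros.
struct DummyTeacher {
    vocab: usize,
}

impl Teacher for DummyTeacher {
    fn forward(&self, _tokens: &[u32]) -> Vec<f32> {
        vec![0.0; self.vocab]
    }
}

fn main() {
    let teacher = TruncatingTeacher {
        inner: DummyTeacher { vocab: 152_064 },
        student_vocab: 151_936,
    };
    let logits = teacher.forward(&[1, 2, 3]);
    println!("logits per position: {}", logits.len()); // prints 151936
}
```

Because the wrapper is generic over the inner teacher, the same alignment applies whichever backend the dispatch selects.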

Cascade context

Fifth fix in the PMAT-701 family:

#1863 Bug A: allocator autodetect Grace Blackwell
#1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
#1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
#1877 Bug B's vocab alignment (superseded by TruncatingTeacher here)
This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

Test plan

`cargo build --release --features cuda -p apr-cli` — clean
`cargo fmt -p apr-cli --check` — clean
`pv validate contracts/apr-distill-teacher-backend-selection-v1.yaml` — clean
Dispatch markers fire correctly on gx10
GPU utilization 96% during training (vs 0% on Bug B's path)
CI: `ci / gate` + `workspace-test` green
Validation run completion + per-step loss capture as evidence (in flight; Monitor task beaokloec)

🤖 Generated with Claude Code

…LAS) — revert Bug B's slow path (PMAT-704)

Post-mortem of the PMAT-701 cascade revealed PR #1869 (Bug B) routed Q4K teachers to `RealizarQ4KTeacher`, a CPU-heavy forward path (layer-norm + attention + softmax all on CPU; only Q4K matmuls dispatch to GPU). The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization: empirical proof the path is unusable as a real distillation teacher.

Five-whys (evidence/distill-7b-cublas-cudatrainer/findings.json):

* Why 1: 7B 500-step validation hung at step 0 → RealizarQ4KTeacher's forward runs most ops on CPU; GPU stayed at 0%.
* Why 2: realizar's `OwnedQuantizedModelCuda::forward_cuda` is mostly CPU SIMD; only individual Q4K matmuls dispatch to GPU.
* Why 3: PR #1869 picked that path to avoid an F32 dequant at upload (claimed "28 GB inflation + student → Linux OOM-kill").
* Why 4: That claim was based on a single SIGKILL observation with MANAGED_MEMORY=1 explicit, never verified via dmesg as actual OOM-killer, never re-tested under PMAT-701 Bug A's autodetect default.
* Why 5: Cascade momentum + incomplete root-cause discipline. The cheap experiment (one-line dispatch flip) would have rejected the hypothesis in 5 minutes; a multi-PR architectural detour shipped instead.

Root cause: conflated the cuMemAlloc 30 GB ceiling (real, fixed by Bug A) with a step-0 SIGKILL on the explicit-managed path (phantom, never verified). The F32 dequant fits in 128 GB unified memory; cuBLAS is the fast path and the right default.

## Fix

`crates/apr-cli/src/commands/distill.rs::run_cuda_backend`:

* New env var `APR_DISTILL_TEACHER_BACKEND` with values:
  - `auto` (default) → CudaTrainerTeacher (cuBLAS, F32 dequant)
  - `cudatrainer` → CudaTrainerTeacher (explicit)
  - `realizar-q4k` → RealizarQ4KTeacher (memory-constrained-device fallback)
* Q4K detection still happens, but only controls the fallback path's availability; the DEFAULT dispatch is cuBLAS for all teacher types.
* Generic `TruncatingTeacher` wrapper applies PMAT-703 vocab alignment uniformly to both backends. Replaces the per-backend truncation that PR #1877 was planning to add inside `RealizarQ4KTeacher::from_apr_path_with_target_vocab`.

`contracts/apr-distill-teacher-backend-selection-v1.yaml`:

* 3 equations (backend_dispatch, forward_latency_invariant, bug_b_demotion)
* 4 falsifiers: default routes to CudaTrainer; env override reaches Realizar; 7B 500-step completes < 30 min; forward parity within Q4K noise floor
* 2 Kani harnesses; qa_gate F-BACKEND-SELECT-001
* Validates clean: `pv validate` reports 0 errors, 0 warnings.

## Verification on gx10

Dispatch log shows the new path firing:

    [PMAT-704] backend=auto → CudaTrainerTeacher (cuBLAS) [override with APR_DISTILL_TEACHER_BACKEND=realizar-q4k for memory-constrained dGPU]
    [PMAT-703] vocab alignment: teacher native=152064, student=151936 → truncating teacher logits to student vocab
    [PMAT-704] teacher backend = CudaTrainerTeacher [Q4K/Q6K (dequant to F32 at GPU upload; cuBLAS GEMM)]
    [CUDA] cuBLAS initialized — forward TF32 tensor cores (41x vs SIMD)
    ✓ Loaded pre-trained weights successfully (APR)
    ✓ 28 transformer blocks uploaded to GPU

GPU utilization observed at 96% during training (was 0% on the RealizarQ4KTeacher path). cuBLAS is dispatching correctly; the cascade is no longer hanging at the first step.

## Cascade context

This is the fifth fix in the PMAT-701 family:

- #1863 Bug A: allocator autodetect Grace Blackwell
- #1869 Bug B: RealizarQ4KTeacher (now demoted to opt-in fallback)
- #1874 Defect 3 / PMAT-702: apr eval no-fake-pass on broken models
- #1877 Bug B's vocab alignment (superseded by TruncatingTeacher in this PR)
- This PR: cuBLAS default + opt-in Realizar fallback (PMAT-704)

The original Bug B contract (cuda-q4k-frozen-teacher-v1.yaml) is **demoted, not retracted**: its math is correct as a memory-constrained fallback path; its DEFAULT-PATH claim was wrong on unified-memory devices.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn, because the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization: empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704): cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-23T04:37:23Z

Subsumed by #1897 (PMAT-702..705 distill cascade bundle for v0.35.x hiatus-prep). Squash-merge preserves the per-PR commit message — see commit log on #1897.

noahgift enabled auto-merge (squash) May 22, 2026 10:37

This was referenced May 22, 2026

docs(spec): SPEC-DISTILL-001 §87 — PMAT-704 post-mortem on Bug B wrong turn #1880

Closed

feat(distill): wire ProgressCallback into Pipeline — close training-monitoring gap (PMAT-705) #1881

Closed

Merge branch 'main' into fix/distill-teacher-backend-selection-pmat-704

6839ffc

noahgift mentioned this pull request May 22, 2026

chore(distill): Stage D dispatch wrapper with PMAT-701 lessons baked in #1883

Closed

4 tasks

Merge branch 'main' into fix/distill-teacher-backend-selection-pmat-704

ed58fb1

noahgift mentioned this pull request May 23, 2026

chore: bundle PMAT-702..705 distill cascade (subsumes #1874, #1877, #1879, #1881) #1897

Closed

noahgift closed this May 23, 2026

auto-merge was automatically disabled May 23, 2026 04:37
Pull request was closed

This was referenced May 23, 2026

chore: mega-bundle hiatus close-out (subsumes #1880, #1883, #1886, #1891, #1896, #1897) #1898

Merged

release: v0.35.2 — hiatus close-out drain #1899

Merged

noahgift wants to merge 3 commits into main from fix/distill-teacher-backend-selection-pmat-704
