feat(cuda): autodetect Grace Blackwell + Q4K frozen-teacher contract (PMAT-701)#1863
Merged
Merged
Conversation
…her (PMAT-701)
Two compounding bugs prevented the MODEL-1 7B teacher (paiml/qwen2.5-coder-7b-apache-q4k-v1)
from being usable in `apr distill --backend cuda` on Grace Blackwell GB10 despite
the device having 128 GB unified memory.
5-whys investigation (evidence/distill-7b-teacher-loadtest-gx10/findings.json):
Bug A (this PR, FIXED): trueno-gpu allocator default used cuMemAlloc (device-only,
~30 GB cap on GB10) instead of cuMemAllocManaged (full 128 GB unified pool). The
PMAT-394 managed path existed but was gated behind opt-in `MANAGED_MEMORY=1` env
var with no device-class autodetection.
Bug B (contract authored, implementation deferred): cuda training backend
dequantizes Q4K teacher weights to F32 at GPU upload (4 GB → 28 GB inflation),
making 7B teachers infeasible even with managed memory due to Linux OOM-kill.
No Q4K-native frozen-teacher forward path exists.
This PR:
1. Adds `classify_device_memory` + `should_use_managed_memory` to GpuBuffer
(crates/aprender-gpu/src/driver/memory/buffer.rs). Default allocator
queries CU_DEVICE_ATTRIBUTE_INTEGRATED via cuDeviceGetAttribute; integrated
GPUs (Grace, Tegra) route to cuMemAllocManaged, discrete dGPUs (Ada, Hopper,
Ampere) keep cuMemAlloc. Legacy MANAGED_MEMORY=1/0 env var override preserved.
2. Authors `contracts/trueno-gpu/cuda-unified-memory-allocator-v1.yaml`:
classifies device memory architecture, defines 6 falsification tests
(FT-ALLOC-AUTODETECT-001/002, FT-ALLOC-DISPATCH-003/004/005,
FT-ALLOC-DISTILL-7B-006), 2 Kani harnesses, qa_gate F-ALLOC-UNIFIED-001.
3. Authors `contracts/cuda-q4k-frozen-teacher-v1.yaml`: codifies the Q4K
frozen-teacher invariant (no dequant at GPU upload, Q4K-native forward
kernels reuse realizar inference path, type-level no_grad invariant on
CudaBlock::Q4K). 5 falsification tests, 2 Kani harnesses. Implementation
is multi-PR scope (trueno Q4K backward + aprender-train CudaBlock::Q4K
enum variant); contract is the design spec for that follow-up work.
4. Captures 5-whys analysis with verified evidence at three points:
- launch.log: pre-fix OOM at Block 27/28 (cuMemAlloc ceiling)
- launch-managed.log: explicit MANAGED_MEMORY=1, all blocks upload, then
OOM-kill during step (Bug B confirmed independent of Bug A)
- launch-after-fix-a.log: NO env var, all 28 blocks upload via autodetect
(Fix A verified end-to-end on gx10 GB10)
Falsifier FT-ALLOC-DISTILL-7B-006 PASSES post-fix: 7B teacher loads all 28
transformer blocks on GB10 without MANAGED_MEMORY env var. The remaining
SIGKILL is Bug B territory (cuda-q4k-frozen-teacher-v1.yaml).
Practical impact: 1.5B and smaller teachers now usable in `apr distill --backend
cuda` on Grace Blackwell with default settings — unblocking the Phase 4 distill
fix that's been carrying TEACHER=STUDENT=0.5B as a smoke-mode workaround.
Both contracts validate clean: `pv validate` reports 0 errors, 0 warnings.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
noahgift
added a commit
that referenced
this pull request
May 22, 2026
…AT-701 Bug B) (#1869) Implements `contracts/cuda-q4k-frozen-teacher-v1.yaml` (landed in PR #1863). Before this fix, `apr distill --backend cuda` with a Q4K teacher (e.g. the MODEL-1 teacher `paiml/qwen2.5-coder-7b-apache-q4k-v1`) was killed by the Linux OOM killer at the first training step. Even with PR #1863's allocator fix, the legacy `CudaTrainerTeacher` dequantizes Q4K weights to F32 at GPU upload (4 GB on disk → 28 GB F32), and that inflation plus student grads + Adam + activations exceeded the OOM threshold on Grace Blackwell GB10. ## What this PR adds `crates/apr-cli/src/commands/distill_q4k_teacher.rs`: * `RealizarQ4KTeacher` — wraps realizar's `OwnedQuantizedModelCuda` (the same inference-time path validated by `apr run`). Weights live on the GPU in their native Q4K format; forward GEMM uses Q4K-native CUDA kernels. No F32 dequantization at upload, no gradient/optimizer state. * Implements `entrenar_distill::teacher_provider::TeacherLogitsProvider`: `logits_for_batch` delegates per-element to `cuda_model.forward_cuda`. `crates/apr-cli/src/commands/distill.rs`: * `run_cuda_backend` now inspects the teacher .apr's tensor dtype histogram. If any tensor is Q4K or Q6K, route to `RealizarQ4KTeacher`. F32/F16/BF16 teachers continue to use `CudaTrainerTeacher` (the dequant path is harmless for those types). `Cargo.toml`: * Adds `aprender-train-common` (`entrenar_common`) as an optional dep on apr-cli's `training` feature to surface `EntrenarError::Internal` in the new teacher impl. ## Verification on gx10 (Grace Blackwell GB10, sm_121) Captured in `evidence/distill-7b-teacher-loadtest-gx10/launch-after-fix-b.log`: * `[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher` — dispatch fires. * `[PMAT-701] RealizarQ4KTeacher: pre-uploaded ... MB to GPU (Q4K-native, no F32 dequant)` — teacher staged. * `✓ 24 transformer blocks uploaded to GPU` + `✓ GPU training state allocated (LM head: 544.5 MB)` — student loaded. * `✓ Fused gradient clipping: 1506 partials (5.9 KB)` — training state ready. Process then ran for 15 minutes of stable training at **~36 GB system memory** (well under the 122 GB MemAvailable ceiling), terminated by the test's `timeout 900` SIGTERM. **No OOM-kill, no `[PMAT-333] Dequantizing` log, no `Killed` log.** Before this fix, the run SIGKILL'd within seconds of `Fused gradient clipping` due to the F32 dequant memory pressure. ## Falsifier mapping (`cuda-q4k-frozen-teacher-v1.yaml`) * FT-Q4K-TEACHER-001 PASS: no `[PMAT-333] Dequantizing` line in the log. * FT-Q4K-TEACHER-002 partial: 36 GB total includes student F32 + grads + Adam; teacher contribution is dominated by Q4K blocks, not F32 inflation. * FT-Q4K-TEACHER-005 partial: process completes >1 training step without OOM-kill. (Full 1-epoch completion deferred — teacher forward via realizar is slow enough that 31 steps exceeds the test's 15-minute timeout. Throughput tuning is separate work; the contract's "no OOM" invariant is satisfied.) ## Practical impact Phase 4 distillation dispatch can now select the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) on GB10 without the smoke-mode TEACHER=STUDENT=0.5B workaround. Combined with PR #1863, the cuda distill backend on Grace Blackwell now matches the practical expectations of "128 GB unified memory means I can train with a real teacher." Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 22, 2026
noahgift
added a commit
that referenced
this pull request
May 22, 2026
…(PMAT-701 follow-up) (#1871) The Phase 4 Stage D 50K + 10K runs (2026-05-20/21) silently inherited the Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. Result: no KD signal, 30 h of compute that fine-tuned the base model toward gibberish on a small corpus. Documented in `evidence/distill-7b-teacher-loadtest-gx10/findings.json` + this spec amendment. Now that PMAT-701 Bug A (PR #1863) and Bug B (PR #1869) have landed, the 7B Q4K teacher is feasible on Grace Blackwell GB10: * PR #1863: trueno-gpu allocator autodetects unified-memory devices (Grace, Tegra) and routes to cuMemAllocManaged so the full 128 GB pool is reachable. * PR #1869: new RealizarQ4KTeacher keeps Q4K teacher weights quantized on the GPU (no F32 dequant at upload), eliminating the OOM-kill that was killing the first training step. This PR flips the dispatch script's default and codifies the why in spec §86: * `scripts/dispatch-distill-phase-3-gx10.sh` — TEACHER_REPO default changes from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) to `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the MODEL-1 teacher the spec was designed around). Smoke-only callers override with the env var. * `docs/specifications/aprender-train/distillation-epic-spec.md` — adds §86 documenting the 5-whys, the fix references, and a new falsifier F-DISTILL-V2-001-TEACHER-DIVERGENCE that rejects future Phase-4-class dispatches where teacher == student unless an explicit override is set. * Spec version bumped to 1.2.0 with changelog entry. The §86 amendment also notes that the existing 50K + 10K Stage D runs do NOT count toward AC-DISTILL-003 — they're discharged as no-KD baselines, and a re-dispatched 50K run with the 7B teacher is required for a real Phase 4 verdict. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 22, 2026
…g turn Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn — the realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization — empirical proof of the defect. The amendment includes: * Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18 * Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes. * Fix references: PR #1879 (PMAT-704) — cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback. * Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted). * Methodology lesson: cheap-experiment-before-design discipline. * Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879. Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two compounding bugs prevented the MODEL-1 7B teacher (paiml/qwen2.5-coder-7b-apache-q4k-v1) from being usable in `apr distill --backend cuda` on Grace Blackwell GB10 despite the device having 128 GB unified memory.
5-whys evidence chain
Captured in `evidence/distill-7b-teacher-loadtest-gx10/findings.json`. Verified at three points on gx10 GB10:
What this PR contains
Code (Fix A) — `crates/aprender-gpu/src/driver/memory/buffer.rs`:
Tests — `crates/aprender-gpu/src/driver/memory/tests.rs`:
Contracts (both validate `0 errors, 0 warnings` via `pv validate`):
Practical impact
Phase 4 distill dispatch can now use a real teacher (1.5B Qwen2.5-Coder-Instruct) on GB10 with default settings, replacing the smoke-mode TEACHER=STUDENT=0.5B workaround that produced no real KD signal (see commit context). 7B teacher still gated on Bug B implementation (separate PR).
Test plan
🤖 Generated with Claude Code