feat(cuda): autodetect Grace Blackwell + Q4K frozen-teacher contract (PMAT-701) by noahgift · Pull Request #1863 · paiml/aprender

noahgift · 2026-05-22T06:35:13Z

Summary

Two compounding bugs prevented the MODEL-1 7B teacher (paiml/qwen2.5-coder-7b-apache-q4k-v1) from being usable in `apr distill --backend cuda` on Grace Blackwell GB10 despite the device having 128 GB unified memory.

Bug A (FIXED): trueno-gpu allocator default used `cuMemAlloc` (device-only, ~30 GB cap on GB10) instead of `cuMemAllocManaged` (full 128 GB unified). PMAT-394 managed path existed but was opt-in only.
Bug B (contract authored, implementation deferred): cuda training backend dequantizes Q4K teacher weights to F32 at GPU upload (4 GB → 28 GB inflation), making 7B teachers OOM-kill even with managed memory. No Q4K-native frozen-teacher path exists.
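
The 4 GB → 28 GB inflation quoted for Bug B can be checked with simple arithmetic. The sketch below assumes a nominal 7,000,000,000 parameters and the GGML-style Q4_K super-block layout (256 weights packed into 144 bytes, about 4.5 bits per weight). Both are assumptions made for illustration; the function names are hypothetical and do not come from the aprender codebase.

```rust
/// Assumed Q4_K super-block layout (GGML-style): 256 weights in 144 bytes.
const Q4K_BLOCK_WEIGHTS: u64 = 256;
const Q4K_BLOCK_BYTES: u64 = 144;

/// Bytes needed to hold `params` weights as F32 (4 bytes each).
fn f32_bytes(params: u64) -> u64 {
    params * 4
}

/// Bytes needed to hold `params` weights as Q4_K, rounded up to whole blocks.
fn q4k_bytes(params: u64) -> u64 {
    let blocks = (params + Q4K_BLOCK_WEIGHTS - 1) / Q4K_BLOCK_WEIGHTS;
    blocks * Q4K_BLOCK_BYTES
}

/// Convert a byte count to decimal gigabytes for reporting.
fn gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

With a nominal 7B parameters, `gb(f32_bytes(7_000_000_000))` gives 28.0 and `gb(q4k_bytes(7_000_000_000))` gives roughly 3.94, which matches the figures above. A real checkpoint that mixes Q4K with Q6K or F32 tensors would land somewhat higher on the quantized side.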

5-whys evidence chain

Captured in `evidence/distill-7b-teacher-loadtest-gx10/findings.json`. Verified at three points on gx10 GB10:

Pre-fix (`launch.log`): `apr distill` 7B teacher OOMs at Block 27/28 — `cuMemAlloc` ceiling hit.
Explicit `MANAGED_MEMORY=1` (`launch-managed.log`): all 28 blocks upload, then SIGKILL during step 0 — confirms Bug B independent of Bug A.
Post-fix-A (`launch-after-fix-a.log`): NO env var, all 28 blocks upload via autodetect — falsifier FT-ALLOC-DISTILL-7B-006 PASSES.

What this PR contains

Code (Fix A) — `crates/aprender-gpu/src/driver/memory/buffer.rs`:

Adds `DeviceMemoryClass` enum (UnifiedMemory | ClassicDevice) and `classify_device_memory(ctx)` querying `CU_DEVICE_ATTRIBUTE_INTEGRATED` via `cuDeviceGetAttribute`.
Adds `should_use_managed_memory(ctx)` that honors `MANAGED_MEMORY=1` / `=0` env var as explicit override, otherwise follows device classification.
`GpuBuffer::new` dispatches to `cuMemAllocManaged` on integrated GPUs (Grace, Tegra), `cuMemAlloc` on discrete dGPUs (Ada, Hopper, Ampere) — preserves prior behavior for non-Grace.
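
The decision logic described above can be sketched without a CUDA context. This is a minimal illustration, not the code in `buffer.rs`: the real functions take a `ctx` and query the driver, whereas here the raw `CU_DEVICE_ATTRIBUTE_INTEGRATED` value and the `MANAGED_MEMORY` value are passed in as parameters. The helper names `classify_from_integrated_attr` and `should_use_managed`, and the handling of unrecognized env values, are assumptions.

```rust
/// Memory architecture of a CUDA device, mirroring the two variants named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeviceMemoryClass {
    UnifiedMemory,
    ClassicDevice,
}

/// Classify from the raw value of CU_DEVICE_ATTRIBUTE_INTEGRATED.
/// The driver reports 1 for GPUs integrated with host memory, 0 otherwise.
fn classify_from_integrated_attr(integrated: i32) -> DeviceMemoryClass {
    if integrated != 0 {
        DeviceMemoryClass::UnifiedMemory
    } else {
        DeviceMemoryClass::ClassicDevice
    }
}

/// Decide whether to allocate with cuMemAllocManaged.
/// `env_override` is the value of MANAGED_MEMORY, if the variable is set.
fn should_use_managed(env_override: Option<&str>, class: DeviceMemoryClass) -> bool {
    match env_override {
        Some("1") => true,  // explicit opt-in wins over detection
        Some("0") => false, // explicit opt-out wins over detection
        // Assumption: unset or unrecognized values fall through to autodetection.
        _ => class == DeviceMemoryClass::UnifiedMemory,
    }
}
```

A caller would typically obtain the override with `std::env::var("MANAGED_MEMORY").ok()` and pass `.as_deref()`. Keeping the decision as a pure function is also what makes the four allocator tests straightforward to write without GPU hardware.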

Tests — `crates/aprender-gpu/src/driver/memory/tests.rs`:

`allocator_tests::classify_gb10_unified` (FT-ALLOC-AUTODETECT-001)
`allocator_tests::classify_rtx4090_classic` (FT-ALLOC-AUTODETECT-002)
`allocator_tests::env_override_managed_forced` (FT-ALLOC-DISPATCH-004)
`allocator_tests::env_override_device_only` (FT-ALLOC-DISPATCH-005)

Contracts (both validate `0 errors, 0 warnings` via `pv validate`):

`contracts/trueno-gpu/cuda-unified-memory-allocator-v1.yaml` — Bug A spec, 6 falsifiers, 2 Kani harnesses, qa_gate F-ALLOC-UNIFIED-001.
`contracts/cuda-q4k-frozen-teacher-v1.yaml` — Bug B design spec, 5 falsifiers, 2 Kani harnesses, qa_gate F-Q4K-TEACHER-001. Implementation is multi-PR scope.

Practical impact

Phase 4 distill dispatch can now use a real teacher (1.5B Qwen2.5-Coder-Instruct) on GB10 with default settings, replacing the smoke-mode TEACHER=STUDENT=0.5B workaround that produced no real KD signal (see commit context). The 7B teacher is still gated on the Bug B implementation (separate PR).

Test plan

`cargo build --release -p aprender-gpu --features cuda` — clean
`cargo build --release -p apr-cli --features cuda` on gx10 — clean
`pv validate contracts/trueno-gpu/cuda-unified-memory-allocator-v1.yaml` — clean
`pv validate contracts/cuda-q4k-frozen-teacher-v1.yaml` — clean
`cargo test -p aprender-contracts --lib lint::` — 191 passed
FT-ALLOC-DISTILL-7B-006 verified on gx10 GB10 (`launch-after-fix-a.log`)
CI: `ci / gate` + `workspace-test` green
Follow-up: file PMAT-701 Tier 3 for Bug B implementation

🤖 Generated with Claude Code

…her (PMAT-701)

Two compounding bugs prevented the MODEL-1 7B teacher (paiml/qwen2.5-coder-7b-apache-q4k-v1) from being usable in `apr distill --backend cuda` on Grace Blackwell GB10 despite the device having 128 GB unified memory.

5-whys investigation (evidence/distill-7b-teacher-loadtest-gx10/findings.json):

Bug A (this PR, FIXED): trueno-gpu allocator default used cuMemAlloc (device-only, ~30 GB cap on GB10) instead of cuMemAllocManaged (full 128 GB unified pool). The PMAT-394 managed path existed but was gated behind opt-in `MANAGED_MEMORY=1` env var with no device-class autodetection.

Bug B (contract authored, implementation deferred): cuda training backend dequantizes Q4K teacher weights to F32 at GPU upload (4 GB → 28 GB inflation), making 7B teachers infeasible even with managed memory due to Linux OOM-kill. No Q4K-native frozen-teacher forward path exists.

This PR:

1. Adds `classify_device_memory` + `should_use_managed_memory` to GpuBuffer (crates/aprender-gpu/src/driver/memory/buffer.rs). Default allocator queries CU_DEVICE_ATTRIBUTE_INTEGRATED via cuDeviceGetAttribute; integrated GPUs (Grace, Tegra) route to cuMemAllocManaged, discrete dGPUs (Ada, Hopper, Ampere) keep cuMemAlloc. Legacy MANAGED_MEMORY=1/0 env var override preserved.
2. Authors `contracts/trueno-gpu/cuda-unified-memory-allocator-v1.yaml`: classifies device memory architecture, defines 6 falsification tests (FT-ALLOC-AUTODETECT-001/002, FT-ALLOC-DISPATCH-003/004/005, FT-ALLOC-DISTILL-7B-006), 2 Kani harnesses, qa_gate F-ALLOC-UNIFIED-001.
3. Authors `contracts/cuda-q4k-frozen-teacher-v1.yaml`: codifies the Q4K frozen-teacher invariant (no dequant at GPU upload, Q4K-native forward kernels reuse realizar inference path, type-level no_grad invariant on CudaBlock::Q4K). 5 falsification tests, 2 Kani harnesses. Implementation is multi-PR scope (trueno Q4K backward + aprender-train CudaBlock::Q4K enum variant); contract is the design spec for that follow-up work.
4. Captures 5-whys analysis with verified evidence at three points:
   - launch.log: pre-fix OOM at Block 27/28 (cuMemAlloc ceiling)
   - launch-managed.log: explicit MANAGED_MEMORY=1, all blocks upload, then OOM-kill during step (Bug B confirmed independent of Bug A)
   - launch-after-fix-a.log: NO env var, all 28 blocks upload via autodetect (Fix A verified end-to-end on gx10 GB10)

Falsifier FT-ALLOC-DISTILL-7B-006 PASSES post-fix: 7B teacher loads all 28 transformer blocks on GB10 without MANAGED_MEMORY env var. The remaining SIGKILL is Bug B territory (cuda-q4k-frozen-teacher-v1.yaml).

Practical impact: 1.5B and smaller teachers now usable in `apr distill --backend cuda` on Grace Blackwell with default settings, unblocking the Phase 4 distill fix that's been carrying TEACHER=STUDENT=0.5B as a smoke-mode workaround.

Both contracts validate clean: `pv validate` reports 0 errors, 0 warnings.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…AT-701 Bug B) (#1869)

Implements `contracts/cuda-q4k-frozen-teacher-v1.yaml` (landed in PR #1863).

Before this fix, `apr distill --backend cuda` with a Q4K teacher (e.g. the MODEL-1 teacher `paiml/qwen2.5-coder-7b-apache-q4k-v1`) was killed by the Linux OOM killer at the first training step. Even with PR #1863's allocator fix, the legacy `CudaTrainerTeacher` dequantizes Q4K weights to F32 at GPU upload (4 GB on disk → 28 GB F32), and that inflation plus student grads + Adam + activations exceeded the OOM threshold on Grace Blackwell GB10.

## What this PR adds

`crates/apr-cli/src/commands/distill_q4k_teacher.rs`:

* `RealizarQ4KTeacher`: wraps realizar's `OwnedQuantizedModelCuda` (the same inference-time path validated by `apr run`). Weights live on the GPU in their native Q4K format; forward GEMM uses Q4K-native CUDA kernels. No F32 dequantization at upload, no gradient/optimizer state.
* Implements `entrenar_distill::teacher_provider::TeacherLogitsProvider`: `logits_for_batch` delegates per-element to `cuda_model.forward_cuda`.

`crates/apr-cli/src/commands/distill.rs`:

* `run_cuda_backend` now inspects the teacher .apr's tensor dtype histogram. If any tensor is Q4K or Q6K, route to `RealizarQ4KTeacher`. F32/F16/BF16 teachers continue to use `CudaTrainerTeacher` (the dequant path is harmless for those types).

`Cargo.toml`:

* Adds `aprender-train-common` (`entrenar_common`) as an optional dep on apr-cli's `training` feature to surface `EntrenarError::Internal` in the new teacher impl.

## Verification on gx10 (Grace Blackwell GB10, sm_121)

Captured in `evidence/distill-7b-teacher-loadtest-gx10/launch-after-fix-b.log`:

* `[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher`: dispatch fires.
* `[PMAT-701] RealizarQ4KTeacher: pre-uploaded ... MB to GPU (Q4K-native, no F32 dequant)`: teacher staged.
* `✓ 24 transformer blocks uploaded to GPU` + `✓ GPU training state allocated (LM head: 544.5 MB)`: student loaded.
* `✓ Fused gradient clipping: 1506 partials (5.9 KB)`: training state ready.

Process then ran for 15 minutes of stable training at **~36 GB system memory** (well under the 122 GB MemAvailable ceiling), terminated by the test's `timeout 900` SIGTERM. **No OOM-kill, no `[PMAT-333] Dequantizing` log, no `Killed` log.** Before this fix, the run SIGKILL'd within seconds of `Fused gradient clipping` due to the F32 dequant memory pressure.

## Falsifier mapping (`cuda-q4k-frozen-teacher-v1.yaml`)

* FT-Q4K-TEACHER-001 PASS: no `[PMAT-333] Dequantizing` line in the log.
* FT-Q4K-TEACHER-002 partial: 36 GB total includes student F32 + grads + Adam; teacher contribution is dominated by Q4K blocks, not F32 inflation.
* FT-Q4K-TEACHER-005 partial: process completes >1 training step without OOM-kill. (Full 1-epoch completion deferred: teacher forward via realizar is slow enough that 31 steps exceeds the test's 15-minute timeout. Throughput tuning is separate work; the contract's "no OOM" invariant is satisfied.)

## Practical impact

Phase 4 distillation dispatch can now select the MODEL-1 7B teacher (`paiml/qwen2.5-coder-7b-apache-q4k-v1`) on GB10 without the smoke-mode TEACHER=STUDENT=0.5B workaround. Combined with PR #1863, the cuda distill backend on Grace Blackwell now matches the practical expectations of "128 GB unified memory means I can train with a real teacher."

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
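
The dtype-based routing that the #1869 commit message describes for `run_cuda_backend` can be sketched as a pure function over a dtype histogram. This is an illustrative sketch only: the `TensorDtype` and `TeacherBackend` enums and the `select_teacher_backend` name are hypothetical stand-ins, and the real code reads the histogram from the teacher `.apr` file.

```rust
use std::collections::HashMap;

/// Hypothetical subset of tensor dtypes found in a teacher checkpoint.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum TensorDtype {
    F32,
    F16,
    BF16,
    Q4K,
    Q6K,
}

/// Which teacher implementation the distill command should construct.
#[derive(Debug, PartialEq, Eq)]
enum TeacherBackend {
    /// Keeps weights quantized on the GPU (RealizarQ4KTeacher in the text).
    RealizarQ4K,
    /// Legacy path that uploads weights as F32 (CudaTrainerTeacher in the text).
    CudaTrainer,
}

/// Route to the quantized-native teacher if any tensor is Q4K or Q6K.
fn select_teacher_backend(histogram: &HashMap<TensorDtype, usize>) -> TeacherBackend {
    let has_quantized = [TensorDtype::Q4K, TensorDtype::Q6K]
        .iter()
        .any(|dtype| histogram.get(dtype).copied().unwrap_or(0) > 0);
    if has_quantized {
        TeacherBackend::RealizarQ4K
    } else {
        TeacherBackend::CudaTrainer
    }
}
```

A histogram with even a single Q4K or Q6K tensor selects the quantized path, while an all-F32/F16/BF16 checkpoint keeps the legacy teacher, for which the dequant step is harmless.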

…(PMAT-701 follow-up) (#1871)

The Phase 4 Stage D 50K + 10K runs (2026-05-20/21) silently inherited the Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. Result: no KD signal, 30 h of compute that fine-tuned the base model toward gibberish on a small corpus. Documented in `evidence/distill-7b-teacher-loadtest-gx10/findings.json` + this spec amendment.

Now that PMAT-701 Bug A (PR #1863) and Bug B (PR #1869) have landed, the 7B Q4K teacher is feasible on Grace Blackwell GB10:

* PR #1863: trueno-gpu allocator autodetects unified-memory devices (Grace, Tegra) and routes to cuMemAllocManaged so the full 128 GB pool is reachable.
* PR #1869: new RealizarQ4KTeacher keeps Q4K teacher weights quantized on the GPU (no F32 dequant at upload), eliminating the OOM-kill that was killing the first training step.

This PR flips the dispatch script's default and codifies the why in spec §86:

* `scripts/dispatch-distill-phase-3-gx10.sh`: TEACHER_REPO default changes from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) to `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the MODEL-1 teacher the spec was designed around). Smoke-only callers override with the env var.
* `docs/specifications/aprender-train/distillation-epic-spec.md`: adds §86 documenting the 5-whys, the fix references, and a new falsifier F-DISTILL-V2-001-TEACHER-DIVERGENCE that rejects future Phase-4-class dispatches where teacher == student unless an explicit override is set.
* Spec version bumped to 1.2.0 with changelog entry.

The §86 amendment also notes that the existing 50K + 10K Stage D runs do NOT count toward AC-DISTILL-003; they're discharged as no-KD baselines, and a re-dispatched 50K run with the 7B teacher is required for a real Phase 4 verdict.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
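
The intent of the F-DISTILL-V2-001-TEACHER-DIVERGENCE falsifier named in the #1871 message (reject dispatches where teacher == student unless explicitly overridden) can be expressed as a small guard. The sketch below is hypothetical: the source does not say how the override is spelled or where the check lives, so the `allow_same` flag, the `DispatchError` type, and the function name are all assumptions.

```rust
/// Hypothetical error returned when a dispatch would produce no KD signal.
#[derive(Debug, PartialEq, Eq)]
enum DispatchError {
    TeacherEqualsStudent { repo: String },
}

/// Reject a distillation dispatch whose teacher and student are the same model,
/// unless the caller explicitly opts in (for example, for a smoke test).
/// `allow_same` is an assumed stand-in for the "explicit override" in the text.
fn check_teacher_divergence(
    teacher_repo: &str,
    student_init: &str,
    allow_same: bool,
) -> Result<(), DispatchError> {
    if teacher_repo == student_init && !allow_same {
        return Err(DispatchError::TeacherEqualsStudent {
            repo: teacher_repo.to_string(),
        });
    }
    Ok(())
}
```

Failing loudly at dispatch time is the point of the guard: the Stage D runs described above inherited the smoke configuration silently, and a check of this shape would have stopped them before any compute was spent.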

…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a wrong turn. The realizar `_cuda` forward path is CPU-bound and unusable as a distillation teacher on Grace Blackwell GB10. The 7B vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU at 0% utilization, which is empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer SIGKILL on the explicit-managed path), with file/line citations pointing to the CPU-heavy ops in crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704): cuBLAS default, RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`, `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877, #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86 (via PR #1871, also pending merge) and §87 (this PR). The amendment notes the §86 cross-reference and explains the order-of-operations in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
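
The backend selection that the §87 message attributes to PR #1879 (cuBLAS by default, `RealizarQ4KTeacher` only when `APR_DISTILL_TEACHER_BACKEND=realizar-q4k` is set) might look roughly like the sketch below. Only the env var name and the `realizar-q4k` value come from the text; the enum, the function name, and the treatment of unset or unrecognized values are assumptions.

```rust
/// Hypothetical teacher backends after the PMAT-704 change described above.
#[derive(Debug, PartialEq, Eq)]
enum TeacherBackend {
    /// Default path.
    CuBlas,
    /// Opt-in fallback that keeps Q4K weights quantized on the GPU.
    RealizarQ4K,
}

/// Pick the backend from the value of APR_DISTILL_TEACHER_BACKEND, if set.
fn backend_from_env(value: Option<&str>) -> TeacherBackend {
    match value {
        Some("realizar-q4k") => TeacherBackend::RealizarQ4K,
        // Assumption: unset or unrecognized values select the default backend.
        _ => TeacherBackend::CuBlas,
    }
}
```

A caller would pass `std::env::var("APR_DISTILL_TEACHER_BACKEND").ok().as_deref()`. Making the slower path opt-in rather than removing it is consistent with the contract being demoted rather than retracted.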

noahgift enabled auto-merge (squash) May 22, 2026 06:35

noahgift merged commit 5b3a862 into main May 22, 2026
11 checks passed

noahgift deleted the fix/cuda-unified-memory-allocator-pmat-701 branch May 22, 2026 07:01

This was referenced May 22, 2026

fix(distill): RealizarQ4KTeacher — Q4K-native frozen-teacher path (PMAT-701 Bug B) #1869

Merged

chore(distill): default to MODEL-1 7B teacher + SPEC-DISTILL-001 §86 (PMAT-701 follow-up) #1871

Merged

