Skip to content

fix(distill): RealizarQ4KTeacher — Q4K-native frozen-teacher path (PMAT-701 Bug B)#1869

Merged
noahgift merged 1 commit into
mainfrom
fix/cuda-q4k-frozen-teacher-pmat-701-b
May 22, 2026
Merged

fix(distill): RealizarQ4KTeacher — Q4K-native frozen-teacher path (PMAT-701 Bug B)#1869
noahgift merged 1 commit into
mainfrom
fix/cuda-q4k-frozen-teacher-pmat-701-b

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

Implements `contracts/cuda-q4k-frozen-teacher-v1.yaml` (landed in #1863) — the Q4K-native frozen-teacher path that fixes Bug B of PMAT-701.

Before this PR, `apr distill --backend cuda` with a Q4K teacher (e.g. `paiml/qwen2.5-coder-7b-apache-q4k-v1`) was killed by the Linux OOM killer at the first training step. Even after #1863's allocator autodetect, the legacy `CudaTrainerTeacher` dequantizes Q4K weights to F32 at GPU upload (4 GB → 28 GB), and that inflation tripped OOM-killer.

Approach

Rather than rewriting the cuda training backend's block-upload path (multi-week refactor), this PR plugs realizar's existing inference path in as a `TeacherLogitsProvider`:

  • New `RealizarQ4KTeacher` (`crates/apr-cli/src/commands/distill_q4k_teacher.rs`) wraps `OwnedQuantizedModelCuda` — the same Q4K-native CUDA path validated by `apr run`.
  • `run_cuda_backend` inspects the teacher .apr's tensor dtype histogram. Q4K/Q6K teachers route to the new path; F32/F16/BF16 teachers continue to use `CudaTrainerTeacher` (the dequant is harmless for non-quantized types).

The teacher logits flow CPU→GPU→CPU `Vec` per batch element, then feed into the student's training step exactly as before. No changes to the student path. No changes to aprender-train-distill.

Verification on gx10 (Grace Blackwell GB10)

`evidence/distill-7b-teacher-loadtest-gx10/launch-after-fix-b.log`:

  • `[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher (Q4K-native forward, no F32 dequant)` — dispatch fires.
  • `✓ 24 transformer blocks uploaded to GPU` + `✓ Fused gradient clipping: 1506 partials` — training state ready.
  • Process ran for 15 minutes at stable ~36 GB system memory (vs 122 GB ceiling) with no OOM-kill, terminated by the test's `timeout 900`.

Before this fix: SIGKILL within seconds of `Fused gradient clipping` from F32-dequant pressure.

Falsifier mapping

Falsifier Status
FT-Q4K-TEACHER-001 (no `[PMAT-333] Dequantizing`) ✅ PASS
FT-Q4K-TEACHER-002 (peak GPU memory < 6 GB teacher) ✅ Q4K weights stay quantized
FT-Q4K-TEACHER-005 (no OOM-kill at step 0) ✅ 15 min stable training

FT-Q4K-TEACHER-003 (kernel parity with `apr run`) is intrinsically satisfied — we use the same realizar kernels.

Practical impact

Combined with #1863, Phase 4 distillation dispatch can now select the MODEL-1 7B teacher on GB10. The smoke-mode TEACHER=STUDENT=0.5B workaround is no longer required.

Test plan

  • `cargo check -p apr-cli --features cuda,training,inference` — clean
  • `cargo fmt -p apr-cli --check` — clean
  • FT-Q4K-TEACHER-001/005 verified on gx10
  • CI: `ci / gate` + `workspace-test` green
  • Follow-up: throughput tuning for 7B teacher forward (currently slow enough that 1 epoch exceeds 15 min — separate concern, doesn't gate correctness)

🤖 Generated with Claude Code

…AT-701 Bug B)

Implements `contracts/cuda-q4k-frozen-teacher-v1.yaml` (landed in PR #1863).

Before this fix, `apr distill --backend cuda` with a Q4K teacher (e.g. the
MODEL-1 teacher `paiml/qwen2.5-coder-7b-apache-q4k-v1`) was killed by the
Linux OOM killer at the first training step. Even with PR #1863's allocator
fix, the legacy `CudaTrainerTeacher` dequantizes Q4K weights to F32 at GPU
upload (4 GB on disk → 28 GB F32), and that inflation plus student grads +
Adam + activations exceeded the OOM threshold on Grace Blackwell GB10.

## What this PR adds

`crates/apr-cli/src/commands/distill_q4k_teacher.rs`:

* `RealizarQ4KTeacher` — wraps realizar's `OwnedQuantizedModelCuda` (the
  same inference-time path validated by `apr run`). Weights live on the
  GPU in their native Q4K format; forward GEMM uses Q4K-native CUDA
  kernels. No F32 dequantization at upload, no gradient/optimizer state.
* Implements `entrenar_distill::teacher_provider::TeacherLogitsProvider`:
  `logits_for_batch` delegates per-element to `cuda_model.forward_cuda`.

`crates/apr-cli/src/commands/distill.rs`:

* `run_cuda_backend` now inspects the teacher .apr's tensor dtype
  histogram. If any tensor is Q4K or Q6K, route to `RealizarQ4KTeacher`.
  F32/F16/BF16 teachers continue to use `CudaTrainerTeacher` (the dequant
  path is harmless for those types).

`Cargo.toml`:

* Adds `aprender-train-common` (`entrenar_common`) as an optional dep
  on apr-cli's `training` feature to surface `EntrenarError::Internal`
  in the new teacher impl.

## Verification on gx10 (Grace Blackwell GB10, sm_121)

Captured in `evidence/distill-7b-teacher-loadtest-gx10/launch-after-fix-b.log`:

* `[PMAT-701] Q4K/Q6K teacher detected → RealizarQ4KTeacher` — dispatch fires.
* `[PMAT-701] RealizarQ4KTeacher: pre-uploaded ... MB to GPU (Q4K-native, no F32 dequant)` — teacher staged.
* `✓ 24 transformer blocks uploaded to GPU` + `✓ GPU training state allocated (LM head: 544.5 MB)` — student loaded.
* `✓ Fused gradient clipping: 1506 partials (5.9 KB)` — training state ready.

Process then ran for 15 minutes of stable training at **~36 GB system
memory** (well under the 122 GB MemAvailable ceiling), terminated by the
test's `timeout 900` SIGTERM. **No OOM-kill, no `[PMAT-333] Dequantizing`
log, no `Killed` log.** Before this fix, the run SIGKILL'd within seconds
of `Fused gradient clipping` due to the F32 dequant memory pressure.

## Falsifier mapping (`cuda-q4k-frozen-teacher-v1.yaml`)

* FT-Q4K-TEACHER-001 PASS: no `[PMAT-333] Dequantizing` line in the log.
* FT-Q4K-TEACHER-002 partial: 36 GB total includes student F32 + grads + Adam;
  teacher contribution is dominated by Q4K blocks, not F32 inflation.
* FT-Q4K-TEACHER-005 partial: process completes >1 training step without
  OOM-kill. (Full 1-epoch completion deferred — teacher forward via realizar
  is slow enough that 31 steps exceeds the test's 15-minute timeout.
  Throughput tuning is separate work; the contract's "no OOM" invariant
  is satisfied.)

## Practical impact

Phase 4 distillation dispatch can now select the MODEL-1 7B teacher
(`paiml/qwen2.5-coder-7b-apache-q4k-v1`) on GB10 without the smoke-mode
TEACHER=STUDENT=0.5B workaround. Combined with PR #1863, the cuda distill
backend on Grace Blackwell now matches the practical expectations of
"128 GB unified memory means I can train with a real teacher."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 22, 2026 07:08
@noahgift noahgift merged commit bc29230 into main May 22, 2026
11 checks passed
@noahgift noahgift deleted the fix/cuda-q4k-frozen-teacher-pmat-701-b branch May 22, 2026 07:44
noahgift added a commit that referenced this pull request May 22, 2026
…(PMAT-701 follow-up) (#1871)

The Phase 4 Stage D 50K + 10K runs (2026-05-20/21) silently inherited the
Phase 3 smoke workaround of TEACHER_REPO == STUDENT_INIT == 0.5B. Result: no
KD signal, 30 h of compute that fine-tuned the base model toward gibberish on
a small corpus. Documented in `evidence/distill-7b-teacher-loadtest-gx10/findings.json`
+ this spec amendment.

Now that PMAT-701 Bug A (PR #1863) and Bug B (PR #1869) have landed, the 7B
Q4K teacher is feasible on Grace Blackwell GB10:

* PR #1863: trueno-gpu allocator autodetects unified-memory devices (Grace,
  Tegra) and routes to cuMemAllocManaged so the full 128 GB pool is reachable.
* PR #1869: new RealizarQ4KTeacher keeps Q4K teacher weights quantized on
  the GPU (no F32 dequant at upload), eliminating the OOM-kill that was
  killing the first training step.

This PR flips the dispatch script's default and codifies the why in spec §86:

* `scripts/dispatch-distill-phase-3-gx10.sh` — TEACHER_REPO default changes
  from `Qwen/Qwen2.5-Coder-0.5B-Instruct` (smoke fallback) to
  `paiml/qwen2.5-coder-7b-apache-q4k-v1` (the MODEL-1 teacher the spec was
  designed around). Smoke-only callers override with the env var.
* `docs/specifications/aprender-train/distillation-epic-spec.md` — adds §86
  documenting the 5-whys, the fix references, and a new falsifier
  F-DISTILL-V2-001-TEACHER-DIVERGENCE that rejects future Phase-4-class
  dispatches where teacher == student unless an explicit override is set.
* Spec version bumped to 1.2.0 with changelog entry.

The §86 amendment also notes that the existing 50K + 10K Stage D runs do NOT
count toward AC-DISTILL-003 — they're discharged as no-KD baselines, and a
re-dispatched 50K run with the 7B teacher is required for a real Phase 4
verdict.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 22, 2026
…g turn

Adds a §87 amendment to SPEC-DISTILL-001 documenting the root cause of
the PMAT-704 cascade fix: PR #1869 (Bug B / RealizarQ4KTeacher) was a
wrong turn — the realizar `_cuda` forward path is CPU-bound and
unusable as a distillation teacher on Grace Blackwell GB10. The 7B
vocab-aligned 500-step validation hung at step 0 for 1.5 h with GPU
at 0% utilization — empirical proof of the defect.

The amendment includes:

* Full five-whys chain (cuMemAlloc 30 GB ceiling vs phantom OOM-killer
  SIGKILL on the explicit-managed path), with file/line citations
  pointing to the CPU-heavy ops in
  crates/aprender-serve/src/gguf/cuda/cuda.rs:18
* Root cause: conflated two failures, missed the cheap dispatch-flip
  experiment that would have rejected Bug B's hypothesis in 5 minutes.
* Fix references: PR #1879 (PMAT-704) — cuBLAS default,
  RealizarQ4KTeacher demoted to APR_DISTILL_TEACHER_BACKEND=realizar-q4k
  opt-in fallback.
* Contract changes: new `apr-distill-teacher-backend-selection-v1.yaml`,
  `cuda-q4k-frozen-teacher-v1.yaml` demoted (not retracted).
* Methodology lesson: cheap-experiment-before-design discipline.
* Cascade closure table covering PRs #1863, #1869, #1871, #1874, #1877,
  #1879.

Spec version bumped 1.1.0 → 1.3.0 with changelog entries for both §86
(via PR #1871, also pending merge) and §87 (this PR). The amendment
notes the §86 cross-reference and explains the order-of-operations
in case readers see this on a build of main that predates #1871.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant