spec(blackwell): SPEC-BLACKWELL-FIX-001 — GB10 training enablement (PMAT-700)#1803

Merged
noahgift merged 2 commits into main from spec/blackwell-backend-fix-pmat-700
May 19, 2026

@noahgift

Summary

GB10 (sm_121) hits CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during apr distill --backend cuda, even with a 0.5B model on a 128GB unified memory pool. The root cause is PTX JIT cache memory pressure — pre-warming 27+ custom kernels for sm_121 allocates enough VRAM that the subsequent transformer block upload has no headroom.

Design — three coordinated backend changes

| Fix | What | Effort | Standalone unblocks GB10? |
|---|---|---|---|
| #1 PTX precompilation | Build-time compile PTX for sm_75/89/90/120/121, embed as binary blobs in trueno-gpu. Runtime cache loads blobs instead of JIT-compiling | 2-3 days | ✅ Yes |
| #2 cuBLAS for backward GEMM | Migrate 7 backward kernels per layer (q/k/v/o/gate/up/down) from custom PTX → cuBLAS. Eliminates 168 JIT modules on a 24-layer model | 3-4 days | ✅ Yes, and a perf win on every host |
| #3 wgpu backward fallback | When CUDA backward fails (OOM/JIT), fall through to wgpu WGSL backward shaders (already exist per #1582). Correctness safety net | 1-2 days | ❌ Insurance only |

Each fix has its own falsifier: F-BLACKWELL-PTX-001, F-BLACKWELL-CUBLAS-001, F-BLACKWELL-WGPU-FALLBACK-001.
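
As a rough illustration of the runtime side of Fix #1, the sketch below shows the kind of lookup the cache could perform: use an embedded blob when the device's compute capability is one of the precompiled targets, otherwise compile at runtime. `PRECOMPILED_ARCHS`, `KernelSource`, and `select_kernel_source` are hypothetical names, not trueno-gpu APIs, and falling back to JIT for unlisted architectures is an assumption rather than something this spec states.

```rust
/// Compute capabilities listed for build-time precompilation in Fix #1.
const PRECOMPILED_ARCHS: [u32; 5] = [75, 89, 90, 120, 121];

/// Where a kernel module comes from at load time (hypothetical type).
#[derive(Debug, PartialEq)]
enum KernelSource {
    /// A blob embedded at build time; no runtime JIT is needed.
    Embedded { sm: u32 },
    /// No blob for this architecture; compile from PTX at runtime.
    /// (Assumed behavior for architectures outside the precompiled set.)
    RuntimeJit { sm: u32 },
}

/// Pick the kernel source for a device with compute capability `sm`,
/// for example 121 for GB10.
fn select_kernel_source(sm: u32) -> KernelSource {
    if PRECOMPILED_ARCHS.contains(&sm) {
        KernelSource::Embedded { sm }
    } else {
        KernelSource::RuntimeJit { sm }
    }
}
```

With this shape, `select_kernel_source(121)` resolves to the embedded path, so the pre-warm step would never invoke the JIT on GB10.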

Why not just wait for trueno 0.4.36?

The memory rule says upstream trueno will fix Blackwell training, but the timeline is uncertain. This spec ships an in-tree workaround built on the backends already in tree, giving us a 7-10 day path to a working gx10 dispatch, independent of the upstream timeline.

Evidence

Live OOM traces from gx10 dispatch attempts (Qwen 0.5B teacher, Qwen 0.5B student):

  • evidence/distill-phase-3-gb10-oom/dispatch-v6-0.5b-oom.txt
  • evidence/distill-phase-3-gb10-oom/remote-traces.txt

Pre-warm contract that documents the JIT pressure: crates/aprender-train/src/finetune/classify_pipeline/gpu.rs:56-59 (C-PREWARM-001).

Recommendation

Start with Fix #2 (cuBLAS backward GEMM). Three reasons:

  1. Drop-in replacement with parity test — low risk
  2. Standalone perf win on every host (not just Blackwell)
  3. Reduces JIT pressure by ~40% per layer, likely sufficient for GB10 alone

If #2 doesn't fully unblock GB10, layer in Fix #1 (PTX precomp) to eliminate runtime JIT entirely. Fix #3 lands last as a cross-vendor safety net.
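
The safety-net behavior of Fix #3 can be sketched as a small dispatch wrapper: try the CUDA backward pass, and only on an OOM or JIT failure retry on wgpu. This is a minimal sketch under stated assumptions; `Backend`, `BackwardError`, and `backward_with_fallback` are illustrative names, and the real backward entry points in the tree will have different signatures.

```rust
/// Which backend produced the result (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cuda,
    Wgpu,
}

/// Failure modes of a backward pass (hypothetical type).
#[derive(Debug, PartialEq)]
enum BackwardError {
    OutOfMemory,
    JitFailure,
    Other(String),
}

/// Run the CUDA backward pass; on OOM or JIT failure, fall through to wgpu.
/// Any other error is returned unchanged so real bugs are not masked.
fn backward_with_fallback<T>(
    cuda: impl FnOnce() -> Result<T, BackwardError>,
    wgpu: impl FnOnce() -> Result<T, BackwardError>,
) -> Result<(Backend, T), BackwardError> {
    match cuda() {
        Ok(value) => Ok((Backend::Cuda, value)),
        Err(BackwardError::OutOfMemory) | Err(BackwardError::JitFailure) => {
            wgpu().map(|value| (Backend::Wgpu, value))
        }
        Err(other) => Err(other),
    }
}
```

Returning the backend alongside the value lets a caller log when the fallback fired, which matters because a silent fallback would hide the throughput cost.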

Rollout

Test plan

This PR is a spec document; the falsifiers fire on the implementation PRs (PMAT-700-B through -D). Acceptance for Phase E:

  • STEPS=50 ./scripts/dispatch-distill-phase-3-gx10.sh completes without OOM
  • final_loss < initial_loss (F-DISTILL-SMOKE-001 dischargeable)
  • gx10 throughput ≥ 0.5× lambda-vector RTX 4090
  • No regression on RTX 4090
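
The four acceptance criteria above could be checked mechanically along these lines. This is only a sketch: `PhaseERun` and `failed_criteria` are hypothetical, the throughput unit is left open (both values just need to share one), and "no regression" is modeled as a boolean because this spec does not define the metric.

```rust
/// Hypothetical summary of one Phase E smoke run; field names are illustrative.
struct PhaseERun {
    completed_without_oom: bool,
    initial_loss: f64,
    final_loss: f64,
    /// Both throughputs must use the same unit (for example tokens/s).
    gx10_throughput: f64,
    rtx4090_throughput: f64,
    /// Outcome of whatever RTX 4090 regression check the project uses.
    rtx4090_regressed: bool,
}

/// Return the acceptance criteria that the run failed, in spec order.
fn failed_criteria(run: &PhaseERun) -> Vec<&'static str> {
    let mut failed = Vec::new();
    if !run.completed_without_oom {
        failed.push("completes without OOM");
    }
    // Negated comparisons so that a NaN loss or throughput counts as a failure.
    if !(run.final_loss < run.initial_loss) {
        failed.push("final_loss < initial_loss");
    }
    if !(run.gx10_throughput >= 0.5 * run.rtx4090_throughput) {
        failed.push("gx10 throughput >= 0.5x RTX 4090");
    }
    if run.rtx4090_regressed {
        failed.push("no regression on RTX 4090");
    }
    failed
}
```

An empty result means all four criteria hold; anything else names the criteria that still block Phase E.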

🤖 Generated with Claude Code

…sign (PMAT-700)

GB10 (sm_121) hits CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during
apr distill --backend cuda, even with a 0.5B teacher on 128GB unified
memory. Root cause is PTX JIT cache memory pressure: pre-warming 27+
custom kernels for sm_121 allocates so much VRAM that the subsequent
block upload has no headroom. Live evidence in
evidence/distill-phase-3-gb10-oom/.

Design surfaces a three-pronged root-cause fix using the existing
cuBLAS / PTX / wgpu backends, not a workaround:

  Fix #1 — PTX precompilation for sm_121 (eliminates JIT entirely)
  Fix #2 — Migrate backward GEMMs to cuBLAS (eliminates 7 PTX kernels
          per layer; 168 fewer JIT modules for a 24-layer model)
  Fix #3 — wgpu backward fallback (cross-vendor safety net)

Each fix has its own falsifier (F-BLACKWELL-PTX-001,
F-BLACKWELL-CUBLAS-001, F-BLACKWELL-WGPU-FALLBACK-001) and standalone
value.
Fix #2 alone is likely sufficient for the immediate gx10 unblock
(3-4 day effort, drop-in, also a perf win on every host).

Rollout: spec → Fix #2 → Fix #1 → Fix #3 → re-dispatch Phase 3 on
gx10. 7-10 days from spec-land to gx10 dispatch working, vs waiting
on upstream trueno 0.4.36 (uncertain timeline).

Evidence:
- evidence/distill-phase-3-gb10-oom/dispatch-v6-0.5b-oom.txt
- evidence/distill-phase-3-gb10-oom/remote-traces.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift merged commit d0aa67c into main on May 19, 2026
10 checks passed
@noahgift deleted the spec/blackwell-backend-fix-pmat-700 branch on May 19, 2026 06:42