spec(blackwell): SPEC-BLACKWELL-FIX-001 — GB10 training enablement (PMAT-700) by noahgift · Pull Request #1803 · paiml/aprender

noahgift · 2026-05-19T05:55:44Z

Summary

GB10 (sm_121) hits CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during apr distill --backend cuda, even with a 0.5B model on a 128GB unified memory pool. The root cause is PTX JIT cache memory pressure — pre-warming 27+ custom kernels for sm_121 allocates enough VRAM that the subsequent transformer block upload has no headroom.

Design — three coordinated backend changes

Fix	What	Effort	Standalone unblocks GB10?
#1 PTX precompilation	Build-time compile PTX for sm_75/89/90/120/121, embed as binary blobs in trueno-gpu. Runtime cache loads blobs instead of JIT-compiling	2-3 days	✅ Yes
#2 cuBLAS for backward GEMM	Migrate 7 backward kernels per layer (q/k/v/o/gate/up/down) from custom PTX → cuBLAS. Eliminates 168 JIT modules on a 24-layer model	3-4 days	✅ Yes — and a perf win on every host
#3 wgpu backward fallback	When CUDA backward fails (OOM/JIT), fall through to wgpu WGSL backward shaders (already exist per #1582). Correctness safety net	1-2 days	❌ Insurance only
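The 168-module figure quoted for Fix #2 follows from simple multiplication. A minimal sketch, assuming one JIT module per backward GEMM kernel per layer (the function and constant names are illustrative, not taken from the codebase):

```rust
/// Backward GEMMs per transformer layer that Fix #2 moves to cuBLAS:
/// q, k, v, o, gate, up, down.
const BACKWARD_GEMMS_PER_LAYER: usize = 7;

/// Number of JIT modules that are no longer compiled at runtime once the
/// backward GEMMs go through cuBLAS.
/// Assumption: one JIT module per kernel per layer.
fn eliminated_jit_modules(num_layers: usize) -> usize {
    BACKWARD_GEMMS_PER_LAYER * num_layers
}

fn main() {
    // 24-layer model, as in the table row for Fix #2.
    println!("{}", eliminated_jit_modules(24));
}
```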

Each fix has its own falsifier: F-BLACKWELL-PTX-001, F-BLACKWELL-CUBLAS-001, F-BLACKWELL-WGPU-FALLBACK-001.
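For Fix #1, the runtime lookup could look roughly like the sketch below. It is not the trueno-gpu implementation: the constant names, the `KernelSource` enum, and the choice to fall back to runtime JIT for targets without a blob are all assumptions made for illustration. The placeholder byte strings stand in for PTX produced at build time.

```rust
// Placeholder bytes standing in for build-time PTX output. In a real build each
// constant would more likely come from `include_bytes!` on a file emitted by a
// build script. All names here are hypothetical.
const PTX_SM_75: &[u8] = b"placeholder PTX for sm_75";
const PTX_SM_89: &[u8] = b"placeholder PTX for sm_89";
const PTX_SM_90: &[u8] = b"placeholder PTX for sm_90";
const PTX_SM_120: &[u8] = b"placeholder PTX for sm_120";
const PTX_SM_121: &[u8] = b"placeholder PTX for sm_121";

/// Where the runtime cache gets a kernel from.
#[derive(Debug, PartialEq)]
enum KernelSource {
    /// Blob embedded in the binary at build time.
    Precompiled(&'static [u8]),
    /// No blob for this target (assumed behavior: compile at runtime).
    RuntimeJit,
}

/// Pick the embedded blob for a compute capability such as 121 for sm_121.
fn kernel_source(sm: u32) -> KernelSource {
    let blob = match sm {
        75 => PTX_SM_75,
        89 => PTX_SM_89,
        90 => PTX_SM_90,
        120 => PTX_SM_120,
        121 => PTX_SM_121,
        _ => return KernelSource::RuntimeJit,
    };
    KernelSource::Precompiled(blob)
}

fn main() {
    for sm in [121, 86] {
        match kernel_source(sm) {
            KernelSource::Precompiled(blob) => {
                println!("sm_{sm}: embedded blob, {} bytes", blob.len())
            }
            KernelSource::RuntimeJit => println!("sm_{sm}: no blob, runtime JIT"),
        }
    }
}
```

Keeping the lookup a pure function of the SM number makes it easy to unit-test without a GPU present.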

Why not just wait for trueno 0.4.36?

The memory rule says upstream trueno will fix Blackwell training, but the timeline is uncertain. This spec ships an in-tree workaround that reuses the backends already in tree, giving us a 7-10 day path to a working gx10 dispatch, independent of the upstream timeline.

Evidence

Live OOM traces from gx10 dispatch attempts (Qwen 0.5B teacher, Qwen 0.5B student):

evidence/distill-phase-3-gb10-oom/dispatch-v6-0.5b-oom.txt
evidence/distill-phase-3-gb10-oom/remote-traces.txt

Pre-warm contract that documents the JIT pressure: crates/aprender-train/src/finetune/classify_pipeline/gpu.rs:56-59 (C-PREWARM-001).

Recommendation

Start with Fix #2 (cuBLAS backward GEMM). Three reasons:

Drop-in replacement with parity test — low risk
Standalone perf win on every host (not just Blackwell)
Reduces JIT pressure by ~40% per layer, likely sufficient for GB10 alone

If #2 doesn't fully unblock GB10, layer in Fix #1 (PTX precomp) to eliminate runtime JIT completely. Fix #3 lands last as a cross-vendor safety net.
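The safety-net behavior of Fix #3 can be sketched as a two-step dispatch: try CUDA, and hand over to wgpu only for the failure classes named in the design table (OOM and JIT). The types and function below are hypothetical and only illustrate the control flow; the real backward passes are represented by closures.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Cuda,
    Wgpu,
}

#[derive(Debug, PartialEq)]
enum BackwardError {
    OutOfMemory,
    JitFailed,
    /// Any other failure. Assumption: these are not retried on wgpu.
    Other(String),
}

/// Run the CUDA backward pass; on OOM or JIT failure, fall through to wgpu.
/// Returns the gradients together with the backend that produced them.
fn backward_with_fallback<C, W>(cuda: C, wgpu: W) -> Result<(Backend, Vec<f32>), BackwardError>
where
    C: FnOnce() -> Result<Vec<f32>, BackwardError>,
    W: FnOnce() -> Result<Vec<f32>, BackwardError>,
{
    match cuda() {
        Ok(grads) => Ok((Backend::Cuda, grads)),
        Err(BackwardError::OutOfMemory) | Err(BackwardError::JitFailed) => {
            wgpu().map(|grads| (Backend::Wgpu, grads))
        }
        Err(other) => Err(other),
    }
}

fn main() {
    // Simulate the GB10 failure mode: CUDA reports OOM, wgpu succeeds.
    let result = backward_with_fallback(
        || Err(BackwardError::OutOfMemory),
        || Ok(vec![0.1, -0.2]),
    );
    println!("{:?}", result);
}
```

Returning the backend alongside the gradients lets callers log when the fallback fired, which matters because a silent fallback could hide a throughput regression.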

Rollout

Phase A (this PR): spec lands
Phase B: PMAT-700-B, Fix #2 (cuBLAS backward GEMM)
Phase C: PMAT-700-C, Fix #1 (PTX precompilation)
Phase D: PMAT-700-D, Fix #3 (wgpu fallback)
Phase E: PMAT-701, re-attempt Phase 3 dispatch on gx10

Test plan

This PR is a spec document; the falsifiers fire on the implementation PRs (PMAT-700-B through -D). Acceptance for Phase E:

STEPS=50 ./scripts/dispatch-distill-phase-3-gx10.sh completes without OOM
final_loss < initial_loss (F-DISTILL-SMOKE-001 dischargeable)
gx10 throughput ≥ 0.5× lambda-vector RTX 4090
No regression on RTX 4090
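The four Phase E criteria can be written as a single check. This is a sketch under stated assumptions: the struct and field names are hypothetical, throughput is in whatever unit the dispatch script reports, and "no regression" is read here as the current RTX 4090 throughput being at least the previously recorded baseline.

```rust
/// Measurements from one Phase E run. All field names are hypothetical.
struct PhaseEResult {
    completed_without_oom: bool,
    initial_loss: f64,
    final_loss: f64,
    gx10_throughput: f64,
    rtx4090_throughput: f64,
    /// Assumed: RTX 4090 throughput recorded before the backend changes.
    rtx4090_baseline_throughput: f64,
}

/// Return the list of acceptance criteria that failed (empty means accepted).
fn phase_e_failures(r: &PhaseEResult) -> Vec<&'static str> {
    let mut failures = Vec::new();
    if !r.completed_without_oom {
        failures.push("dispatch did not complete without OOM");
    }
    if !(r.final_loss < r.initial_loss) {
        failures.push("final_loss is not below initial_loss");
    }
    if r.gx10_throughput < 0.5 * r.rtx4090_throughput {
        failures.push("gx10 throughput below 0.5x RTX 4090");
    }
    if r.rtx4090_throughput < r.rtx4090_baseline_throughput {
        failures.push("RTX 4090 throughput regressed");
    }
    failures
}

fn main() {
    let run = PhaseEResult {
        completed_without_oom: true,
        initial_loss: 2.0,
        final_loss: 1.5,
        gx10_throughput: 60.0,
        rtx4090_throughput: 100.0,
        rtx4090_baseline_throughput: 100.0,
    };
    println!("failures: {:?}", phase_e_failures(&run));
}
```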

🤖 Generated with Claude Code

…sign (PMAT-700)

GB10 (sm_121) hits CUDA_ERROR_OUT_OF_MEMORY at "Block 0 upload" during apr distill --backend cuda, even with a 0.5B teacher on 128GB unified memory. Root cause is PTX JIT cache memory pressure: pre-warming 27+ custom kernels for sm_121 allocates so much VRAM that the subsequent block upload has no headroom. Live evidence in evidence/distill-phase-3-gb10-oom/.

Design surfaces a three-pronged root-cause fix using the existing cuBLAS / PTX / wgpu backends, not a workaround:

Fix #1: PTX precompilation for sm_121 (eliminates JIT entirely)
Fix #2: Migrate backward GEMMs to cuBLAS (eliminates 7 PTX kernels per layer; 168 fewer JIT modules for a 24-layer model)
Fix #3: wgpu backward fallback (cross-vendor safety net)

Each fix has its own falsifier (F-BLACKWELL-PTX-001, F-BLACKWELL-CUBLAS-001, F-BLACKWELL-WGPU-FALLBACK-001) and standalone value. Fix #2 alone is likely sufficient for the immediate gx10 unblock (3-4 day effort, drop-in, also a perf win on every host).

Rollout: spec → Fix #2 → Fix #1 → Fix #3 → re-dispatch Phase 3 on gx10. 7-10 days from spec-land to gx10 dispatch working, vs waiting on upstream trueno 0.4.36 (uncertain timeline).

Evidence:
- evidence/distill-phase-3-gb10-oom/dispatch-v6-0.5b-oom.txt
- evidence/distill-phase-3-gb10-oom/remote-traces.txt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 19, 2026 05:55

noahgift mentioned this pull request May 19, 2026

feat(blackwell): skip PTX GEMM pre-warm when cuBLAS is active (PMAT-700-B) #1804

Merged

5 tasks

Merge branch 'main' into spec/blackwell-backend-fix-pmat-700

f760482

noahgift merged commit d0aa67c into main May 19, 2026
10 checks passed

noahgift deleted the spec/blackwell-backend-fix-pmat-700 branch May 19, 2026 06:42

noahgift merged 2 commits into main from spec/blackwell-backend-fix-pmat-700
