Five-Whys
- Why does CUDA give cosine=-0.005 on sm_121? → FP32 accumulation error compounds through 280 operations.
- Why does the same PTX via Python give cosine=1.0? → The Python test exercised only ONE kernel (RMSNorm) in isolation, not the full 28-layer pipeline.
- Why does the full pipeline diverge? → Each kernel has ~0.1% FP32 ordering error that compounds multiplicatively: (1.001)^280 ≈ 1.32, i.e. roughly 32% accumulated relative error in the worst case.
- Why does the GPU have a different ordering than the CPU? → The GPU accumulates partial sums across 32 parallel threads; the CPU accumulates sequentially.
- Fix: Match CPU accumulation precision by using Kahan compensated summation in ALL kernels (RMSNorm, GEMV, attention, residual, SwiGLU), not just GEMV.
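The ordering effect and the proposed fix can be illustrated on the CPU. The following is a minimal sketch, not the trueno-gpu kernels: it emulates FP32 rounding with `struct`, and the helper names (`f32`, `naive_sum_f32`, `lane_sum_f32`, `kahan_sum_f32`) are hypothetical. The 32-lane layout with a tree reduction is an assumption about how a warp-level sum is typically structured.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Sequential FP32 accumulation (CPU-style ordering)."""
    acc = 0.0
    for v in values:
        acc = f32(acc + f32(v))
    return acc


def lane_sum_f32(values, lanes=32):
    """32 strided partial sums followed by a tree reduction (warp-style ordering)."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] = f32(partial[i % lanes] + f32(v))
    width = lanes
    while width > 1:
        width //= 2
        for j in range(width):
            partial[j] = f32(partial[j] + partial[j + width])
    return partial[0]


def kahan_sum_f32(values):
    """Kahan compensated summation with every intermediate rounded to FP32."""
    acc = 0.0
    comp = 0.0  # running estimate of the low-order bits lost so far
    for v in values:
        y = f32(f32(v) - comp)
        t = f32(acc + y)
        comp = f32(f32(t - acc) - y)
        acc = t
    return acc


values = [0.1] * 100_000
# Reference: exact sum of the FP32-rounded inputs, computed in FP64.
exact = math.fsum(f32(v) for v in values)
for name, fn in [("naive", naive_sum_f32), ("lanes", lane_sum_f32), ("kahan", kahan_sum_f32)]:
    print(name, abs(fn(values) - exact))
```

The three orderings produce different FP32 results from identical inputs. The compensated sum stays within a few FP32 ulps of the reference regardless of ordering, which is why applying it in every accumulation-heavy kernel is expected to bring the GPU and CPU results closer together.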
Evidence
- Our exact PTX RMSNorm via Python ctypes → cosine=1.0 (single kernel)
- Our RMSNorm via trueno-gpu → diff=5e-7 per element (CORRECT individually)
- Q4K GEMV → ~1% per operation (FP32 rounding)
- Kahan in GEMV only → cosine=-0.005178 (not enough; the error is in ALL kernels)
- Need Kahan in ALL accumulation-heavy kernels
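As a rough check on the figures above, the compounding model from the Five-Whys can be evaluated for both per-operation error levels mentioned in this issue (~0.1% and ~1%). This is only a sketch of the worst-case multiplicative model; real rounding errors partially cancel, and `compounded_error` is a hypothetical helper name.

```python
def compounded_error(per_op_error: float, n_ops: int) -> float:
    """Worst-case growth factor if every operation scales the error by (1 + e)."""
    return (1.0 + per_op_error) ** n_ops


# 280 operations, as stated in the Five-Whys.
print(compounded_error(0.001, 280))  # about 1.32
print(compounded_error(0.01, 280))   # about 16.2
```

Under this model, a ~1% per-operation error in the GEMV path dominates a ~0.1% error elsewhere by an order of magnitude, which is consistent with the GEMV being the first kernel targeted.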
Contract
gpu-multi-backend-parity-v1.yaml equation jit_compilation_correctness:
cosine(jit_sass(ptx, device), reference_sass(ptx, device)) >= 0.9999
Currently violated. Fix target: cosine >= 0.98 for CUDA path.
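For reference, a parity check of this shape could look like the sketch below. It is not the actual contract test: `cosine_similarity` and `parity_gate` are hypothetical names, and the two constants simply restate the thresholds quoted in this issue.

```python
import math

CONTRACT_THRESHOLD = 0.9999  # from jit_compilation_correctness
FIX_TARGET = 0.98            # interim target for the CUDA path


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, accumulated in FP64."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def parity_gate(gpu_output, reference_output, threshold=FIX_TARGET):
    """Return True when the GPU output is close enough to the reference."""
    return cosine_similarity(gpu_output, reference_output) >= threshold
```

The reduction is done with `math.fsum` so that the gate itself does not add accumulation error to the value it is measuring.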
Acceptance Criteria
- CUDA parity gate passes (cosine >= 0.98) on sm_121
- No wgpu fallback needed for correct GPU inference
- All 18 contract tests still pass