GH-561: Fix CUDA inference — our PTX works via Python but fails via Rust #561

@noahgift

Description

Five-Whys

  1. Why does CUDA give cosine=-0.005 on sm_121? → FP32 accumulation error compounds through 280 operations.
  2. Why does the same PTX via Python give cosine=1.0? → The Python test only tested ONE kernel (RMSNorm) in isolation, not the full 28-layer pipeline.
  3. Why does the full pipeline diverge? → Each kernel has ~0.1% FP32 ordering error that compounds multiplicatively: (1.001)^280 ≈ 1.32.
  4. Why does GPU have different ordering than CPU? → GPU uses 32 parallel threads accumulating partial sums; CPU accumulates sequentially.
  5. Fix: Match CPU accumulation precision by using Kahan compensation in ALL kernels (RMSNorm, GEMV, attention, residual, SwiGLU), not just GEMV.
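The ordering and compounding claims in steps 3 and 4 can be sketched on the CPU. The following is a minimal illustration, not the trueno-gpu kernels: `f32`, `sum_sequential`, and `sum_lanes` are hypothetical helpers, FP32 rounding is emulated with `struct`, and the 32-lane strided split is an assumed stand-in for whatever reduction the real kernels use.

```python
import math
import random
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_sequential(values):
    """CPU-style sum: one accumulator, left to right, rounded to FP32 each step."""
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc


def sum_lanes(values, lanes=32):
    """GPU-style sketch: each lane sums a strided slice, then partials are combined.

    Assumption: a strided split over 32 lanes; a real kernel may reduce differently.
    """
    partials = [sum_sequential(values[i::lanes]) for i in range(lanes)]
    return sum_sequential(partials)


random.seed(0)
data = [f32(random.random()) for _ in range(4096)]

reference = math.fsum(data)      # correctly rounded FP64 sum of the same inputs
seq = sum_sequential(data)       # sequential FP32 order
par = sum_lanes(data)            # 32-lane FP32 order

print(f"sequential error: {seq - reference:+.6e}")
print(f"32-lane error:    {par - reference:+.6e}")

# Step 3 arithmetic: ~0.1% relative error per operation over 280 operations.
compounded = (1.0 + 0.001) ** 280
print(f"compounded factor: {compounded:.4f}")
```

Both orderings are valid FP32 sums of the same data; they only round at different points, which is why the two errors generally differ.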

Evidence

  • Our exact PTX RMSNorm via Python ctypes → cosine=1.0 (single kernel)
  • Our RMSNorm via trueno-gpu → diff=5e-7 per element (CORRECT individually)
  • Q4K GEMV → ~1% per operation (FP32 rounding)
  • Kahan in GEMV only → cosine=-0.005178 (not enough — error is in ALL kernels)
  • Need Kahan in ALL accumulation-heavy kernels
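For reference, Kahan compensation carries a second accumulator that recovers the low-order bits lost at each addition. Below is a minimal CPU sketch with emulated FP32 rounding; `f32`, `naive_sum_f32`, and `kahan_sum_f32` are illustrative names, not functions from trueno-gpu, and the test data (100,000 copies of 0.1) is an assumption chosen to make the rounding loss visible.

```python
import math
import struct


def f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Plain FP32 accumulation, rounding after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def kahan_sum_f32(values):
    """Kahan (compensated) summation with every intermediate rounded to FP32."""
    total = 0.0
    comp = 0.0                              # running estimate of the lost low-order bits
    for v in values:
        y = f32(v - comp)                   # apply the correction to the next input
        t = f32(total + y)                  # big + small: low bits of y are lost here
        comp = f32(f32(t - total) - y)      # recover what was lost (negated)
        total = t
    return total


data = [f32(0.1)] * 100_000
reference = math.fsum(data)                 # correctly rounded FP64 reference

naive_err = abs(naive_sum_f32(data) - reference)
kahan_err = abs(kahan_sum_f32(data) - reference)
print(f"naive FP32 error: {naive_err:.6e}")
print(f"Kahan FP32 error: {kahan_err:.6e}")
```

The same three-line update would have to appear in each accumulation loop (RMSNorm, GEMV, attention, and so on) for the approach described above; this sketch only shows the pattern for a flat sum.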

Contract

gpu-multi-backend-parity-v1.yaml equation jit_compilation_correctness:

cosine(jit_sass(ptx, device), reference_sass(ptx, device)) >= 0.9999

The contract is currently violated. The fix target for the CUDA path is the looser bound cosine >= 0.98.
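A parity gate of this kind could be checked as in the sketch below. The function names and the example vectors are hypothetical; only the two thresholds come from this issue, and the actual contract test harness may compute the comparison differently.

```python
import math

CONTRACT_THRESHOLD = 0.9999   # from the jit_compilation_correctness equation
FIX_TARGET = 0.98             # interim target for the CUDA path in this issue


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = math.fsum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(math.fsum(x * x for x in a))
    norm_b = math.sqrt(math.fsum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def passes_gate(gpu_output, reference_output, threshold):
    """True when the GPU output is within the cosine threshold of the reference."""
    return cosine_similarity(gpu_output, reference_output) >= threshold


# Hypothetical outputs, for illustration only.
reference_logits = [0.5, -1.25, 2.0, 0.75]
close_logits = [0.5001, -1.2498, 2.0003, 0.7499]
unrelated_logits = [2.0, 0.5, 0.25, -1.0]

print(passes_gate(close_logits, reference_logits, CONTRACT_THRESHOLD))
print(passes_gate(unrelated_logits, reference_logits, FIX_TARGET))
```

A cosine near zero, like the -0.005 reported here, means the two output vectors are close to orthogonal, so the gate fails at either threshold.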

Acceptance Criteria

  • CUDA parity gate passes (cosine >= 0.98) on sm_121
  • No wgpu fallback needed for correct GPU inference
  • All 18 contract tests still pass
