Root Cause (Five-Whys)
CORRECTNESS-011 layer trace output:
PARITY-GATE FAILED: GPU computes a DIFFERENT function than CPU.
Cosine similarity: -0.005182 (required: ≥0.98)
CPU argmax: 334 | GPU argmax: 8127
Max absolute logit difference: 19.5080
- Why GPU ≠ CPU? Cosine similarity -0.005 (completely uncorrelated, not FP rounding)
- Why completely wrong? GPU forward pass computes an entirely different function
- Why different function? Model dimensions: hidden=3584, heads=28, kv_heads=4 (GQA)
- Which kernel? GQA attention with 28:4 head ratio (7 query heads per KV head)
- Root cause: GQA CUDA kernel handles non-power-of-2 head ratios incorrectly
Diagnosis
- This is NOT FP rounding (cosine would be ~0.999)
- This is NOT a driver issue (driver 590 same result as 580)
- This is a logic bug in the GQA attention kernel for head_ratio=7
- The cosine of -0.005 means GPU output is essentially random relative to CPU
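The distinction between rounding noise and a logic bug can be illustrated with a small sketch of the three metrics the trace reports (cosine similarity, argmax, max absolute logit difference). This is not the CORRECTNESS-011 implementation: the function names, the vector length of 8192, and the synthetic logits are all assumptions made for illustration. Only the 0.98 threshold comes from the trace output.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def argmax(v):
    """Index of the largest element."""
    return max(range(len(v)), key=v.__getitem__)


def parity_report(cpu_logits, gpu_logits, min_cosine=0.98):
    """Hypothetical parity gate: same three metrics as the trace output."""
    cos = cosine_similarity(cpu_logits, gpu_logits)
    return {
        "cosine": cos,
        "cpu_argmax": argmax(cpu_logits),
        "gpu_argmax": argmax(gpu_logits),
        "max_abs_diff": max(abs(c - g) for c, g in zip(cpu_logits, gpu_logits)),
        "passed": cos >= min_cosine,
    }


# Synthetic data only; the vector length and scales are arbitrary assumptions.
rng = random.Random(0)
cpu = [rng.gauss(0.0, 4.0) for _ in range(8192)]
rounded = [x + rng.gauss(0.0, 1e-3) for x in cpu]       # rounding-scale noise
unrelated = [rng.gauss(0.0, 4.0) for _ in range(8192)]  # a different function

print(parity_report(cpu, rounded))
print(parity_report(cpu, unrelated))
```

With rounding-scale perturbations the cosine stays very close to 1 and the gate passes. With an independent vector the cosine sits near 0 and the gate fails, which is the pattern seen in this report.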
Model Dimensions
- hidden_dim: 3584
- num_heads: 28
- num_kv_heads: 4
- head_dim: 128 (3584/28)
- head_ratio: 7 (28/4) — non-power-of-2
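To make the head_ratio concern concrete, the sketch below compares the standard GQA mapping (query head `q` reads KV head `q // head_ratio`) with one hypothetical faulty mapping that replaces the division with a bit shift. The shift variant is an assumption chosen purely to illustrate how a power-of-2 shortcut can break at ratio 7. It is not taken from the trueno-gpu source, and the actual defect may be different.

```python
NUM_HEADS = 28
NUM_KV_HEADS = 4
HEAD_RATIO = NUM_HEADS // NUM_KV_HEADS  # 7 query heads per KV head


def kv_head_reference(q_head, head_ratio=HEAD_RATIO):
    """Standard GQA mapping: consecutive groups of head_ratio query heads share a KV head."""
    return q_head // head_ratio


def kv_head_shift(q_head, head_ratio=HEAD_RATIO):
    """Hypothetical faulty mapping: division replaced by a right shift.

    Equivalent to integer division only when head_ratio is a power of 2.
    """
    shift = head_ratio.bit_length() - 1  # floor(log2(head_ratio))
    return q_head >> shift


reference = [kv_head_reference(h) for h in range(NUM_HEADS)]
shifted = [kv_head_shift(h) for h in range(NUM_HEADS)]

# Query heads that would read the wrong KV head under the shift mapping.
mismatches = [h for h in range(NUM_HEADS) if reference[h] != shifted[h]]
# Query heads whose computed KV index falls outside the valid range.
out_of_range = [h for h in range(NUM_HEADS) if shifted[h] >= NUM_KV_HEADS]

print("reference:", reference)
print("shifted:  ", shifted)
print("mismatched query heads:", mismatches)
print("out-of-range KV indices for query heads:", out_of_range)
```

Under this assumed bug, most query heads attend to the wrong KV head and some index past the last KV head, which would be consistent with output that is uncorrelated with the CPU result. A ratio of 4 or 8 would not expose the problem, so a regression test should cover non-power-of-2 ratios explicitly.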
Hardware
NVIDIA GB10 (Blackwell sm_121), driver 590.48.01, CUDA 13.1
Provable Contract
ptx-target-parity-v1.yaml: the target_parity equation is violated.
gpu-context-health-v1.yaml: the GPU produces wrong results, not just context poisoning.
Fix Required
Fix the GQA attention kernel in trueno-gpu for head_ratio=7 on sm_121.
Verify with CORRECTNESS-011 layer trace: cosine must be ≥0.98.