GPU parity FAILED: cosine=-0.005 for Qwen2.5-7B GQA (hidden=3584, heads=28, kv=4) on sm_121 #559

@noahgift

Root Cause (Five-Whys)

CORRECTNESS-011 layer trace output:

PARITY-GATE FAILED: GPU computes a DIFFERENT function than CPU.
Cosine similarity: -0.005182 (required: ≥0.98)
CPU argmax: 334 | GPU argmax: 8127
Max absolute logit difference: 19.5080
  1. Why GPU ≠ CPU? Cosine similarity is -0.005: the outputs are completely uncorrelated, which rules out FP rounding
  2. Why completely wrong? The GPU forward pass computes an entirely different function
  3. Why a different function? The model uses GQA with hidden=3584, heads=28, kv_heads=4
  4. Which kernel? GQA attention with a 28:4 head ratio (7 query heads per KV head)
  5. Root cause: the GQA CUDA kernel handles non-power-of-2 head ratios incorrectly
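
To make step 5 concrete, here is a minimal sketch of the query-head to KV-head mapping that GQA requires, next to one hypothetical way a kernel could get it wrong. The function names are illustrative and are not taken from trueno-gpu; the shift-based shortcut is an assumed bug pattern chosen because it is correct only for power-of-2 ratios, not a claim about what the actual kernel does.

```python
def kv_head_for_query(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """General GQA mapping: consecutive groups of query heads share one KV head."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    head_ratio = num_heads // num_kv_heads
    return q_head // head_ratio


def kv_head_shift_shortcut(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """HYPOTHETICAL bug pattern: replace the integer division with a right shift.

    This matches the general mapping only when head_ratio is a power of two.
    """
    head_ratio = num_heads // num_kv_heads
    shift = head_ratio.bit_length() - 1  # floor(log2(head_ratio))
    return q_head >> shift


# Compare both mappings for the dimensions in this issue (28 heads, 4 KV heads).
correct = [kv_head_for_query(h, 28, 4) for h in range(28)]
shortcut = [kv_head_shift_shortcut(h, 28, 4) for h in range(28)]
mismatched = [h for h in range(28) if correct[h] != shortcut[h]]
print(f"{len(mismatched)} of 28 query heads read the wrong KV head")
```

With head_ratio=7 the shortcut sends 23 of the 28 query heads to the wrong KV head, and some of the indices it produces (4 to 6) do not exist at all. An indexing error of this kind would be enough to decorrelate the logits, although other causes are possible.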

Diagnosis

  • This is NOT FP rounding (rounding differences alone would leave the cosine near 0.999)
  • This is NOT a driver issue (driver 590 gives the same result as driver 580)
  • This is a logic bug in the GQA attention kernel for head_ratio=7
  • A cosine of -0.005 means the GPU logits are essentially uncorrelated with the CPU logits
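
The three numbers in the trace (cosine similarity, argmax, max absolute logit difference) can be reproduced with a few lines of code. The sketch below is an assumption about how such a gate could be computed, not the CORRECTNESS-011 implementation; `parity_report` and its field names are hypothetical, and only the 0.98 threshold comes from the trace.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as uncorrelated
    return dot / (norm_a * norm_b)


def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def parity_report(cpu_logits, gpu_logits, min_cosine=0.98):
    """Summarize CPU vs GPU agreement; min_cosine=0.98 mirrors the trace above."""
    if len(cpu_logits) != len(gpu_logits):
        raise ValueError("logit vectors must have the same length")
    cosine = cosine_similarity(cpu_logits, gpu_logits)
    return {
        "cosine": cosine,
        "cpu_argmax": argmax(cpu_logits),
        "gpu_argmax": argmax(gpu_logits),
        "max_abs_diff": max(abs(c - g) for c, g in zip(cpu_logits, gpu_logits)),
        "passed": cosine >= min_cosine,
    }
```

Rounding noise perturbs each logit slightly and keeps the cosine close to 1, whereas two unrelated vectors in a large vocabulary space land near 0. That is why a value of -0.005 points to a logic error rather than a precision problem.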

Model Dimensions

  • hidden_dim: 3584
  • num_heads: 28
  • num_kv_heads: 4
  • head_dim: 128 (3584/28)
  • head_ratio: 7 (28/4) — non-power-of-2
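
The derived values above follow from the three base dimensions. As a quick check, the following sketch recomputes them and tests whether the head ratio is a power of two; `gqa_dims` and `kv_proj_width` are hypothetical names introduced here for illustration.

```python
def gqa_dims(hidden_dim: int, num_heads: int, num_kv_heads: int) -> dict:
    """Derive GQA shape parameters from the base model dimensions."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_dim = hidden_dim // num_heads
    head_ratio = num_heads // num_kv_heads
    return {
        "head_dim": head_dim,
        "head_ratio": head_ratio,
        # A positive integer is a power of two iff it has exactly one set bit.
        "ratio_is_power_of_two": (head_ratio & (head_ratio - 1)) == 0,
        # Width of the K (or V) projection output: one head_dim slice per KV head.
        "kv_proj_width": num_kv_heads * head_dim,
    }


dims = gqa_dims(hidden_dim=3584, num_heads=28, num_kv_heads=4)
print(dims)
# {'head_dim': 128, 'head_ratio': 7, 'ratio_is_power_of_two': False, 'kv_proj_width': 512}
```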

Hardware

NVIDIA GB10 (Blackwell sm_121), driver 590.48.01, CUDA 13.1

Provable Contract

The target_parity equation in ptx-target-parity-v1.yaml is violated.
gpu-context-health-v1.yaml: the GPU produces wrong results, not just context poisoning.

Fix Required

Fix the GQA attention kernel in trueno-gpu for head_ratio=7 on sm_121.
Verify with CORRECTNESS-011 layer trace: cosine must be ≥0.98.
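
When validating a fix, a small CPU oracle for GQA attention can help isolate the kernel from the rest of the forward pass. The sketch below is an assumption about what such an oracle could look like, not part of trueno-gpu or CORRECTNESS-011: it computes single-token GQA attention twice, once by indexing the shared KV head and once by explicitly repeating each KV head, and checks that the two agree for a given head ratio. All names and the tiny dimensions are illustrative.

```python
import math
import random


def attend(q, keys, values):
    """Scaled dot-product attention for one query vector over a short sequence."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def gqa_indexed(queries, kv_keys, kv_values):
    """GQA by index mapping: query head h reads KV head h // head_ratio."""
    ratio = len(queries) // len(kv_keys)
    return [attend(q, kv_keys[h // ratio], kv_values[h // ratio])
            for h, q in enumerate(queries)]


def gqa_repeated(queries, kv_keys, kv_values):
    """GQA by expanding each KV head head_ratio times, then plain multi-head attention."""
    ratio = len(queries) // len(kv_keys)
    keys = [k for k in kv_keys for _ in range(ratio)]
    values = [v for v in kv_values for _ in range(ratio)]
    return [attend(q, keys[h], values[h]) for h, q in enumerate(queries)]


def check_ratio(num_heads, num_kv_heads, head_dim=8, seq_len=5, seed=0):
    """Return True if both formulations agree on random inputs."""
    rng = random.Random(seed)

    def vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(head_dim)]

    queries = [vec() for _ in range(num_heads)]
    kv_keys = [[vec() for _ in range(seq_len)] for _ in range(num_kv_heads)]
    kv_values = [[vec() for _ in range(seq_len)] for _ in range(num_kv_heads)]
    return gqa_indexed(queries, kv_keys, kv_values) == gqa_repeated(queries, kv_keys, kv_values)
```

Calling `check_ratio(28, 4)` exercises the head_ratio=7 case from this issue, and running it alongside power-of-2 ratios such as `check_ratio(32, 4)` gives a baseline that a per-head GPU comparison could be measured against.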
