realizar: Q4K CUDA kernel produces garbage for hidden_dim=3584 (Qwen2.5-7B) #374

@noahgift

Summary

The Q4K CUDA dequantization/matmul kernels produce incorrect results for Qwen2.5-7B's dimensions (hidden_dim=3584, num_heads=28, num_kv_heads=4). The parity gate correctly blocks GPU inference, forcing CPU fallback at ~2 min/problem instead of ~5 sec/problem.

Reproduction

apr run checkpoints/qwen2.5-coder-7b-instruct-q4k.apr --prompt "hello" --max-tokens 8 --verbose

Output:

PARITY-GATE FAILED: GPU computes a DIFFERENT function than CPU.

Cosine similarity: -0.004439 (required: ≥0.99)
CPU argmax: 334 | GPU argmax: 112348
Max absolute logit difference: 13151.0156

This model's dimensions (hidden=3584, heads=28, kv_heads=4) cause
GPU forward pass to diverge from CPU. The GPU CANNOT serve this model.

Analysis

  • 1.5B model (hidden_dim=1536) works fine on GPU
  • 7B model (hidden_dim=3584) fails parity gate
  • Cosine similarity is -0.004 (essentially random noise, required ≥0.99)
  • GPU picks token 112348 vs CPU's correct token 334
  • Max absolute logit difference: 13151 — massive divergence
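The three numbers in the gate output (cosine similarity, argmax, max absolute logit difference) can be reproduced from any pair of logit vectors. Below is a minimal sketch, assuming the gate compares one CPU logit vector against one GPU logit vector and passes on the cosine threshold alone. The function names `parity_metrics` and `passes_parity` are hypothetical and are not taken from realizar.

```python
import math


def parity_metrics(cpu_logits, gpu_logits):
    """Return (cosine similarity, CPU argmax, GPU argmax, max abs difference)."""
    if len(cpu_logits) != len(gpu_logits):
        raise ValueError("logit vectors must have the same length")
    dot = sum(c * g for c, g in zip(cpu_logits, gpu_logits))
    norm_cpu = math.sqrt(sum(c * c for c in cpu_logits))
    norm_gpu = math.sqrt(sum(g * g for g in gpu_logits))
    # Guard against all-zero vectors so we never divide by zero.
    cosine = dot / (norm_cpu * norm_gpu) if norm_cpu and norm_gpu else 0.0
    cpu_argmax = max(range(len(cpu_logits)), key=cpu_logits.__getitem__)
    gpu_argmax = max(range(len(gpu_logits)), key=gpu_logits.__getitem__)
    max_abs_diff = max(abs(c - g) for c, g in zip(cpu_logits, gpu_logits))
    return cosine, cpu_argmax, gpu_argmax, max_abs_diff


def passes_parity(cpu_logits, gpu_logits, min_cosine=0.99):
    """Assumed pass rule: cosine similarity at or above the threshold."""
    cosine, _, _, _ = parity_metrics(cpu_logits, gpu_logits)
    return cosine >= min_cosine
```

A cosine near zero, as reported here, means the two logit vectors are close to orthogonal. That points to a structural fault such as reading the wrong weights, rather than accumulated rounding error.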

Suspected Root Cause

Q4K super-blocks hold 256 elements, and hidden_dim = 3584 = 14 × 256, so block alignment is not the problem. The bug is more likely in the GQA (Grouped Query Attention) head-dimension stride: head_dim = 3584 / 28 = 128 and kv_dim = 4 × 128 = 512. The CUDA kernel may assume that num_kv_heads divides evenly into a tile size, or the Q/K/V weight slicing may use the wrong strides for GQA with a repeat factor of 7 (num_heads / num_kv_heads = 28 / 4).
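The arithmetic above can be checked with a short sketch. It assumes the conventional GQA layout in which consecutive query heads share one KV head (query head h reads KV head h // group_size). Whether the realizar kernel uses this layout is not confirmed by this issue. The helper names `gqa_layout`, `kv_head_for`, and `kv_offset` are hypothetical.

```python
Q4K_SUPER_BLOCK = 256  # elements per Q4K super-block, as stated in this issue


def gqa_layout(hidden_dim, num_heads, num_kv_heads):
    """Derive the dimensions a Q/K/V slicing kernel has to agree on."""
    if hidden_dim % num_heads or num_heads % num_kv_heads:
        raise ValueError("heads must divide hidden_dim, kv_heads must divide heads")
    head_dim = hidden_dim // num_heads
    return {
        "head_dim": head_dim,
        "q_dim": num_heads * head_dim,
        "kv_dim": num_kv_heads * head_dim,
        "group_size": num_heads // num_kv_heads,
        "super_blocks_per_row": hidden_dim // Q4K_SUPER_BLOCK,
        "block_aligned": hidden_dim % Q4K_SUPER_BLOCK == 0,
    }


def kv_head_for(q_head, num_heads, num_kv_heads):
    """KV head shared by query head q_head (assumed contiguous grouping)."""
    return q_head // (num_heads // num_kv_heads)


def kv_offset(q_head, num_heads, num_kv_heads, head_dim):
    """Element offset of that KV head inside one token's K (or V) vector."""
    return kv_head_for(q_head, num_heads, num_kv_heads) * head_dim


layout = gqa_layout(hidden_dim=3584, num_heads=28, num_kv_heads=4)
```

For the 7B dimensions, `layout` reports head_dim 128, kv_dim 512, group_size 7, and 14 super-blocks per row. Note that kv_dim = 512 differs from q_dim = 3584, so any kernel path that reuses the Q row stride for K or V would read the wrong weights.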

Impact

HumanEval evaluation takes ~6 hours on CPU instead of ~30 minutes on GPU.

Acceptance Criteria

  • apr run on Qwen2.5-7B-Instruct Q4K passes parity gate
  • GPU cosine similarity ≥0.99 for hidden_dim=3584 with GQA ratio 7
  • Provable contract: parity gate test covers all supported GQA ratios (1, 2, 4, 7, 8)
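One way the ratio sweep in the last criterion could be structured is sketched below. It is CPU-only and compares two reference implementations of the per-head Q·K score, so it shows only the shape of such a test. In a real gate, `reference` and `candidate` would be the CPU and GPU forward passes. All names here are hypothetical, and the contiguous head grouping is an assumption.

```python
import math
import random

SUPPORTED_GQA_RATIOS = (1, 2, 4, 7, 8)  # ratios listed in the acceptance criteria


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def scores_expanded(q, k, num_heads, num_kv_heads, head_dim):
    """Reference: repeat each KV head group times, then index by query head."""
    group = num_heads // num_kv_heads
    expanded = []
    for kv in range(num_kv_heads):
        expanded.extend(k[kv * head_dim:(kv + 1) * head_dim] * group)
    return [
        sum(q[h * head_dim + i] * expanded[h * head_dim + i] for i in range(head_dim))
        for h in range(num_heads)
    ]


def scores_strided(q, k, num_heads, num_kv_heads, head_dim):
    """Candidate: no expansion; compute the shared KV head offset per query head."""
    group = num_heads // num_kv_heads
    out = []
    for h in range(num_heads):
        base = (h // group) * head_dim
        out.append(sum(q[h * head_dim + i] * k[base + i] for i in range(head_dim)))
    return out


def sweep(reference, candidate, ratios=SUPPORTED_GQA_RATIOS,
          num_kv_heads=4, head_dim=8, seed=0):
    """Return {ratio: cosine similarity between reference and candidate scores}."""
    rng = random.Random(seed)
    results = {}
    for ratio in ratios:
        num_heads = num_kv_heads * ratio
        q = [rng.uniform(-1, 1) for _ in range(num_heads * head_dim)]
        k = [rng.uniform(-1, 1) for _ in range(num_kv_heads * head_dim)]
        results[ratio] = cosine(
            reference(q, k, num_heads, num_kv_heads, head_dim),
            candidate(q, k, num_heads, num_kv_heads, head_dim),
        )
    return results


results = sweep(scores_expanded, scores_strided)
failing = [ratio for ratio, cos in results.items() if cos < 0.99]
```

Sweeping the ratio as a parameter, rather than testing only the model sizes at hand, lets a stride bug that appears at one ratio surface before a model with that ratio is loaded.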
