Summary
The Q4K CUDA dequantization/matmul kernels produce incorrect results for Qwen2.5-7B's dimensions (hidden_dim=3584, num_heads=28, num_kv_heads=4). The parity gate correctly blocks GPU inference, forcing CPU fallback at ~2 min/problem instead of ~5 sec/problem.
Reproduction
apr run checkpoints/qwen2.5-coder-7b-instruct-q4k.apr --prompt "hello" --max-tokens 8 --verbose
Output:
PARITY-GATE FAILED: GPU computes a DIFFERENT function than CPU.
Cosine similarity: -0.004439 (required: ≥0.99)
CPU argmax: 334 | GPU argmax: 112348
Max absolute logit difference: 13151.0156
This model's dimensions (hidden=3584, heads=28, kv_heads=4) cause
GPU forward pass to diverge from CPU. The GPU CANNOT serve this model.
Analysis
- 1.5B model (hidden_dim=1536) works fine on GPU
- 7B model (hidden_dim=3584) fails parity gate
- Cosine similarity is -0.004 (essentially random noise, required ≥0.99)
- GPU picks token 112348 vs CPU's correct token 334
- Max absolute logit difference: 13151 — massive divergence
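The three figures in the gate report (cosine similarity, argmax per backend, maximum absolute logit difference) can be recomputed from two logit vectors. The following is a minimal sketch, not the gate's actual implementation: the names `parity_metrics` and `passes_gate` are hypothetical, and only the 0.99 cosine threshold comes from the output above. Whether the real gate also requires matching argmax is not known from this report.

```python
import math


def parity_metrics(cpu_logits, gpu_logits):
    """Compare two logit vectors of equal length (hypothetical helper)."""
    if len(cpu_logits) != len(gpu_logits):
        raise ValueError("logit vectors must have the same length")

    dot = sum(c * g for c, g in zip(cpu_logits, gpu_logits))
    norm_cpu = math.sqrt(sum(c * c for c in cpu_logits))
    norm_gpu = math.sqrt(sum(g * g for g in gpu_logits))
    # Treat a zero-norm vector as "no similarity" to avoid dividing by zero.
    cosine = dot / (norm_cpu * norm_gpu) if norm_cpu and norm_gpu else 0.0

    indices = range(len(cpu_logits))
    return {
        "cosine": cosine,
        "cpu_argmax": max(indices, key=cpu_logits.__getitem__),
        "gpu_argmax": max(indices, key=gpu_logits.__getitem__),
        "max_abs_diff": max(abs(c - g) for c, g in zip(cpu_logits, gpu_logits)),
    }


def passes_gate(metrics, min_cosine=0.99):
    """Assumed pass rule: cosine similarity at or above the threshold."""
    return metrics["cosine"] >= min_cosine
```

A cosine near zero, as reported here, means the GPU logits are close to orthogonal to the CPU logits, which points to a structural error (wrong rows, strides, or heads) rather than accumulated rounding error.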
Suspected Root Cause
Q4K super-blocks are 256 elements, and hidden_dim = 3584 = 14 × 256, so super-block alignment is unlikely to be the problem. The bug is more likely in the GQA (Grouped Query Attention) head-dimension stride: head_dim = 3584 / 28 = 128 and kv_dim = 4 × 128 = 512. The CUDA kernel may assume that num_kv_heads divides evenly into a tile size, or the Q/K/V weight slicing may use the wrong strides for GQA with a repeat factor of 7 (num_heads / num_kv_heads = 28 / 4).
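The arithmetic above can be checked with a short sketch. The dimensions come from this report. The mapping of query head h to KV head h // repeat is the common GQA convention and is assumed here, not confirmed for this kernel. The helper names are hypothetical.

```python
HIDDEN_DIM = 3584
NUM_HEADS = 28
NUM_KV_HEADS = 4
Q4K_SUPER_BLOCK = 256  # elements per Q4K super-block

head_dim = HIDDEN_DIM // NUM_HEADS    # 128
kv_dim = NUM_KV_HEADS * head_dim      # 512
repeat = NUM_HEADS // NUM_KV_HEADS    # 7 query heads share each KV head


def kv_head_for(q_head):
    """KV head serving a query head (assumed convention: integer division)."""
    return q_head // repeat


def kv_offset(q_head):
    """Element offset of that KV head inside one K or V row of length kv_dim."""
    return kv_head_for(q_head) * head_dim


# Both row lengths are whole multiples of the super-block size.
blocks_per_hidden_row = HIDDEN_DIM // Q4K_SUPER_BLOCK  # 14
blocks_per_kv_row = kv_dim // Q4K_SUPER_BLOCK          # 2

# A repeat factor of 7 is not a power of two, so any index math that
# relies on shifts or masks for the head mapping would be wrong here.
repeat_is_power_of_two = (repeat & (repeat - 1)) == 0
```

With these values, query heads 0 to 6 read KV head 0, heads 7 to 13 read KV head 1, and so on. A kernel that computed the K/V offset from the query head index directly, or used a stride of hidden_dim where kv_dim is required, would read the wrong rows and could produce output as uncorrelated as the parity report shows.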
Impact
HumanEval evaluation takes ~6 hours on CPU instead of ~30 minutes on GPU.
Acceptance Criteria
apr run on Qwen2.5-7B-Instruct Q4K passes the parity gate