realizar: Q4K CUDA kernel produces garbage for hidden_dim=3584 (Qwen2.5-7B) #374

## Summary

The Q4K CUDA dequantization/matmul kernels produce incorrect logits for Qwen2.5-7B's dimensions (`hidden_dim=3584, num_heads=28, num_kv_heads=4`). The parity gate correctly blocks GPU inference, which forces a CPU fallback at ~2 min/problem instead of ~5 sec/problem.

## Reproduction

```bash
apr run checkpoints/qwen2.5-coder-7b-instruct-q4k.apr --prompt "hello" --max-tokens 8 --verbose
```

Output:
```
PARITY-GATE FAILED: GPU computes a DIFFERENT function than CPU.

Cosine similarity: -0.004439 (required: ≥0.99)
CPU argmax: 334 | GPU argmax: 112348
Max absolute logit difference: 13151.0156

This model's dimensions (hidden=3584, heads=28, kv_heads=4) cause
GPU forward pass to diverge from CPU. The GPU CANNOT serve this model.
```

## Analysis

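- The 1.5B model (`hidden_dim=1536`) works correctly on GPU.
- The 7B model (`hidden_dim=3584`) fails the parity gate.
- Cosine similarity is -0.004439 against a required ≥0.99, so the GPU logits are essentially uncorrelated with the CPU logits rather than slightly off.
- The GPU picks token 112348; the CPU picks the correct token, 334.
- Max absolute logit difference is 13151.0156, a divergence far larger than rounding error would explain.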
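
For readers who want to recompute the three numbers in the gate output from dumped logits, here is a minimal sketch in plain Python. The function names and the `min_cosine` parameter are hypothetical and are not taken from realizar's code; the pass condition assumes the gate thresholds only on cosine similarity, as the `required: ≥0.99` line suggests.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as uncorrelated
    return dot / (norm_a * norm_b)

def argmax(values):
    """Index of the largest element (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def max_abs_diff(a, b):
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def parity_report(cpu_logits, gpu_logits, min_cosine=0.99):
    """Summarize CPU/GPU agreement; min_cosine is an assumed threshold."""
    if len(cpu_logits) != len(gpu_logits):
        raise ValueError("logit vectors must have the same length")
    cosine = cosine_similarity(cpu_logits, gpu_logits)
    return {
        "cosine": cosine,
        "cpu_argmax": argmax(cpu_logits),
        "gpu_argmax": argmax(gpu_logits),
        "max_abs_diff": max_abs_diff(cpu_logits, gpu_logits),
        "passed": cosine >= min_cosine,
    }
```

A cosine near zero together with differing argmax values, as in the output above, points to a structurally wrong computation rather than accumulated quantization error.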

## Suspected Root Cause

Q4K super-blocks are 256 elements and `hidden_dim = 3584 = 14 × 256`, so super-block alignment is not the problem. The bug is more likely in the GQA (Grouped Query Attention) head dimension stride: `head_dim = 3584 / 28 = 128` and `kv_dim = 4 × 128 = 512`. The CUDA kernel may assume that `num_kv_heads` divides evenly into a tile size, or the Q/K/V weight slicing may use the wrong strides for GQA with a repeat factor of 7 (`num_heads / num_kv_heads = 28 / 4 = 7`).
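
The arithmetic above can be checked with a short sketch. The helper names are hypothetical, and the head mapping assumes the common GQA convention in which consecutive query heads share one KV head; whether the CUDA kernel follows that convention is exactly what needs verifying.

```python
Q4K_SUPER_BLOCK = 256  # elements per Q4K super-block, as stated above

def gqa_layout(hidden_dim, num_heads, num_kv_heads):
    """Derive the GQA dimensions and Q4K alignment for one model shape."""
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    head_dim = hidden_dim // num_heads
    kv_dim = num_kv_heads * head_dim
    return {
        "head_dim": head_dim,
        "kv_dim": kv_dim,
        "group_size": num_heads // num_kv_heads,  # GQA repeat factor
        "hidden_super_blocks": hidden_dim // Q4K_SUPER_BLOCK,
        "hidden_aligned": hidden_dim % Q4K_SUPER_BLOCK == 0,
        "kv_aligned": kv_dim % Q4K_SUPER_BLOCK == 0,
    }

def kv_head_for_query_head(q_head, num_heads, num_kv_heads):
    """KV head used by a query head (assumed: consecutive heads share)."""
    return q_head // (num_heads // num_kv_heads)

def kv_offset_for_query_head(q_head, head_dim, num_heads, num_kv_heads):
    """Element offset of that KV head inside one token's kv_dim-wide row."""
    return kv_head_for_query_head(q_head, num_heads, num_kv_heads) * head_dim

layout = gqa_layout(hidden_dim=3584, num_heads=28, num_kv_heads=4)
print(layout)
```

With a group size of 7, query heads 0 to 6 read KV head 0 and query head 7 is the first to read KV head 1. A kernel that hard-codes a power-of-two group size, or that strides K/V rows by `hidden_dim` instead of `kv_dim`, would read the wrong slices for this shape.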

## Impact

HumanEval evaluation takes ~6 hours on CPU instead of ~30 minutes on GPU.

## Acceptance Criteria

- [ ] `apr run` on Qwen2.5-7B-Instruct Q4K passes parity gate
- [ ] GPU cosine similarity ≥0.99 for hidden_dim=3584 with GQA ratio 7
- [ ] Provable contract: parity gate test covers all supported GQA ratios (1, 2, 4, 7, 8)
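
One way to make the last criterion concrete is to enumerate one shape per supported GQA ratio and run the parity gate on each. The sketch below only builds that test matrix. `GqaCase` and `cases_for_ratios` are hypothetical names, and fixing `head_dim=128` and `num_kv_heads=4` for every case is an assumption borrowed from the Qwen2.5-7B shape in this issue.

```python
from dataclasses import dataclass

Q4K_SUPER_BLOCK = 256  # elements per Q4K super-block
HEAD_DIM = 128         # assumption: same head_dim as Qwen2.5-7B

@dataclass(frozen=True)
class GqaCase:
    num_heads: int
    num_kv_heads: int

    @property
    def ratio(self):
        return self.num_heads // self.num_kv_heads

    @property
    def hidden_dim(self):
        return self.num_heads * HEAD_DIM

    @property
    def kv_dim(self):
        return self.num_kv_heads * HEAD_DIM

def cases_for_ratios(ratios, num_kv_heads=4):
    """Build one shape per GQA ratio, holding num_kv_heads fixed."""
    return [GqaCase(num_heads=r * num_kv_heads, num_kv_heads=num_kv_heads)
            for r in ratios]

cases = cases_for_ratios((1, 2, 4, 7, 8))
for case in cases:
    # Every shape must stay Q4K-aligned so that failures isolate GQA handling.
    assert case.hidden_dim % Q4K_SUPER_BLOCK == 0
    assert case.kv_dim % Q4K_SUPER_BLOCK == 0
    print(case.ratio, case.hidden_dim, case.kv_dim)
```

The ratio-7 case reproduces the failing shape exactly (`hidden_dim=3584`, `kv_dim=512`), while the other ratios keep `kv_dim` constant so that only the repeat factor varies between cases.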
