GQA/MQA attention broken — only MHA (Q_heads == KV_heads) produces coherent output #61

## Description

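Every tested model that uses Grouped-Query Attention (GQA) or Multi-Query Attention (MQA) generates garbage text despite loading successfully. The only model that produces coherent inference output is SmolLM2-1.7B, which uses Multi-Head Attention (MHA), where `Q_heads == KV_heads`. Phi-3.5-mini also has equal head counts (32/32) but still produces garbage; its attention layers are not detected (`self_attn` is 0), so that failure may have a separate cause.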

## Compatibility Matrix

| Model | Arch | self_attn layers detected | Q/KV Heads | Attention Type | Inference |
|-------|------|-----------|------------|----------------|-----------|
| SmolLM2-1.7B | llama | 24 | 32/32 | **MHA** | **Coherent** |
| SmolLM2-135M | llama | 30 | 9/3 | GQA | Garbage |
| Llama-3.2-1B | llama | 16 | 32/8 | GQA | Garbage |
| Qwen3.5-0.8B | qwen35 | 6 | 8/2 | Hybrid | Garbage |
| Phi-3.5-mini | phi3 | 0 | 32/32 | (undetected) | Garbage |
| Gemma-4-E2B | gemma4 | 35 | 8/1 | MQA | Garbage |

## Steps to Reproduce

```bash
# Start server with a GQA model
./build-metal/quant-server ~/.cache/quantcpp/llama-3.2-1b-instruct-q4_k_m.gguf -p 8080

# Send request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"temperature":0.0,"max_tokens":60}'

# Output contains garbage like:
# "ggravity is a hypothetical form of gravity..."
# "<line>assistant</line>\n</s><s><p>"
```

```bash
# Compare with MHA model (SmolLM2-1.7B, Q=32, KV=32)
./build-metal/quant-server ~/dev/projects/TurboQuant.cpp/models/SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080

# Clean, coherent output:
# "Gravity is the force that attracts two objects with mass towards each other."
```

## Impact

- **Severity: P0** — Most popular models use GQA (Llama 3.x, Mistral, Qwen, Gemma)
- Only SmolLM2-1.7B (MHA) works correctly among all tested models
- The README lists Llama-3.2-3B and Gemma-4 as supported, but they produce garbage

## Root Cause Analysis

The GQA attention path likely has a bug in how KV heads are expanded/repeated to match Q heads. When `n_kv_heads < n_heads`, the KV cache indexing or head repetition logic may be incorrect, causing attention scores to be computed against wrong key/value vectors.

Key area to investigate: the attention computation in `tq_transformer.c` where `n_heads != n_kv_heads`.
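
For reference while reviewing, here is a minimal sketch of the mapping the GQA path is expected to implement. With `group = n_heads / n_kv_heads`, consecutive query heads share one KV head, so query head `h` reads KV head `h / group`. The function names and the `[pos][kv_head][head_dim]` cache layout are assumptions for illustration only; the actual layout in `tq_transformer.c` may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from tq_transformer.c): map a query head to
 * the KV head it shares. Query heads [0, group) use KV head 0,
 * [group, 2*group) use KV head 1, and so on. */
static int gqa_kv_head(int q_head, int n_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert(q_head >= 0 && q_head < n_heads);
    int group = n_heads / n_kv_heads;   /* 1 for MHA, n_heads for MQA */
    return q_head / group;
}

/* Element offset of the cached key (or value) vector that q_head reads at
 * position pos, assuming a [pos][kv_head][head_dim] layout. The per-position
 * stride is n_kv_heads * head_dim; using n_heads there would be correct only
 * when n_heads == n_kv_heads. */
static size_t gqa_kv_offset(int pos, int q_head, int n_heads,
                            int n_kv_heads, int head_dim) {
    int kv_head = gqa_kv_head(q_head, n_heads, n_kv_heads);
    return ((size_t)pos * (size_t)n_kv_heads + (size_t)kv_head) * (size_t)head_dim;
}
```

Two mistakes would fit the symptom that only `n_heads == n_kv_heads` works: indexing the cache with the query head directly, or computing the position stride from `n_heads`. Both are no-ops for MHA and read the wrong vectors otherwise. This is a hypothesis to check, not a confirmed diagnosis.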

## Suggested Fix

1. Review the GQA head repetition logic in `tq_transformer.c`
2. Add a unit test that compares MHA vs GQA attention output for the same input
3. Verify KV cache indexing when `n_kv_heads < n_heads`
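
The test in step 2 could be shaped roughly as follows: compute scores through the GQA path, then expand K to one head per query head and compute them again through the MHA path, and require identical results. This is a simplified sketch that compares only raw dot-product scores at a single position (no scaling, softmax, or value weighting). `attn_scores`, `expand_kv`, and `gqa_matches_mha` are hypothetical names; in a real test the project's own attention routine would replace `attn_scores`.

```c
#include <assert.h>
#include <stddef.h>

/* Raw (pre-softmax) score of each query head against its key at one position.
 * q: [n_heads][head_dim], k: [n_kv_heads][head_dim], scores: [n_heads]. */
static void attn_scores(const float *q, const float *k, float *scores,
                        int n_heads, int n_kv_heads, int head_dim) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    int group = n_heads / n_kv_heads;
    for (int h = 0; h < n_heads; h++) {
        const float *qh = q + (size_t)h * head_dim;
        const float *kh = k + (size_t)(h / group) * head_dim;
        float dot = 0.0f;
        for (int d = 0; d < head_dim; d++) dot += qh[d] * kh[d];
        scores[h] = dot;
    }
}

/* Expand K from n_kv_heads to n_heads by repeating each KV head `group`
 * times in order (KV head 0 first, then KV head 1, ...). */
static void expand_kv(const float *k, float *k_full,
                      int n_heads, int n_kv_heads, int head_dim) {
    int group = n_heads / n_kv_heads;
    for (int h = 0; h < n_heads; h++)
        for (int d = 0; d < head_dim; d++)
            k_full[(size_t)h * head_dim + d] = k[(size_t)(h / group) * head_dim + d];
}

/* Returns 1 if the GQA path and the MHA path on expanded K agree exactly. */
static int gqa_matches_mha(int n_heads, int n_kv_heads) {
    enum { HEAD_DIM = 4, MAX_HEADS = 32 };
    float q[MAX_HEADS * HEAD_DIM], k[MAX_HEADS * HEAD_DIM], k_full[MAX_HEADS * HEAD_DIM];
    float s_gqa[MAX_HEADS], s_mha[MAX_HEADS];
    assert(n_heads <= MAX_HEADS && n_kv_heads <= n_heads);
    /* Deterministic small-integer inputs, so float sums are exact. */
    for (int i = 0; i < n_heads * HEAD_DIM; i++) q[i] = (float)(i % 7 - 3);
    for (int i = 0; i < n_kv_heads * HEAD_DIM; i++) k[i] = (float)(i % 5 - 2);
    expand_kv(k, k_full, n_heads, n_kv_heads, HEAD_DIM);
    attn_scores(q, k, s_gqa, n_heads, n_kv_heads, HEAD_DIM);   /* GQA path */
    attn_scores(q, k_full, s_mha, n_heads, n_heads, HEAD_DIM); /* MHA path */
    for (int h = 0; h < n_heads; h++)
        if (s_gqa[h] != s_mha[h]) return 0;
    return 1;
}
```

Calling `gqa_matches_mha` with the head counts from the compatibility matrix (32/8, 9/3, 8/1, 32/32) would cover the GQA, MQA, and MHA cases. The same expand-and-compare approach applies to V and to the full softmax output once the score check passes.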

## Environment

- quant.cpp: latest main (49c6605)
- Build: `cmake -DTQ_BUILD_METAL=ON`
- OS: macOS 15 (Apple M3, 16GB)
- All models are GGUF files (bartowski / HuggingFace quantizations)

---
*Reported by ClawTeam Claw-5 (Researcher persona)*
