Description
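Among the models tested, only SmolLM2-1.7B, which uses Multi-Head Attention (MHA, Q_heads == KV_heads), produces coherent inference output. Every model using Grouped-Query Attention (GQA) or Multi-Query Attention (MQA) generates garbage text despite loading successfully. Phi-3.5-mini also has equal Q and KV head counts (32/32) but still produces garbage; its attention type is not detected (self_attn is 0), so it may be failing for a separate reason.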
Compatibility Matrix
| Model | Arch | self_attn | Q/KV Heads | Attention Type | Inference |
|---|---|---|---|---|---|
| SmolLM2-1.7B | llama | 24 | 32/32 | MHA | Coherent |
| SmolLM2-135M | llama | 30 | 9/3 | GQA | Garbage |
| Llama-3.2-1B | llama | 16 | 32/8 | GQA | Garbage |
| Qwen3.5-0.8B | qwen35 | 6 | 8/2 | Hybrid | Garbage |
| Phi-3.5-mini | phi3 | 0 | 32/32 | (undetected) | Garbage |
| Gemma-4-E2B | gemma4 | 35 | 8/1 | MQA | Garbage |
Steps to Reproduce
# Start server with a GQA model
./build-metal/quant-server ~/.cache/quantcpp/llama-3.2-1b-instruct-q4_k_m.gguf -p 8080
# Send request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"What is gravity?"}],"temperature":0.0,"max_tokens":60}'
# Output contains garbage like:
# "ggravity is a hypothetical form of gravity..."
# "<line>assistant</line>\n</s><s><p>"
# Compare with MHA model (SmolLM2-1.7B, Q=32, KV=32)
./build-metal/quant-server ~/dev/projects/TurboQuant.cpp/models/SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080
# Clean, coherent output:
# "Gravity is the force that attracts two objects with mass towards each other."
Impact
- Severity: P0, because most popular models use GQA (Llama 3.x, Mistral, Qwen, Gemma)
- Only SmolLM2-1.7B (MHA) works correctly among all tested models
- The README lists Llama-3.2-3B and Gemma-4 as supported, but they produce garbage
Root Cause Analysis
The GQA attention path likely has a bug in how KV heads are expanded/repeated to match Q heads. When n_kv_heads < n_heads, the KV cache indexing or head repetition logic may be incorrect, causing attention scores to be computed against wrong key/value vectors.
Key area to investigate: the attention computation in tq_transformer.c where n_heads != n_kv_heads.
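For reference, the head mapping that GQA is expected to follow can be sketched in a few lines of C. This is a minimal illustration, not code from tq_transformer.c: the function names and the `[position][kv_head][head_dim]` cache layout are assumptions, and the grouping (consecutive query heads share one KV head) follows the convention used by Llama-family reference implementations.

```c
#include <stddef.h>

/* Hypothetical helper: returns the KV head shared by a given query head.
 * Assumes n_heads is a multiple of n_kv_heads and that consecutive
 * query heads form one group. */
int kv_head_for_q_head(int q_head, int n_heads, int n_kv_heads) {
    int group_size = n_heads / n_kv_heads; /* query heads per KV head */
    return q_head / group_size;
}

/* Hypothetical cache layout: [position][kv_head][head_dim], contiguous
 * floats. The per-position stride is n_kv_heads * head_dim; using
 * n_heads * head_dim here would read the wrong vectors under GQA. */
size_t kv_cache_offset(int pos, int q_head, int n_heads,
                       int n_kv_heads, int head_dim) {
    int kv_head = kv_head_for_q_head(q_head, n_heads, n_kv_heads);
    return ((size_t)pos * (size_t)n_kv_heads + (size_t)kv_head)
           * (size_t)head_dim;
}
```

With MHA (n_heads == n_kv_heads) the group size is 1 and the mapping is the identity, which would explain why an indexing error of this kind stays hidden on SmolLM2-1.7B. For Llama-3.2-1B (32/8), query heads 0 to 3 map to KV head 0, heads 4 to 7 to KV head 1, and so on.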
Suggested Fix
- Review the GQA head repetition logic in tq_transformer.c
- Add a unit test that compares MHA vs GQA attention output for the same input
- Verify KV cache indexing when n_kv_heads < n_heads
Environment
- quant.cpp: latest main (49c6605)
- Build: cmake -DTQ_BUILD_METAL=ON
- OS: macOS 15 (Apple M3, 16GB)
- All models from GGUF format (bartowski / HuggingFace quantizations)
Reported by ClawTeam Claw-5 (Researcher persona)