GQA/MQA attention broken — only MHA (Q_heads == KV_heads) produces coherent output #61

@unamedkr

Description

Only models with Multi-Head Attention (MHA), where Q_heads == KV_heads, produce coherent inference output. All models using Grouped-Query Attention (GQA) or Multi-Query Attention (MQA) generate garbage text despite loading successfully.

Compatibility Matrix

| Model | Arch | self_attn | Q/KV Heads | Attention Type | Inference |
|---|---|---|---|---|---|
| SmolLM2-1.7B | llama | 24 | 32/32 | MHA | Coherent |
| SmolLM2-135M | llama | 30 | 9/3 | GQA | Garbage |
| Llama-3.2-1B | llama | 16 | 32/8 | GQA | Garbage |
| Qwen3.5-0.8B | qwen35 | 6 | 8/2 | Hybrid | Garbage |
| Phi-3.5-mini | phi3 | 0 | 32/32 | (undetected) | Garbage |
| Gemma-4-E2B | gemma4 | 35 | 8/1 | MQA | Garbage |

Steps to Reproduce

# Start server with a GQA model
./build-metal/quant-server ~/.cache/quantcpp/llama-3.2-1b-instruct-q4_k_m.gguf -p 8080

# Send request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is gravity?"}],"temperature":0.0,"max_tokens":60}'

# Output contains garbage like:
# "ggravity is a hypothetical form of gravity..."
# "<line>assistant</line>\n</s><s><p>"

# Compare with MHA model (SmolLM2-1.7B, Q=32, KV=32)
./build-metal/quant-server ~/dev/projects/TurboQuant.cpp/models/SmolLM2-1.7B-Instruct-Q8_0.gguf -p 8080

# Clean, coherent output:
# "Gravity is the force that attracts two objects with mass towards each other."

Impact

  • Severity: P0 — Most popular models use GQA (Llama 3.x, Mistral, Qwen, Gemma)
  • Only SmolLM2-1.7B (MHA) works correctly among all tested models
  • The README lists Llama-3.2-3B and Gemma-4 as supported, but the Llama-3.2-1B and Gemma-4-E2B models tested here produce garbage

Root Cause Analysis

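The GQA attention path likely has a bug in how KV heads are expanded or repeated to match Q heads. When n_kv_heads < n_heads, each group of n_heads / n_kv_heads query heads must read the same KV head; if the KV cache indexing or the head repetition logic is wrong, attention scores are computed against the wrong key/value vectors. Phi-3.5-mini does not fit this pattern (32/32 heads, but 0 self_attn layers detected), so its failure probably has a separate cause.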

Key area to investigate: the attention computation in tq_transformer.c where n_heads != n_kv_heads.
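
For reference, here is a minimal sketch of the head mapping a GQA path is expected to implement. It is not taken from tq_transformer.c: the function names and the [position][kv_head][head_dim] cache layout are assumptions for illustration, and the real code may use a different layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a query head to the KV head it shares.
 * Assumes n_heads is a positive multiple of n_kv_heads. */
static int kv_head_for_q_head(int q_head, int n_heads, int n_kv_heads) {
    assert(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert(q_head >= 0 && q_head < n_heads);
    int group_size = n_heads / n_kv_heads;  /* query heads per KV head */
    return q_head / group_size;
}

/* Hypothetical cache layout: [position][kv_head][head_dim], contiguous floats.
 * The per-position stride is n_kv_heads * head_dim, not n_heads * head_dim. */
static size_t kv_cache_offset(int pos, int kv_head, int n_kv_heads, int head_dim) {
    return ((size_t)pos * (size_t)n_kv_heads + (size_t)kv_head) * (size_t)head_dim;
}
```

With Llama-3.2-1B's 32/8 configuration, query heads 0 to 3 map to KV head 0 and query head 31 maps to KV head 7. With MHA (32/32) the mapping is the identity, which may be why a stride or mapping mistake would stay hidden on SmolLM2-1.7B.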

Suggested Fix

  1. Review the GQA head repetition logic in tq_transformer.c
  2. Add a unit test that compares MHA vs GQA attention output for the same input
  3. Verify KV cache indexing when n_kv_heads < n_heads

Environment

  • quant.cpp: latest main (49c6605)
  • Build: cmake -DTQ_BUILD_METAL=ON
  • OS: macOS 15 (Apple M3, 16GB)
  • All models are GGUF files (bartowski / HuggingFace quantizations)

Reported by ClawTeam Claw-5 (Researcher persona)
