Asymmetric q8_0-K / turbo3-V produces corrupt output on Qwen3.5-122B (head_dim=256) #47

@TheTom

Description

Summary

Asymmetric KV cache configuration (-ctk q8_0 -ctv turbo3) produces corrupt output on Qwen3.5-122B-A10B Q5_K_S. The model outputs literal U+003F (?) characters at full speed (61.1 t/s). Symmetric turbo3/turbo3 works correctly on the same hardware, same binary.

Reporter

@sjoerdmaessen (discussion comment)

Environment

  • Model: Qwen3.5-122B-A10B Q5_K_S (Unsloth imatrix, 86.4GB)
  • Hardware: 2x NVIDIA L40S 48GB (SM89, Ada Lovelace), AMD EPYC 9354P
  • head_dim: 256
  • Architecture: Hybrid MoE — 12 attention layers (GQA) + 36 recurrent layers (Gated DeltaNet), 256 experts, 10B active/token
  • Branch: feature/turboquant-kv-cache

Reproduction

# CORRUPT: asymmetric q8_0-K / turbo3-V
llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on
# Output: literal ? characters, correct speed

# WORKS: symmetric turbo3 / turbo3
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on
# Output: coherent, 58 t/s

Distinguishing Factor: head_dim=256

Every successful asymmetric validation to date uses head_dim=128:

  • HyperionMS2040: RTX 3090, head_dim=128, PPL verified ✅
  • Madreag: 4x CUDA GPUs, head_dim=128, PPL verified ✅
  • AMD HIP: RX 9070 XT, head_dim=128, PPL verified ✅
  • Metal testing: head_dim=128, PPL verified ✅

This is the first asymmetric test on head_dim=256.

Investigation Summary

Code paths verified correct:

  1. Pre-dequant in launch_fattn (fattn-common.cuh:1268): Uses V->type independently — correctly identifies turbo3 V even when K is q8_0 ✅
  2. Turbo3→f16 conversion (convert.cu:760): Just reconstructs centroid values, does NOT apply inverse WHT. Output is correctly WHT-rotated f16 ✅
  3. Graph-level inverse WHT (llama-graph.cpp:1887): Correctly gated on v->type, not k->type
  4. CUDA WHT kernel (turbo-wht.cu:141): Handles nullptr innerq_scale correctly ✅
  5. CUDA FA template instances: q8_0/turbo3 at D=256 exists (fattn-vec-instance-q8_0-turbo3_0.cu) ✅
  6. FA kernel dispatch: MMA used for prefill, VEC for decode — both have correct paths ✅
  7. Math: WHT is linear, factors out of attention weighted sum — asymmetric is mathematically correct ✅

Known bug (may not be root cause):

llama-kv-cache.cpp:332 — The rotation matrices (turbo_rotation, turbo_rotation_inv, turbo_innerq_scale_inv) are only allocated when type_k is a turbo type. With asymmetric q8_0-K / turbo3-V, these tensors are never created. However, the kernel-level WHT (ggml_turbo_wht) does not depend on these matrices, and the nullptr innerq_scale is handled correctly, so this allocation gap may not be the root cause.

Not yet investigated:

  • Hybrid memory context (llama_memory_hybrid) type propagation for turbo operations
  • Whether the D=256 VEC FA kernel dequant handles 2 WHT groups (128+128) correctly in asymmetric mode
  • Whether multi-GPU tensor split interacts with the pre-dequant scratch buffer allocation

Requested Diagnostic

@sjoerdmaessen — if you have time:

  1. Test asymmetric -ctk q8_0 -ctv turbo3 on a smaller head_dim=128 model (any Qwen 7B/27B) to isolate the head_dim variable. Just verify output coherence, no speed data needed.
  2. If possible, run llama-perplexity with asymmetric on the 122B (even a short run) to see if PPL is catastrophic or normal.

Speed Data (throughput unaffected; content corrupt in asymmetric turbo configs)

Config            TG (t/s)   Content
q8_0 / q8_0       61.1       correct
q8_0 / turbo3     61.1       CORRUPT
q8_0 / turbo2     61.3       likely corrupt
turbo3 / turbo3   58.0       correct

Current Workaround

Use symmetric turbo3/turbo3 instead of asymmetric:

--cache-type-k turbo3 --cache-type-v turbo3

Sjoerd's production config: turbo3/turbo3, 2x104K dual-slot, MTMD_BACKEND_DEVICE=CUDA1.
