Asymmetric q8_0-K / turbo3-V produces corrupt output on Qwen3.5-122B (head_dim=256) #47
Description
Summary
Asymmetric KV cache configuration (-ctk q8_0 -ctv turbo3) produces corrupt output on Qwen3.5-122B-A10B Q5_K_S. The model outputs literal U+003F (?) characters at full speed (61.1 t/s). Symmetric turbo3/turbo3 works correctly on the same hardware, same binary.
Reporter
@sjoerdmaessen — Discussion comment
Environment
- Model: Qwen3.5-122B-A10B Q5_K_S (Unsloth imatrix, 86.4GB)
- Hardware: 2x NVIDIA L40S 48GB (SM89, Ada Lovelace), AMD EPYC 9354P
- head_dim: 256
- Architecture: Hybrid MoE — 12 attention layers (GQA) + 36 recurrent layers (Gated DeltaNet), 256 experts, 10B active/token
- Branch: feature/turboquant-kv-cache
Reproduction
```sh
# CORRUPT: asymmetric q8_0-K / turbo3-V
llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on
# Output: literal ? characters, correct speed

# WORKS: symmetric turbo3 / turbo3
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on
# Output: coherent, 58 t/s
```
Distinguishing Factor: head_dim=256
Every successful asymmetric validation to date uses head_dim=128:
- HyperionMS2040: RTX 3090, head_dim=128, PPL verified ✅
- Madreag: 4x CUDA GPUs, head_dim=128, PPL verified ✅
- AMD HIP: RX 9070 XT, head_dim=128, PPL verified ✅
- Metal testing: head_dim=128, PPL verified ✅
This is the first asymmetric test on head_dim=256.
Investigation Summary
Code paths verified correct:
- Pre-dequant in `launch_fattn` (`fattn-common.cuh:1268`): uses `V->type` independently — correctly identifies turbo3 V even when K is q8_0 ✅
- Turbo3→f16 conversion (`convert.cu:760`): just reconstructs centroid values, does NOT apply the inverse WHT. Output is correctly WHT-rotated f16 ✅
- Graph-level inverse WHT (`llama-graph.cpp:1887`): correctly gated on `v->type`, not `k->type` ✅
- CUDA WHT kernel (`turbo-wht.cu:141`): handles nullptr `innerq_scale` correctly ✅
- CUDA FA template instances: the q8_0/turbo3 instance at D=256 exists (`fattn-vec-instance-q8_0-turbo3_0.cu`) ✅
- FA kernel dispatch: MMA used for prefill, VEC for decode — both have correct paths ✅
- Math: WHT is linear, factors out of attention weighted sum — asymmetric is mathematically correct ✅
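The linearity argument in the last item can be spot-checked numerically. A minimal sketch, using an orthonormal Sylvester–Hadamard matrix as a stand-in for the turbo3 WHT (everything here is illustrative, not the actual kernel code): because the transform is linear and orthonormal, applying the inverse WHT once to the attention-weighted sum of rotated values gives the same result as attending over the raw values.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal scaling: H @ H.T == I

rng = np.random.default_rng(0)
d, n_kv = 128, 16
H = hadamard(d)

V = rng.standard_normal((n_kv, d))   # value vectors
a = rng.random(n_kv); a /= a.sum()   # attention weights (softmax-like)

# Store rotated values (as a turbo-style cache would), attend,
# then apply the inverse WHT once to the weighted sum.
out_rotated = a @ (V @ H.T)          # sum_i a_i * (H v_i)
out = out_rotated @ H                # inverse WHT after the sum

assert np.allclose(out, a @ V)       # identical to attending over raw V
```

This holds regardless of how K is quantized, which is why asymmetric K/V is mathematically sound in principle.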
Known bug (may not be root cause):
llama-kv-cache.cpp:332 — Rotation matrices (turbo_rotation, turbo_rotation_inv, turbo_innerq_scale_inv) are only allocated when type_k is turbo. With asymmetric q8_0-K/turbo3-V, these tensors are never created. However, the kernel-level WHT (ggml_turbo_wht) does not depend on these matrices, and the innerq_scale nullptr is handled correctly.
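If those tensors do turn out to be needed on the asymmetric path, the fix would presumably be to widen the allocation guard to consider both cache types. A minimal sketch of the corrected condition (hypothetical names, modeled in Python rather than the actual llama-kv-cache.cpp code):

```python
# Hypothetical model of the allocation guard near llama-kv-cache.cpp:332.
# Names are illustrative, not the actual llama.cpp identifiers.
TURBO_TYPES = {"turbo2", "turbo3"}

def needs_rotation_tensors(type_k: str, type_v: str) -> bool:
    # The reported guard checks only type_k; the asymmetric
    # q8_0-K / turbo3-V case would need the V side considered too.
    return type_k in TURBO_TYPES or type_v in TURBO_TYPES

assert needs_rotation_tensors("turbo3", "turbo3")   # symmetric: allocated
assert needs_rotation_tensors("q8_0", "turbo3")     # asymmetric: now allocated
assert not needs_rotation_tensors("q8_0", "q8_0")   # plain q8_0: not needed
```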
Not yet investigated:
- Hybrid memory context (llama_memory_hybrid) type propagation for turbo operations
- Whether the D=256 VEC FA kernel dequant handles 2 WHT groups (128+128) correctly in asymmetric mode
- Whether multi-GPU tensor split interacts with the pre-dequant scratch buffer allocation
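On the second point, the suspected failure mode can be illustrated numerically: if V is rotated in two independent 128-wide WHT groups but some path applies a single 256-wide inverse transform (or vice versa), the output is scrambled rather than recovered. A minimal sketch, using orthonormal Hadamard matrices as stand-ins for the turbo WHT:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

rng = np.random.default_rng(1)
v = rng.standard_normal(256)
H128, H256 = hadamard(128), hadamard(256)

# Forward WHT applied per 128-wide group (hypothesized layout at head_dim=256)
rotated = np.concatenate([H128 @ v[:128], H128 @ v[128:]])

# Matching inverse: per-group; mismatched inverse: one 256-wide transform
per_group  = np.concatenate([H128.T @ rotated[:128], H128.T @ rotated[128:]])
full_width = H256.T @ rotated

assert np.allclose(per_group, v)       # group-wise inverse recovers v
assert not np.allclose(full_width, v)  # full-width inverse scrambles it
```

A group-size mismatch like this would corrupt content without affecting throughput, which matches the observed symptoms (full speed, garbage tokens).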
Requested Diagnostic
@sjoerdmaessen — if you have time:
- Test asymmetric `-ctk q8_0 -ctv turbo3` on a smaller head_dim=128 model (any Qwen 7B/27B) to isolate the head_dim variable. Just verify output coherence; no speed data needed.
- If possible, run `llama-perplexity` with the asymmetric config on the 122B (even a short run) to see whether PPL is catastrophic or normal.
Speed Data (accurate, content corrupt)
| Config | TG (t/s) | Content |
|---|---|---|
| q8_0 / q8_0 | 61.1 | correct |
| q8_0 / turbo3 | 61.1 | CORRUPT |
| q8_0 / turbo2 | 61.3 | likely corrupt |
| turbo3 / turbo3 | 58.0 | correct |
Current Workaround
Use symmetric turbo3/turbo3 instead of asymmetric:
```sh
--cache-type-k turbo3 --cache-type-v turbo3
```
Sjoerd's production config: turbo3/turbo3, 2x104K dual-slot, MTMD_BACKEND_DEVICE=CUDA1.