Asymmetric q8_0-K / turbo3-V produces corrupt output on Qwen3.5-122B (head_dim=256) #47

@TheTom

Description

Summary

Asymmetric KV cache configuration (-ctk q8_0 -ctv turbo3) produces corrupt output on Qwen3.5-122B-A10B Q5_K_S. The model outputs literal U+003F (?) characters at full speed (61.1 t/s). Symmetric turbo3/turbo3 works correctly on the same hardware, same binary.

Reporter

@sjoerdmaessen (discussion comment)

Environment

  • Model: Qwen3.5-122B-A10B Q5_K_S (Unsloth imatrix, 86.4GB)
  • Hardware: 2x NVIDIA L40S 48GB (SM89, Ada Lovelace), AMD EPYC 9354P
  • head_dim: 256
  • Architecture: Hybrid MoE — 12 attention layers (GQA) + 36 recurrent layers (Gated DeltaNet), 256 experts, 10B active/token
  • Branch: feature/turboquant-kv-cache

Reproduction

# CORRUPT: asymmetric q8_0-K / turbo3-V
llama-server -m model.gguf -ctk q8_0 -ctv turbo3 -fa on
# Output: literal ? characters, correct speed

# WORKS: symmetric turbo3 / turbo3
llama-server -m model.gguf -ctk turbo3 -ctv turbo3 -fa on
# Output: coherent, 58 t/s

Distinguishing Factor: head_dim=256

Every successful asymmetric validation to date uses head_dim=128:

  • HyperionMS2040: RTX 3090, head_dim=128, PPL verified ✅
  • Madreag: 4x CUDA GPUs, head_dim=128, PPL verified ✅
  • AMD HIP: RX 9070 XT, head_dim=128, PPL verified ✅
  • Metal testing: head_dim=128, PPL verified ✅

This is the first asymmetric test on head_dim=256.

Investigation Summary

Code paths verified correct:

  1. Pre-dequant in launch_fattn (fattn-common.cuh:1268): Uses V->type independently — correctly identifies turbo3 V even when K is q8_0 ✅
  2. Turbo3→f16 conversion (convert.cu:760): Just reconstructs centroid values, does NOT apply inverse WHT. Output is correctly WHT-rotated f16 ✅
  3. Graph-level inverse WHT (llama-graph.cpp:1887): Correctly gated on v->type, not k->type
  4. CUDA WHT kernel (turbo-wht.cu:141): Handles nullptr innerq_scale correctly ✅
  5. CUDA FA template instances: q8_0/turbo3 at D=256 exists (fattn-vec-instance-q8_0-turbo3_0.cu) ✅
  6. FA kernel dispatch: MMA used for prefill, VEC for decode — both have correct paths ✅
  7. Math: WHT is linear, factors out of attention weighted sum — asymmetric is mathematically correct ✅

Known bug (may not be root cause):

llama-kv-cache.cpp:332 — The rotation matrices (turbo_rotation, turbo_rotation_inv, turbo_innerq_scale_inv) are only allocated when type_k is a turbo type. With asymmetric q8_0-K / turbo3-V, these tensors are never created. However, the kernel-level WHT (ggml_turbo_wht) does not depend on these matrices, and the nullptr innerq_scale is handled correctly, so this allocation gap may not be the root cause.

Not yet investigated:

  • Hybrid memory context (llama_memory_hybrid) type propagation for turbo operations
  • Whether the D=256 VEC FA kernel dequant handles 2 WHT groups (128+128) correctly in asymmetric mode
  • Whether multi-GPU tensor split interacts with the pre-dequant scratch buffer allocation

Requested Diagnostic

@sjoerdmaessen — if you have time:

  1. Test asymmetric -ctk q8_0 -ctv turbo3 on a smaller head_dim=128 model (any Qwen 7B/27B) to isolate the head_dim variable. Just verify output coherence, no speed data needed.
  2. If possible, run llama-perplexity with asymmetric on the 122B (even a short run) to see if PPL is catastrophic or normal.

Speed Data (throughput unaffected; content corrupt in asymmetric turbo configs)

Config            TG (t/s)   Content
q8_0 / q8_0       61.1       correct
q8_0 / turbo3     61.1       CORRUPT
q8_0 / turbo2     61.3       likely corrupt
turbo3 / turbo3   58.0       correct

Current Workaround

Use symmetric turbo3/turbo3 instead of asymmetric:

--cache-type-k turbo3 --cache-type-v turbo3

Sjoerd's production config: turbo3/turbo3, 2x104K dual-slot, MTMD_BACKEND_DEVICE=CUDA1.
