[bug] turbo4 crashes on CUDA (SET_ROWS unported) + mixed K/V types cause ~11x prefill regression #25
Setup
- GPU/OS: RTX 3090 24GB, CUDA 13 (nvcc 13.0.88), driver 580.126.09, Ubuntu 24.04
- Branch: `feature/turboquant-kv-cache@43f7d3d20`
- Model: Qwen3.5-9B Q4_K_M (`unsloth/Qwen3.5-9B-GGUF`)
- Build: `cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)`
Benchmark results
llama-bench pp512/tg256, 3 repetitions; PPL on wikitext-2-raw, 580 chunks, n_ctx=512:
| config | cache-k | cache-v | PPL | decode t/s | prefill t/s | KV cache (n_ctx=2048) |
|---|---|---|---|---|---|---|
| baseline | q8_0 | q8_0 | 8.2018 | 101.69 | 3773.30 | 235 MiB |
| turbo3/3 | turbo3 | turbo3 | 8.3124 | 99.79 | 3727.14 | 215 MiB |
| turbo2/2 | turbo2 | turbo2 | 8.6639 | 100.73 | 3701.82 | 211 MiB |
| turbo4/4 | turbo4 | turbo4 | — | — | — | — |
| mixed t3k/t2v | turbo3 | turbo2 | 8.5312 | 87.70 | 329.16 | 213 MiB |
| mixed t2k/t3v | turbo2 | turbo3 | 8.4356 | 87.83 | 321.41 | 213 MiB |
Bug 1: turbo4 crashes on init
Command:

```
./build/bin/llama-bench -m MODEL -ngl 99 --flash-attn 1 --cache-type-k turbo4 --cache-type-v turbo4 -p 512 -n 256 -r 3
```

Error:

```
ggml/src/ggml-backend.cpp:809: pre-allocated tensor (cache_k_l3 (view)) in a buffer (CUDA0) that cannot run the operation (SET_ROWS)
```
Full backtrace:

```
#3 ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*)
#4 ggml_backend_sched_split_graph()
#5 llama_context::graph_reserve(...)
#6 llama_context::sched_reserve()
#7 llama_context::llama_context(llama_model const&, llama_context_params)
#8 llama_init_from_model()
```
Process aborts (core dumped) before any inference runs.
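To illustrate why this aborts during context creation rather than degrading gracefully, here is a minimal sketch of the scheduling constraint (hypothetical code, not the real ggml API): the K/V cache tensors are pre-allocated in a CUDA buffer, so the SET_ROWS op on them cannot be moved to the CPU backend; if the CUDA backend has no SET_ROWS kernel for the cache type, there is no legal placement. The `turbo*` type names and the support table are assumptions mirroring this report.

```cpp
#include <cassert>
#include <string>

// Hypothetical cache types mirroring this fork's turbo quants.
enum class CacheType { Q8_0, TURBO2, TURBO3, TURBO4 };

// Assumption: CUDA SET_ROWS kernels were ported for q8_0/turbo2/turbo3
// but not for turbo4 (the bug being reported).
bool cuda_supports_set_rows(CacheType t) {
    switch (t) {
        case CacheType::Q8_0:
        case CacheType::TURBO2:
        case CacheType::TURBO3:
            return true;
        case CacheType::TURBO4:
            return false; // unported kernel -> no valid backend assignment
    }
    return false;
}

// Sketch of the constraint inside ggml_backend_sched_split_graph():
// an op whose output tensor is pre-allocated in a CUDA buffer must run
// on CUDA; otherwise a CPU fallback would be legal.
std::string schedule_set_rows(CacheType t, bool preallocated_on_cuda) {
    if (cuda_supports_set_rows(t)) {
        return "CUDA0";
    }
    if (!preallocated_on_cuda) {
        return "CPU"; // fallback is only possible for movable tensors
    }
    return "ABORT: pre-allocated tensor in CUDA0 buffer cannot run SET_ROWS";
}
```

This matches the observed behavior: the abort fires in `llama_context` construction (graph reserve), before any tokens are processed.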
Bug 2: mixed K/V types cause ~11.5x prefill regression
When K and V use different turbo types (any combination), prefill drops from ~3700 t/s to ~325 t/s. Decode also regresses slightly (~101 → ~88 t/s). Symmetric configs (turbo3/3, turbo2/2) show no regression.
This suggests there is no optimized CUDA kernel path when K and V have different quantization types: attention likely falls back to a slow dequantize-then-attend path.
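A minimal sketch of the suspected dispatch logic (hypothetical, not the fork's actual code): if fused FlashAttention kernels are only instantiated for matching K/V types, any mixed config silently takes the dequant fallback, which would explain why only the symmetric configs keep full prefill speed.

```cpp
#include <cassert>

// Hypothetical KV cache types from the benchmark table above.
enum class KvType { Q8_0, TURBO2, TURBO3 };

enum class AttnPath { FUSED_FA, DEQUANT_FALLBACK };

// Assumption: fused FlashAttention kernels exist only for type_k == type_v,
// so mixed configs drop to the slow path (~3700 -> ~325 t/s prefill).
AttnPath select_attention_path(KvType type_k, KvType type_v) {
    if (type_k == type_v) {
        return AttnPath::FUSED_FA;
    }
    return AttnPath::DEQUANT_FALLBACK;
}
```

If this is the cause, either instantiating the mixed-type kernel variants or logging a warning when the fallback is taken would make the regression visible to users.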
Observation: V may be more quality-sensitive than K
turbo2-K / turbo3-V (PPL 8.4356) outperforms turbo3-K / turbo2-V (PPL 8.5312), contrary to the common community assumption that K is the more quantization-sensitive tensor. This may be model-specific (Qwen3.5-9B uses GQA with a large head dimension of 256).