[bug] turbo4 crashes on CUDA (SET_ROWS unported) + mixed K/V types cause ~11x prefill regression #25
Setup
- GPU/OS: RTX 3090 24GB, CUDA 13 (nvcc 13.0.88), driver 580.126.09, Ubuntu 24.04
- Branch: `feature/turboquant-kv-cache@43f7d3d20`
- Model: Qwen3.5-9B Q4_K_M (`unsloth/Qwen3.5-9B-GGUF`)
- Build: `cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)`
Benchmark results
llama-bench pp512/tg256, 3 repetitions; PPL on wikitext-2-raw, 580 chunks, n_ctx=512:
| config | cache-k | cache-v | PPL | decode t/s | prefill t/s | KV cache (n_ctx=2048) |
|---|---|---|---|---|---|---|
| baseline | q8_0 | q8_0 | 8.2018 | 101.69 | 3773.30 | 235 MiB |
| turbo3/3 | turbo3 | turbo3 | 8.3124 | 99.79 | 3727.14 | 215 MiB |
| turbo2/2 | turbo2 | turbo2 | 8.6639 | 100.73 | 3701.82 | 211 MiB |
| turbo4/4 | turbo4 | turbo4 | — | — | — | — |
| mixed t3k/t2v | turbo3 | turbo2 | 8.5312 | 87.70 | 329.16 | 213 MiB |
| mixed t2k/t3v | turbo2 | turbo3 | 8.4356 | 87.83 | 321.41 | 213 MiB |
Bug 1: turbo4 crashes on init
Command:

```
./build/bin/llama-bench -m MODEL -ngl 99 --flash-attn 1 --cache-type-k turbo4 --cache-type-v turbo4 -p 512 -n 256 -r 3
```

Error:

```
ggml/src/ggml-backend.cpp:809: pre-allocated tensor (cache_k_l3 (view)) in a buffer (CUDA0) that cannot run the operation (SET_ROWS)
```
Full backtrace:

```
#3 ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*)
#4 ggml_backend_sched_split_graph()
#5 llama_context::graph_reserve(...)
#6 llama_context::sched_reserve()
#7 llama_context::llama_context(llama_model const&, llama_context_params)
#8 llama_init_from_model()
```
Process aborts (core dumped) before any inference runs.
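To illustrate why this aborts during context creation rather than degrading gracefully, here is a minimal sketch of the scheduling constraint (hypothetical code, not the real ggml API): the K/V cache tensors are pre-allocated in a CUDA buffer, so the SET_ROWS op on them cannot be moved to the CPU backend; if the CUDA backend has no SET_ROWS kernel for the cache type, there is no legal placement. The `turbo*` type names and the support table are assumptions mirroring this report.

```cpp
#include <cassert>
#include <string>

// Hypothetical cache types mirroring this fork's turbo quants.
enum class CacheType { Q8_0, TURBO2, TURBO3, TURBO4 };

// Assumption: CUDA SET_ROWS kernels were ported for q8_0/turbo2/turbo3
// but not for turbo4 (the bug being reported).
bool cuda_supports_set_rows(CacheType t) {
    switch (t) {
        case CacheType::Q8_0:
        case CacheType::TURBO2:
        case CacheType::TURBO3:
            return true;
        case CacheType::TURBO4:
            return false; // unported kernel -> no valid backend assignment
    }
    return false;
}

// Sketch of the constraint inside ggml_backend_sched_split_graph():
// an op whose output tensor is pre-allocated in a CUDA buffer must run
// on CUDA; otherwise a CPU fallback would be legal.
std::string schedule_set_rows(CacheType t, bool preallocated_on_cuda) {
    if (cuda_supports_set_rows(t)) {
        return "CUDA0";
    }
    if (!preallocated_on_cuda) {
        return "CPU"; // fallback is only possible for movable tensors
    }
    return "ABORT: pre-allocated tensor in CUDA0 buffer cannot run SET_ROWS";
}
```

This matches the observed behavior: the abort fires in `llama_context` construction (graph reserve), before any tokens are processed.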
Bug 2: mixed K/V types cause ~11.5x prefill regression
When K and V use different turbo types (any combination), prefill drops from ~3700 t/s to ~325 t/s. Decode also regresses slightly (~101 → ~88 t/s). Symmetric configs (turbo3/3, turbo2/2) show no regression.
This suggests there is no optimized CUDA kernel path when K and V have different quantization types: attention likely falls back to a slow dequantize-then-attend path.
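A minimal sketch of the suspected dispatch logic (hypothetical, not the fork's actual code): if fused FlashAttention kernels are only instantiated for matching K/V types, any mixed config silently takes the dequant fallback, which would explain why only the symmetric configs keep full prefill speed.

```cpp
#include <cassert>

// Hypothetical KV cache types from the benchmark table above.
enum class KvType { Q8_0, TURBO2, TURBO3 };

enum class AttnPath { FUSED_FA, DEQUANT_FALLBACK };

// Assumption: fused FlashAttention kernels exist only for type_k == type_v,
// so mixed configs drop to the slow path (~3700 -> ~325 t/s prefill).
AttnPath select_attention_path(KvType type_k, KvType type_v) {
    if (type_k == type_v) {
        return AttnPath::FUSED_FA;
    }
    return AttnPath::DEQUANT_FALLBACK;
}
```

If this is the cause, either instantiating the mixed-type kernel variants or logging a warning when the fallback is taken would make the regression visible to users.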
Observation: V may be more quality-sensitive than K
turbo2-K / turbo3-V (PPL 8.4356) outperforms turbo3-K / turbo2-V (PPL 8.5312), contrary to the common community assumption that K is the more quantization-sensitive tensor. This may be model-specific (Qwen3.5-9B uses GQA with a large head dimension of 256).