[bug] turbo4 crashes on CUDA (SET_ROWS unported) + mixed K/V types cause ~11x prefill regression #25

@Jaker86

Description

Setup

  • RTX 3090 24GB, CUDA 13 (nvcc 13.0.88), driver 580.126.09, Ubuntu 24.04
  • Branch: feature/turboquant-kv-cache @ 43f7d3d20
  • Model: Qwen3.5-9B Q4_K_M (unsloth/Qwen3.5-9B-GGUF)
  • Build: cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc)

Benchmark results

llama-bench with pp512/tg256, 3 repetitions; PPL measured on wikitext-2-raw (580 chunks, n_ctx=512):

| config | cache-k | cache-v | PPL | decode t/s | prefill t/s | KV cache (n_ctx=2048) |
|---|---|---|---|---|---|---|
| baseline | q8_0 | q8_0 | 8.2018 | 101.69 | 3773.30 | 235 MiB |
| turbo3/3 | turbo3 | turbo3 | 8.3124 | 99.79 | 3727.14 | 215 MiB |
| turbo2/2 | turbo2 | turbo2 | 8.6639 | 100.73 | 3701.82 | 211 MiB |
| turbo4/4 | turbo4 | turbo4 | — | — | — | crashes on init (Bug 1) |
| mixed t3k/t2v | turbo3 | turbo2 | 8.5312 | 87.70 | 329.16 | 213 MiB |
| mixed t2k/t3v | turbo2 | turbo3 | 8.4356 | 87.83 | 321.41 | 213 MiB |
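As a quick sanity check on the headline numbers, the slowdown factors implied by the table (values copied from the rows above):

```python
# Slowdown factors for the mixed-type configs vs the q8_0 baseline,
# using the t/s values from the benchmark table above.
baseline_prefill = 3773.30
mixed_prefill = (329.16 + 321.41) / 2   # average of the two mixed configs

baseline_decode = 101.69
mixed_decode = (87.70 + 87.83) / 2

prefill_slowdown = baseline_prefill / mixed_prefill   # ~11.6x
decode_slowdown = baseline_decode / mixed_decode      # ~1.16x

print(f"prefill slowdown: {prefill_slowdown:.1f}x")
print(f"decode slowdown:  {decode_slowdown:.2f}x")
```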

Bug 1: turbo4 crashes on init

Command:

./build/bin/llama-bench -m MODEL -ngl 99 --flash-attn 1 --cache-type-k turbo4 --cache-type-v turbo4 -p 512 -n 256 -r 3

Error:

ggml/src/ggml-backend.cpp:809: pre-allocated tensor (cache_k_l3 (view)) in a buffer (CUDA0) that cannot run the operation (SET_ROWS)

Full backtrace:

#3  ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*)
#4  ggml_backend_sched_split_graph()
#5  llama_context::graph_reserve(...)
#6  llama_context::sched_reserve()
#7  llama_context::llama_context(llama_model const&, llama_context_params)
#8  llama_init_from_model()

Process aborts (core dumped) before any inference runs.
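The abort matches the generic ggml-backend scheduler check: the KV cache tensor is pre-allocated in a CUDA buffer, but the CUDA backend reports it cannot run SET_ROWS for the turbo4 type, and a pre-allocated tensor cannot be reassigned to another backend. A toy model of that logic (names are illustrative, not the actual ggml API; the assumed cause is that turbo4 is simply missing from the CUDA SET_ROWS support list):

```python
# Toy model of the scheduler check in ggml_backend_sched_split_graph that
# aborts here. Names and the supported-type set are illustrative assumptions.
CUDA_SET_ROWS_TYPES = {"f16", "q8_0", "turbo2", "turbo3"}  # turbo4 assumed unported

def backend_supports_op(backend: str, op: str, tensor_type: str) -> bool:
    if backend == "CPU":
        return True  # CPU backend can always fall back
    if op == "SET_ROWS":
        return tensor_type in CUDA_SET_ROWS_TYPES
    return True

def assign_backend(op: str, tensor_type: str, preallocated_buffer: str) -> str:
    # A pre-allocated tensor is pinned to its buffer's backend; if that
    # backend cannot run the op, scheduling fails hard instead of migrating.
    if not backend_supports_op(preallocated_buffer, op, tensor_type):
        raise RuntimeError(
            f"pre-allocated tensor in a buffer ({preallocated_buffer}) "
            f"that cannot run the operation ({op})"
        )
    return preallocated_buffer
```

Under this model, `assign_backend("SET_ROWS", "turbo4", "CUDA0")` raises with the same message shape as the abort, while turbo2/turbo3 schedule normally.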


Bug 2: mixed K/V types cause ~11.5x prefill regression

When K and V use different turbo types (either combination tested), prefill throughput drops from ~3700 t/s to ~325 t/s, an ~11.5x regression. Decode also regresses (~101 → ~88 t/s, about 13%). The symmetric configs (turbo3/turbo3, turbo2/turbo2) show no regression.

This suggests there is no fused CUDA kernel path when K and V have different quantization types, and attention likely falls back to a slow dequantize-then-compute path.
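A plausible shape for the missing dispatch, consistent with symmetric-fast / mixed-slow behaviour (hypothetical sketch, not the actual llama.cpp kernel selection; the fused-type set is an assumption):

```python
# Hypothetical attention-path dispatch that would explain the benchmark table:
# fused kernels exist only when K and V share a quantization type.
FUSED_KV_TYPES = {"f16", "q8_0", "turbo2", "turbo3"}  # assumed per-type fused kernels

def pick_attention_path(type_k: str, type_v: str) -> str:
    if type_k == type_v and type_k in FUSED_KV_TYPES:
        return "fused-flash-attn"   # reads quantized K/V blocks directly
    return "dequant-fallback"       # dequantizes K and V to f16 first (slow)

print(pick_attention_path("turbo3", "turbo3"))  # fused-flash-attn
print(pick_attention_path("turbo3", "turbo2"))  # dequant-fallback
```

If this is roughly what happens, a fix would either add mixed-type fused kernels or at least warn at context creation that the combination hits the slow path.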


Observation: V may be more quality-sensitive than K

turbo2-K / turbo3-V (PPL 8.4356) outperforms turbo3-K / turbo2-V (PPL 8.5312), contrary to the common assumption that K is the more quality-sensitive cache. This may be model-specific (Qwen3.5-9B uses GQA with a large head dimension of 256).
