CUDA SET_ROWS turbo3: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576 (e.g. GLM-4.7 Flash / deepseek2 K heads) #13
Description
FYI: I am not experienced with CUDA development, so I am reporting this here for people like @signalnine etc. to analyze.
So far I have hit this issue with:
- GLM-4.7 Flash
The following models load and work fine:
- Mistral: Devstral-2, Devstral-Small, ministral3 arch (Ministral3-14B / 3B)
- OpenAI gptoss arch: gpt-oss 120B, gpt-oss 20B
- MiniMax 2.1
- gemma3 arch: Essential AI rnj-1, gemma-3-12b
- Qwen3-next arch: Qwen3-Coder-Next, Qwen3-Next-80B-A3B-Instruct
- qwen35moe arch: Qwen3.5-122B-A10B, Qwen3.5-4B, Qwen3.5-27B
- llama arch: llama3.3 8B
- nemotron_h_moe arch: Nemotron-3-Nano-30B-A3B, Nemotron-3-Super-120B-A12B
- olmo2 arch: AllenAI Olmo-3-7B, Olmo-3.1-32B
────────────────────────────────────────
Summary
With KV cache types -ctk turbo3 -ctv turbo3, llama-server aborts during the default warmup decode (common_init_from_params). The failure is an assertion in set_rows_cuda_turbo3: the first dimension of the source tensor (ne00) must be a multiple of QK_TURBO3_GROUP (128), but for this model it is 576, so 576 % 128 = 64 and the assert fires.
────────────────────────────────────────
Environment
• OS: Linux x86_64 (e.g. Ubuntu 24.04)
• GPU: e.g. NVIDIA RTX 3080, CUDA enabled
• Build: llama-server with CUDA, debug or relwithdebinfo (asserts enabled)
• Model: GLM-4.7 Flash GGUF (e.g. GLM-4.7-Flash-Q4_K_M.gguf, arch reported as deepseek2, n_embd_head_k = 576)
────────────────────────────────────────
Steps to reproduce
```shell
CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
  --webui-mcp-proxy \
  --alias llamacpp-model \
  -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
  -ctk turbo3 -ctv turbo3
```
No HTTP traffic required; the crash happens during model init / warmup.
────────────────────────────────────────
Expected behavior
Server loads, completes warmup decode, and listens (or fails with a clear user-facing error if turbo3 is unsupported for this tensor layout).
────────────────────────────────────────
Actual behavior
Process aborts with:
ggml/src/ggml-cuda/set-rows.cu:387: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) failed
Exit code 134 (SIGABRT).
────────────────────────────────────────
Root cause (analysis)
• QK_TURBO3_GROUP is 128 (ggml-common.h).
• set_rows_cuda_turbo3 assumes ne00 is divisible by 128 before launching k_set_rows_turbo3 (one block per 128-element group).
• For this model/graph, the SET_ROWS op uses ne00 = 576 (consistent with the K head dimension in a deepseek2-style config), and 576 is not a multiple of 128.
────────────────────────────────────────
Stack trace (representative)
set_rows_cuda_turbo3 (set-rows.cu:387)
set_rows_cuda<float, long> (set-rows.cu:512)
ggml_cuda_op_set_rows (set-rows.cu:527)
ggml_cuda_compute_forward (ggml-cuda.cu)
ggml_backend_sched_graph_compute_async
llama_context::graph_compute
llama_context::process_ubatch / decode
common_init_from_params (warmup llama_decode)
server_context_impl::load_model
main
────────────────────────────────────────
Additional data
• Without -ctk turbo3 -ctv turbo3, the same model/server path can succeed (f16 KV), so this is specific to the turbo3 SET_ROWS path and tensor shapes.
• A core dump analyzed in GDB shows ne00 = 576, ne01 = 2 at the failing frame.
────────────────────────────────────────
Suggested directions for a fix
- Kernel / launcher: support row sizes where ne00 % 128 != 0 (e.g. full groups plus a partial tail block, or pad to a multiple of 128 with a defined layout).
- Graph / scheduling: avoid emitting CUDA SET_ROWS with GGML_TYPE_TURBO3_0 for tensors whose leading dimension is not a multiple of 128, or reshape/split the op so each row chunk is 128-aligned.
- UX: if an unsupported combination is detected early, fail at init with a message naming the head dim vs. the turbo3 group size, instead of a GGML_ASSERT deep in CUDA code.
────────────────────────────────────────
Labels (suggestion)
bug, cuda, turboquant / kv-cache, assert
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 5500 GT + RTX 3080 20GB
Models
Unsloth https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
Problem description & steps to reproduce
```shell
CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
  --webui-mcp-proxy \
  --alias llamacpp-model \
  -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
  -ctk turbo3 -ctv turbo3
```
No HTTP traffic required; crash happens during model init / warmup.
First Bad Commit
No response