CUDA SET_ROWS turbo3: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576 (e.g. GLM-4.7 Flash / deepseek2 K heads) #13

@dan-and

Description

FYI: I am not experienced with CUDA development, so I'd rather report this here for people like @signalnine to analyze.

So far I have seen this issue only with:

  • GLM-4.7 Flash

The following models load and work fine:

  • Mistral's: Devstral-2, Devstral-Small, ministral3-arch (Ministral3-14B / 3B)

  • OpenAI's gptoss-arch: gpt-oss 120B, gpt-oss 20B

  • MiniMax 2.1

  • gemma3-arch: Essential AI rnj-1, gemma-3-12b

  • Qwen3-next arch: Qwen3-Coder-Next, Qwen3-Next-80B-A3B-Instruct

  • qwen35moe arch: Qwen3.5-122B-A10B, Qwen3.5-4B, Qwen3.5-27B

  • llama: llama3.3 8B

  • nemotron_h_moe-arch: Nemotron-3-Nano-30B-A3B, Nemotron-3-Super-120B-A12B

  • olmo2-arch: Allenai Olmo-3-7B, Olmo-3.1-32B

    ────────────────────────────────────────

    Summary

    With KV cache types -ctk turbo3 -ctv turbo3, llama-server aborts during the default warmup decode (common_init_from_params). The failure is an assertion in set_rows_cuda_turbo3: the first dimension of the source tensor (ne00)
    must be a multiple of QK_TURBO3_GROUP (128), but for this model it is 576 (576 % 128 = 64), so the assert fires.

    ────────────────────────────────────────

    Environment

    • OS: Linux x86_64 (e.g. Ubuntu 24.04)
    • GPU: e.g. NVIDIA RTX 3080, CUDA enabled
    • Build: llama-server with CUDA, debug or relwithdebinfo (asserts enabled)
    • Model: GLM-4.7 Flash GGUF (e.g. GLM-4.7-Flash-Q4_K_M.gguf, arch reported as deepseek2, n_embd_head_k = 576)

    ────────────────────────────────────────

    Steps to reproduce

    CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
      --webui-mcp-proxy \
      --alias llamacpp-model \
      -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
      -ctk turbo3 -ctv turbo3

    No HTTP traffic required; crash happens during model init / warmup.

    ────────────────────────────────────────

    Expected behavior

    Server loads, completes warmup decode, and listens (or fails with a clear user-facing error if turbo3 is unsupported for this tensor layout).

    ────────────────────────────────────────

    Actual behavior

    Process aborts with:

    ggml/src/ggml-cuda/set-rows.cu:387: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) failed

    Exit code 134 (SIGABRT).

    ────────────────────────────────────────

    Root cause (analysis)

    • QK_TURBO3_GROUP is 128 (ggml-common.h).
    • set_rows_cuda_turbo3 assumes ne00 is divisible by 128 before launching k_set_rows_turbo3 (one block per 128-element group).
    • For this model/graph, the SET_ROWS op uses ne00 = 576 (consistent with K head dimension for deepseek2-style config), and 576 is not a multiple of 128.

    ────────────────────────────────────────

    Stack trace (representative)

    set_rows_cuda_turbo3 (set-rows.cu:387)
    set_rows_cuda<float, long> (set-rows.cu:512)
    ggml_cuda_op_set_rows (set-rows.cu:527)
    ggml_cuda_compute_forward (ggml-cuda.cu)
    ggml_backend_sched_graph_compute_async
    llama_context::graph_compute
    llama_context::process_ubatch / decode
    common_init_from_params (warmup llama_decode)
    server_context_impl::load_model
    main

    ────────────────────────────────────────

    Additional data

    • Without -ctk turbo3 -ctv turbo3, the same model/server path can succeed (f16 KV), so this is specific to the turbo3 SET_ROWS path and tensor shapes.
    • A core dump analyzed in GDB shows ne00 = 576 and ne01 = 2 at the failing frame.

    ────────────────────────────────────────

    Suggested directions for a fix

    1. Kernel / launcher: Support row sizes where ne00 % 128 != 0 (e.g. full groups + partial tail block, or pad to a multiple of 128 with a defined layout).
    2. Graph / scheduling: Avoid emitting CUDA SET_ROWS with GGML_TYPE_TURBO3_0 for tensors whose leading dim is not a multiple of 128, or reshape/split the op so each row chunk is 128-aligned.
    3. UX: If unsupported combinations are detected early, fail at init with a message naming head dim vs turbo3 group size, instead of GGML_ASSERT in CUDA code.

    ────────────────────────────────────────

    Labels (suggestion)

    bug, cuda, turboquant / kv-cache, assert

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 5500 GT + RTX 3080 20GB

Models

Unsloth https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

Problem description & steps to reproduce

CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
  --webui-mcp-proxy \
  --alias llamacpp-model \
  -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
  -ctk turbo3 -ctv turbo3

No HTTP traffic required; crash happens during model init / warmup.

First Bad Commit

No response

Relevant log output

