CUDA SET_ROWS turbo3: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) fails when row width is 576 (e.g. GLM-4.7 Flash / deepseek2 K heads) #13

@dan-and

Description

FYI: I am not experienced with CUDA development, so I'd rather report this here for people like @signalnine to analyze.

So far I have seen this issue only with:

  • GLM-4.7 Flash

The following models load and work fine:

  • Mistral's: Devstral-2, Devstral-Small, ministral3-arch (Ministral3-14B / 3B)

  • OpenAI's gptoss-arch: gpt-oss 120B, gpt-oss 20B

  • MiniMax 2.1

  • gemma3-arch: Essential AI rnj-1, gemma-3-12b

  • Qwen3-next arch: Qwen3-Coder-Next, Qwen3-Next-80B-A3B-Instruct

  • qwen35moe arch: Qwen3.5-122B-A10B, Qwen3.5-4B, Qwen3.5-27B

  • llama: llama3.3 8B

  • nemotron_h_moe-arch: Nemotron-3-Nano-30B-A3B, Nemotron-3-Super-120B-A12B

  • olmo2-arch: Allenai Olmo-3-7B, Olmo-3.1-32B

    ────────────────────────────────────────

    Summary

    With KV cache types -ctk turbo3 -ctv turbo3, llama-server aborts during the default warmup decode (common_init_from_params). The failure is an assertion in set_rows_cuda_turbo3: the first dimension of the source tensor (ne00)
    must be a multiple of QK_TURBO3_GROUP (128), but for this model it is 576 (576 % 128 = 64), so the assert fires.

    ────────────────────────────────────────

    Environment

    • OS: Linux x86_64 (e.g. Ubuntu 24.04)
    • GPU: e.g. NVIDIA RTX 3080, CUDA enabled
    • Build: llama-server with CUDA, debug or relwithdebinfo (asserts enabled)
    • Model: GLM-4.7 Flash GGUF (e.g. GLM-4.7-Flash-Q4_K_M.gguf, arch reported as deepseek2, n_embd_head_k = 576)

    ────────────────────────────────────────

    Steps to reproduce

    CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
      --webui-mcp-proxy \
      --alias llamacpp-model \
      -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
      -ctk turbo3 -ctv turbo3

    No HTTP traffic required; crash happens during model init / warmup.

    ────────────────────────────────────────

    Expected behavior

    Server loads, completes warmup decode, and listens (or fails with a clear user-facing error if turbo3 is unsupported for this tensor layout).

    ────────────────────────────────────────

    Actual behavior

    Process aborts with:

    ggml/src/ggml-cuda/set-rows.cu:387: GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) failed

    Exit code 134 (SIGABRT).

    ────────────────────────────────────────

    Root cause (analysis)

    • QK_TURBO3_GROUP is 128 (ggml-common.h).
    • set_rows_cuda_turbo3 assumes ne00 is divisible by 128 before launching k_set_rows_turbo3 (one block per 128-element group).
    • For this model/graph, the SET_ROWS op uses ne00 = 576 (consistent with K head dimension for deepseek2-style config), and 576 is not a multiple of 128.

    ────────────────────────────────────────

    Stack trace (representative)

    set_rows_cuda_turbo3 (set-rows.cu:387)
    set_rows_cuda<float, long> (set-rows.cu:512)
    ggml_cuda_op_set_rows (set-rows.cu:527)
    ggml_cuda_compute_forward (ggml-cuda.cu)
    ggml_backend_sched_graph_compute_async
    llama_context::graph_compute
    llama_context::process_ubatch / decode
    common_init_from_params (warmup llama_decode)
    server_context_impl::load_model
    main

    ────────────────────────────────────────

    Additional data

    • Without -ctk turbo3 -ctv turbo3, the same model/server path can succeed (f16 KV), so this is specific to the turbo3 SET_ROWS path and tensor shapes.
    • A core dump analyzed in GDB shows ne00 = 576 and ne01 = 2 at the failing frame.

    ────────────────────────────────────────

    Suggested directions for a fix

    1. Kernel / launcher: Support row sizes where ne00 % 128 != 0 (e.g. full groups + partial tail block, or pad to a multiple of 128 with a defined layout).
    2. Graph / scheduling: Avoid emitting CUDA SET_ROWS with GGML_TYPE_TURBO3_0 for tensors whose leading dim is not a multiple of 128, or reshape/split the op so each row chunk is 128-aligned.
    3. UX: If unsupported combinations are detected early, fail at init with a message naming head dim vs turbo3 group size, instead of GGML_ASSERT in CUDA code.

    ────────────────────────────────────────

    Labels (suggestion)

    bug, cuda, turboquant / kv-cache, assert

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 5500 GT + RTX 3080 20GB

Models

Unsloth https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF

Problem description & steps to reproduce

CUDA_VISIBLE_DEVICES=0 build/bin/llama-server \
  --webui-mcp-proxy \
  --alias llamacpp-model \
  -m /path/to/GLM-4.7-Flash-Q4_K_M.gguf \
  -ctk turbo3 -ctv turbo3

No HTTP traffic required; crash happens during model init / warmup.

First Bad Commit

No response

Relevant log output

