graph: guard iswa kq_mask on its own buffer by ServeurpersoCom · Pull Request #24294 · ggml-org/llama.cpp

ServeurpersoCom · 2026-06-08T08:08:53Z

Overview

Fixes a load-time crash for draft-mtp models with a SWA-only draft head (e.g. StepFun Step-3.7-Flash). The draft's base (non-SWA) sub-cache has no layers, so its kq_mask buffer stays null and set_input_kq_mask asserts during the seq_rm probe at load. This change guards each kq_mask on its own buffer in set_input and can_reuse, for both the base and SWA masks.
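To illustrate the guard described above, here is a minimal, self-contained sketch. It does not use the real llama.cpp or ggml types: `mock_tensor`, `mock_attn_input_iswa`, `fill_mask` and the return value of `set_input` are hypothetical stand-ins chosen for this example, and the actual patch may be structured differently. The point is only that each mask is checked against its own buffer before it is filled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a ggml tensor: only the backing buffer matters here.
struct mock_tensor {
    std::vector<float> * buffer = nullptr; // stays null if the tensor was never allocated
};

// Hypothetical stand-in for the iSWA attention input: one mask per sub-cache.
struct mock_attn_input_iswa {
    mock_tensor kq_mask;     // base (non-SWA) sub-cache mask
    mock_tensor kq_mask_swa; // SWA sub-cache mask
};

// Mimics a setter that asserts on a null buffer (the failure mode described above).
static void fill_mask(mock_tensor & mask, float value) {
    assert(mask.buffer != nullptr && "mask buffer must be allocated");
    for (float & v : *mask.buffer) {
        v = value;
    }
}

// Guard each mask on its own buffer; returns how many masks were filled.
static int set_input(mock_attn_input_iswa & inp) {
    int n_filled = 0;
    if (inp.kq_mask.buffer != nullptr) {     // base mask: skipped for a SWA-only head
        fill_mask(inp.kq_mask, 0.0f);
        ++n_filled;
    }
    if (inp.kq_mask_swa.buffer != nullptr) { // SWA mask: guarded the same way
        fill_mask(inp.kq_mask_swa, 0.0f);
        ++n_filled;
    }
    return n_filled;
}
```

With only `kq_mask_swa.buffer` set, `set_input` fills the SWA mask and skips the base mask instead of reaching the assert.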

Additional information

Follows #23398 (Gemma 4 MTP). The regression on StepFun Step-3.7-Flash loading was reported by @vbooka1 and confirmed by @forforever73. Thanks @ggerganov for the can_reuse guards; guarding on the mask's own buffer (rather than on self_k_idxs_swa) covers the SWA-only case too.

Tested on Step-3.7-Flash (Q2_K_XL + Q8/BF16 draft, q8_0 and f16 KV): loads cleanly, and greedy output is identical with and without MTP. Needs --spec-draft-n-max 1 (the Step MTP head is single-token).

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES

A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Two fixes so that a standalone MTP draft GGUF whose layers are ALL nextn (gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0) initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (which excludes nextn layers) for the per-layer KV loop. The ggml-org#24060 reconciliation wired it to the nextn-excluding method, but upstream loops the full hparams.n_layer member. With n_layer() == 0 the draft registered ZERO KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range during draft-context reserve. Loop over hparams.n_layer_all instead; has_kv() still gates per-layer.

2. llama-graph.cpp: port upstream ggml-org#24294, which guards the iSWA kq_mask on its own buffer in set_input/can_reuse (base and swa). A SWA-only draft head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: the gemma4-12B MTP assistant loads and drafts at ~0.56-0.75 acceptance with q8_0 K / turbo4 V.
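The first fix can be sketched as follows. This is a simplified model rather than the llama.cpp constructor: `mock_hparams`, its always-true `has_kv`, and `build_map_layer_ids` are hypothetical names introduced for illustration, and only the relationship n_layer() == n_layer_all - n_layer_nextn is taken from the description above. It shows why iterating over n_layer() registers no KV layers when every layer is a nextn layer, while iterating over n_layer_all registers all of them.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified hyperparameters for an all-nextn draft model.
struct mock_hparams {
    uint32_t n_layer_all   = 0; // total layers, including nextn layers
    uint32_t n_layer_nextn = 0; // nextn (MTP) layers

    // Layer count that excludes nextn layers, as described in the commit message.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }

    // Assumption for this sketch: every layer carries a KV cache.
    bool has_kv(uint32_t il) const { (void) il; return true; }
};

// Build a layer-id -> cache-slot map by iterating n_iter layers.
static std::map<uint32_t, uint32_t> build_map_layer_ids(const mock_hparams & hp, uint32_t n_iter) {
    assert(n_iter <= hp.n_layer_all);
    std::map<uint32_t, uint32_t> ids;
    for (uint32_t il = 0; il < n_iter; ++il) {
        if (!hp.has_kv(il)) {
            continue; // per-layer gate still applies
        }
        const uint32_t slot = static_cast<uint32_t>(ids.size());
        ids[il] = slot;
    }
    return ids;
}
```

For a model with n_layer_all == n_layer_nextn, passing `hp.n_layer()` yields an empty map (so a later lookup of layer 0 would fail), whereas passing `hp.n_layer_all` registers every layer.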

ServeurpersoCom requested a review from CISC as a code owner June 8, 2026 08:08

ggerganov approved these changes Jun 8, 2026

CISC approved these changes Jun 8, 2026

ServeurpersoCom mentioned this pull request Jun 8, 2026

llama : add Gemma4 MTP #23398

Merged

ServeurpersoCom merged commit a66d505 into ggml-org:master Jun 8, 2026
25 of 26 checks passed

TheTom mentioned this pull request Jun 9, 2026

Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ TheTom/llama-cpp-turboquant#172

Merged

ServeurpersoCom merged 1 commit into ggml-org:master from ServeurpersoCom:fix/iswa-kqmask-null-buffer
