
graph: guard iswa kq_mask on its own buffer #24294

Merged
ServeurpersoCom merged 1 commit into ggml-org:master from ServeurpersoCom:fix/iswa-kqmask-null-buffer on Jun 8, 2026

Conversation

@ServeurpersoCom

Contributor

Overview

Fix load crash for draft-mtp models with a SWA-only draft head (e.g. StepFun Step-3.7-Flash). The draft's base (non-SWA) sub-cache has no layers, so its kq_mask buffer stays null and set_input_kq_mask asserts during the seq_rm probe at load. Guard each kq_mask on its own buffer in set_input and can_reuse, base and swa.
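
The guard described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual llama.cpp code: `BufferSketch`, `TensorSketch`, `IswaInputSketch`, and `fill_mask` are hypothetical stand-ins, and only the idea of checking each mask's own buffer before touching it is taken from the description.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not the real ggml/llama.cpp types.
struct BufferSketch { int id; };

struct TensorSketch {
    BufferSketch * buffer = nullptr; // stays null when the sub-cache has no layers
    int n_written = 0;               // counts how often the mask was filled
};

// Stand-in for the mask fill: like the assert described above,
// it requires the tensor to have an allocated buffer.
static void fill_mask(TensorSketch & t) {
    assert(t.buffer != nullptr);
    t.n_written++;
}

struct IswaInputSketch {
    TensorSketch * kq_mask     = nullptr; // base (non-SWA) mask
    TensorSketch * kq_mask_swa = nullptr; // SWA mask

    // Each mask is guarded on its own buffer, so an empty base sub-cache
    // is skipped while the SWA mask is still filled (and vice versa).
    // Returns the number of masks that were filled.
    int set_input() {
        int n_filled = 0;
        if (kq_mask && kq_mask->buffer) {
            fill_mask(*kq_mask);
            n_filled++;
        }
        if (kq_mask_swa && kq_mask_swa->buffer) {
            fill_mask(*kq_mask_swa);
            n_filled++;
        }
        return n_filled;
    }
};
```

With a base mask whose buffer is null and a SWA mask whose buffer is allocated, `set_input()` in this sketch fills only the SWA mask and never reaches the assert. Per the description, the same per-mask condition applies in `can_reuse`.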

Additional information

Following #23398 (Gemma 4 MTP), regression on StepFun Step-3.7-Flash loading reported by @vbooka1, confirmed by @forforever73. Thanks @ggerganov for the can_reuse guards; guarding on the mask's own buffer (not self_k_idxs_swa) covers the SWA-only case too. Tested on Step-3.7-Flash (Q2_K_XL + Q8/BF16 draft, q8_0 and f16 KV): loads clean, greedy output identical with/without MTP. Needs --spec-draft-n-max 1 (Step MTP head is single-token).

Requirements

A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache
empty, so its kq_mask buffer stays null and asserts at load. Guard
each mask on its own buffer in set_input and can_reuse, base and swa.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@ServeurpersoCom ServeurpersoCom requested a review from CISC as a code owner June 8, 2026 08:08
@ServeurpersoCom ServeurpersoCom merged commit a66d505 into ggml-org:master Jun 8, 2026
25 of 26 checks passed
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Jun 9, 2026
Two fixes so a standalone MTP draft GGUF whose layers are ALL nextn
(gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0)
initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (excludes
   nextn layers) for the per-layer KV loop; the ggml-org#24060 reconciliation
   wired it to the nextn-excluding method, but upstream loops the full
   hparams.n_layer member. With n_layer() == 0 the draft registered ZERO
   KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range
   during draft-context reserve. Loop over hparams.n_layer_all instead;
   has_kv() still gates per-layer.

2. llama-graph.cpp: port upstream ggml-org#24294 - guard the iSWA kq_mask on its
   own buffer in set_input/can_reuse (base and swa). A SWA-only draft
   head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: gemma4-12B MTP assistant loads, drafts at
~0.56-0.75 acceptance with q8_0 K / turbo4 V.
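
The first fix in the commit message above can be illustrated with a small sketch. It assumes `n_layer()` is `n_layer_all - n_layer_nextn`, as the message implies; `HParamsSketch` and `build_layer_map` are hypothetical names, and `has_kv` here simply treats every layer as having KV, which is a simplification.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the hyperparameters; not the real llama.cpp struct.
struct HParamsSketch {
    uint32_t n_layer_all   = 0; // all layers, including nextn layers
    uint32_t n_layer_nextn = 0; // nextn (MTP) layers

    // Assumption: the nextn-excluding count described in the commit message.
    uint32_t n_layer() const { return n_layer_all - n_layer_nextn; }

    // Simplification: every existing layer has KV in this sketch.
    bool has_kv(uint32_t il) const { return il < n_layer_all; }
};

// Registers KV layers for indices [0, n_iter), gated per layer by has_kv().
static std::map<uint32_t, uint32_t> build_layer_map(const HParamsSketch & hp, uint32_t n_iter) {
    std::map<uint32_t, uint32_t> map_layer_ids;
    for (uint32_t il = 0; il < n_iter; ++il) {
        if (!hp.has_kv(il)) {
            continue;
        }
        const uint32_t idx = (uint32_t) map_layer_ids.size();
        map_layer_ids[il] = idx;
    }
    return map_layer_ids;
}
```

For a draft where `n_layer_all == n_layer_nextn`, iterating up to `hp.n_layer()` yields an empty map, so a lookup such as `map_layer_ids.at(0)` would throw `std::out_of_range`; iterating up to `hp.n_layer_all` registers every layer.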
