graph: guard iswa kq_mask on its own buffer #24294
Merged
ServeurpersoCom merged 1 commit on Jun 8, 2026
A SWA-only draft head (e.g. StepFun MTP) leaves the base sub-cache empty, so its kq_mask buffer stays null and asserts at load. Guard each mask on its own buffer in set_input and can_reuse, base and swa.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ggerganov approved these changes on Jun 8, 2026
CISC approved these changes on Jun 8, 2026
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request on Jun 9, 2026
Two fixes so a standalone MTP draft GGUF whose layers are ALL nextn (gemma4-assistant: n_layer_all == n_layer_nextn, so n_layer() == 0) initializes and engages speculative decoding:

1. llama-kv-cache.cpp: the ctor iterated hparams.n_layer() (excludes nextn layers) for the per-layer KV loop; the ggml-org#24060 reconciliation wired it to the nextn-excluding method, but upstream loops the full hparams.n_layer member. With n_layer() == 0 the draft registered ZERO KV layers -> map_layer_ids empty -> get_k(0) threw std::out_of_range during draft-context reserve. Loop over hparams.n_layer_all instead; has_kv() still gates per-layer.

2. llama-graph.cpp: port upstream ggml-org#24294 - guard the iSWA kq_mask on its own buffer in set_input/can_reuse (base and swa). A SWA-only draft head leaves the base sub-cache empty, so its mask buffer is null.

Verified on RDNA4/Vulkan: gemma4-12B MTP assistant loads, drafts at ~0.56-0.75 acceptance with q8_0 K / turbo4 V.
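The first fix in the commit message above can be illustrated with a minimal sketch. `draft_hparams` and `register_kv_layers` are hypothetical stand-ins, not the real llama.cpp types, and the sketch assumes that `n_layer()` is simply `n_layer_all - n_layer_nextn` and that every layer owns KV. It only shows why a loop bounded by `n_layer()` registers nothing when all layers are nextn.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the hyperparameters named in the commit message.
struct draft_hparams {
    int n_layer_all   = 0; // every layer in the GGUF, nextn layers included
    int n_layer_nextn = 0; // number of nextn (MTP) layers

    // Assumption: the nextn-excluding count is a plain difference.
    int n_layer() const { return n_layer_all - n_layer_nextn; }

    // Assumption: every layer in range has KV; the real check may differ.
    bool has_kv(int il) const { return il >= 0 && il < n_layer_all; }
};

// Collect the layer ids that would get a KV entry.
// include_nextn == false mimics the loop bounded by n_layer();
// include_nextn == true mimics the loop bounded by n_layer_all.
static std::vector<int> register_kv_layers(const draft_hparams & hp, bool include_nextn) {
    const int n = include_nextn ? hp.n_layer_all : hp.n_layer();
    std::vector<int> ids;
    for (int il = 0; il < n; ++il) {
        if (hp.has_kv(il)) {
            ids.push_back(il); // has_kv() still gates each layer
        }
    }
    return ids;
}
```

With `n_layer_all == n_layer_nextn`, the first variant returns an empty list, which matches the empty `map_layer_ids` described above, while the second registers every layer.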
Overview
Fix load crash for draft-mtp models with a SWA-only draft head (e.g. StepFun Step-3.7-Flash). The draft's base (non-SWA) sub-cache has no layers, so its kq_mask buffer stays null and set_input_kq_mask asserts during the seq_rm probe at load. Guard each kq_mask on its own buffer in set_input and can_reuse, base and swa.
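The guard described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `mask_tensor`, `iswa_masks`, `has_buffer` and `fill_mask` are hypothetical stand-ins for the real graph input types, and the reuse check compares only a token count.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins; these are NOT the real ggml / llama.cpp types.
struct mask_tensor {
    void *  buffer   = nullptr; // stays null when the owning sub-cache has no layers
    int64_t n_tokens = 0;       // shape the mask was built for
    int     n_fills  = 0;       // counts writes, for the usage example only
};

struct iswa_masks {
    mask_tensor * base = nullptr; // non-SWA sub-cache mask
    mask_tensor * swa  = nullptr; // SWA sub-cache mask
};

static bool has_buffer(const mask_tensor * m) {
    return m != nullptr && m->buffer != nullptr;
}

// Writing into an unallocated mask is the failure mode: it asserts.
static void fill_mask(mask_tensor * m) {
    assert(m->buffer != nullptr);
    m->n_fills++;
}

// Each mask is guarded on its OWN buffer, so an empty base sub-cache
// no longer blocks (or crashes) the SWA side, and vice versa.
static int set_input(iswa_masks & in) {
    int filled = 0;
    if (has_buffer(in.base)) { fill_mask(in.base); filled++; }
    if (has_buffer(in.swa))  { fill_mask(in.swa);  filled++; }
    return filled;
}

// Same guard for the reuse check: only allocated masks take part.
static bool can_reuse(const iswa_masks & in, int64_t n_tokens) {
    bool ok = true;
    if (has_buffer(in.base)) { ok = ok && in.base->n_tokens == n_tokens; }
    if (has_buffer(in.swa))  { ok = ok && in.swa->n_tokens  == n_tokens; }
    return ok;
}
```

For a SWA-only draft head, `base` has a null buffer and `swa` is allocated: `set_input` fills only the SWA mask, and `can_reuse` compares only the SWA shape.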
Additional information
Following #23398 (Gemma 4 MTP), a regression in StepFun Step-3.7-Flash loading was reported by @vbooka1 and confirmed by @forforever73. Thanks @ggerganov for the can_reuse guards; guarding on the mask's own buffer (not self_k_idxs_swa) covers the SWA-only case too.

Tested on Step-3.7-Flash (Q2_K_XL + Q8/BF16 draft, q8_0 and f16 KV): loads cleanly, and greedy output is identical with and without MTP. Needs --spec-draft-n-max 1 (the Step MTP head is single-token).
Requirements