SYCL gated_delta_net K>1 #23174
I opened a separate issue with more details: I also tested this PR locally because it seems directly related to the SYCL On my setup, I still see excessive memory usage and a severe slowdown when SYCL and Vulkan + MTP does not show the same level of memory growth or slowdown on the same model/prompt/settings. So this may not be a general SYCL issue or a model-size issue, but possibly an interaction between SYCL, |
Thanks @Yuimi062, my comment on Vulkan was based only on R-SITES testing. I think the poor speed on SYCL with MTP is a separate issue, since pre-patch I get the same speed (and memory behavior) as post-patch, just with garbled text. Let me see if I can find anything about the issue you raised; I'll respond further there if so.
@karavayev |
* sycl_gated_delta_net K>1 * editor_config
* origin/master:
  server: only parse empty msg if continuing an assistant msg (ggml-org#23506)
  perplexity : fix integer overflow (ggml-org#23496)
  SYCL: improve MoE prefill throughput (ggml-org#23142)
  sycl : Level Zero detection in ggml_sycl_init (ggml-org#23097)
  SYCL : gated_delta_net K>1 (ggml-org#23174)
  SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (ggml-org#21580)
  docs: Update documentation with Granite 4.0/4.1 (ggml-org#23404)
  ggml-zendnn : add Q8_0 quantization support (ggml-org#23414)
  cmake : build router app only during standalone builds (ggml-org#23521)
  vocab : fix HybridDNA tokenizer (ggml-org#23466)
  cmake : add install() for impl libraries + fix apple builds (ggml-org#23511)
  CUDA: fix PDL CC check for JIT compilation (ggml-org#23471)
  cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (ggml-org#23462)
  Update WebGPU support and add link to blog/demo (ggml-org#23483)
  vulkan: fuse snake activation (mul, sin, sqr, mul, add) (ggml-org#22855)
Extend the OpenCL gated_delta_net kernel to support K>1 input/output state slots, matching the CUDA / Metal / Vulkan / SYCL implementations landed by upstream PR ggml-org#22673 ("llama + spec: MTP Support") and PR ggml-org#23174 (SYCL K>1).

MTP draft heads predict K tokens ahead; the verify batch then rolls back any rejected draft tokens by reading from the K snapshot slots the forward pass writes during the n_tokens loop. K==1 is the legacy backwards-compatible single-slot final-state-only behaviour.

Layout
- Input state: (S_v*S_v*H, K, n_seqs); only slot 0 carries the seed.
- Output state: K slots stacked as the outermost dim, each S_v*S_v*H*n_seqs floats. shift = n_tokens - K; the kernel writes this t's state to slot (t - shift) when 0 <= target_slot < K.
- For K>n_tokens (cold spec restart), only the last n_tokens slots are written; earlier slots are caller-owned and left untouched.
- For K==1 the per-t write condition fires once on the last iteration (slot 0 = final state), preserving prior semantics.

Both kernels updated
- kernel_gated_delta_net_f32 (generic, any S_v <= 128): adopts a private working column s_col[GDN_GENERIC_MAX_SV] so the per-t slot write doesn't have to read back from global between tokens. Replaces the previous in-place global s_out modification.
- kernel_gated_delta_net_f32_sv128 (Qwen3-Next / Qwen3.6-A3B fast path): state was already kept in per-lane private s_shard[4]; just added the per-t slot write loop using the same target_slot rule.

Dispatch derives K from src_state->ne[1] and forwards it as the last kernel arg. supports_op needed no change: the existing f32-only gate already accepts both K==1 and K>1 ops.

test-backend-ops -o GATED_DELTA_NET: 36/36 pass (was 28/36; the 8 K∈{2,3,4} cases are now green). FLASH_ATTN_EXT regression check: 2564/2564.

Perf: feature-correctness commit; further tuning (cluster-32 ALU optimisations, k_img staging for slot writes, etc.) deferred.
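The slot-write rule in the commit message above (shift = n_tokens - K, write to slot t - shift when it falls in [0, K)) can be sketched on the host side as follows. This is only an illustration of the indexing, not code from the PR: the helper names `gdn_target_slot`, `gdn_slot_sources` and `gdn_slot_offset` are hypothetical, and the offset helper assumes the output slots are contiguous with the slot index as the outermost dimension, as the layout notes describe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): the output slot that token t's state
// is written to, or -1 when token t is not snapshotted.
inline int64_t gdn_target_slot(int64_t t, int64_t n_tokens, int64_t K) {
    assert(K >= 1 && n_tokens >= 0);
    const int64_t shift = n_tokens - K;  // negative when K > n_tokens
    const int64_t slot  = t - shift;
    return (slot >= 0 && slot < K) ? slot : -1;
}

// Hypothetical helper: for each of the K slots, the token index whose state
// lands there. -1 marks a slot the loop never writes (caller-owned).
inline std::vector<int64_t> gdn_slot_sources(int64_t n_tokens, int64_t K) {
    std::vector<int64_t> src(static_cast<std::size_t>(K), int64_t{-1});
    for (int64_t t = 0; t < n_tokens; ++t) {
        const int64_t slot = gdn_target_slot(t, n_tokens, K);
        if (slot >= 0) {
            src[static_cast<std::size_t>(slot)] = t;
        }
    }
    return src;
}

// Hypothetical helper: offset (in floats) of a slot in the output state,
// assuming each slot holds S_v*S_v*H*n_seqs floats and slots are outermost.
inline int64_t gdn_slot_offset(int64_t slot, int64_t S_v, int64_t H, int64_t n_seqs) {
    return slot * S_v * S_v * H * n_seqs;
}
```

With n_tokens = 4 and K = 2, only tokens 2 and 3 are snapshotted (slots 0 and 1). With K = 1 the single write happens on the last token, and with K = 4 and n_tokens = 2 only slots 2 and 3 are written, which matches the cold-restart case described above.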
Overview
Fix the K>1 failures in test-backend-ops gated_delta_net by porting the MTP-relevant code snippets from ggml-cuda/gated_delta_net.cu to ggml-sycl/gated_delta_net.cpp. Without this patch, MTP on SYCL gives garbled output after a few tokens. With this patch, MTP output on SYCL is normal and similar in speed to MTP on Vulkan, though it is not necessarily faster than SYCL without MTP yet.
No new code was written; the existing code was just copy-pasted into the relevant sections.
Prior to this PR:
After this PR:
Additional information
Eval bug: MTP support in SYCL #23149
Requirements