SYCL gated_delta_net K>1 by karavayev · Pull Request #23174 · ggml-org/llama.cpp

karavayev · 2026-05-17T01:09:33Z

Overview

Fix the K>1 failures in the test-backend-ops gated_delta_net tests by porting the MTP-relevant code snippets from ggml-cuda/gated_delta_net.cu to ggml-sycl/gated_delta_net.cpp. Without this patch, MTP on SYCL produces garbled output after a few tokens. With this patch, MTP output on SYCL is normal and similar in speed to MTP on Vulkan, though it is not necessarily faster than SYCL without MTP yet.

No new code; the CUDA snippets were just copy-pasted into the relevant sections.

Prior to this PR:

load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
  GGML_SYCL_GRAPH: yes
  GGML_SYCL_DNNL: yes
  GGML_SYCL_SUPPORT_LEVEL_ZERO: yes
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_ENABLE_LEVEL_ZERO: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
  GGML_SYCL_ENABLE_FLASH_ATTN: 1
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe211]|   20.1|    160|    1024|   32| 24385M|        1.13.35563+10|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
Testing 2 devices

Backend 1/2: SYCL0
  Device description: Intel(R) Graphics [0xe211]
  Device memory: 23256 MB (2267 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=256,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=65,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=200,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=33,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
[GATED_DELTA_NET] ERR = 1.509518692 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=2,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=2): FAIL
[GATED_DELTA_NET] ERR = 1.329997385 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4): FAIL
[GATED_DELTA_NET] ERR = 1.152509605 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4): FAIL
[GATED_DELTA_NET] ERR = 1.069939604 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=8,head_size=128,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4): FAIL
[GATED_DELTA_NET] ERR = 1.143701992 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=4): FAIL
[GATED_DELTA_NET] ERR = 1.298812164 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1,K=4): FAIL
[GATED_DELTA_NET] ERR = 1.394819220 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=8,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=3): FAIL
[GATED_DELTA_NET] ERR = 1.132955602 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=16,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4): FAIL
  28/36 tests passed

Failing tests:
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=2,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=2)
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4)
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4)
  GATED_DELTA_NET(type=f32,head_count=8,head_size=128,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4)
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=4)
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1,K=4)
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=8,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=3)
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=16,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4)
  Backend SYCL0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL

After this PR:

load_backend: loaded SYCL backend from /app/libggml-sycl.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
  GGML_SYCL_GRAPH: yes
  GGML_SYCL_DNNL: yes
  GGML_SYCL_SUPPORT_LEVEL_ZERO: yes
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_ENABLE_LEVEL_ZERO: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
  GGML_SYCL_ENABLE_FLASH_ATTN: 1
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Graphics [0xe211]|   20.1|    160|    1024|   32| 24385M|        1.13.35563+10|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
Testing 2 devices

Backend 1/2: SYCL0
  Device description: Intel(R) Graphics [0xe211]
  Device memory: 23256 MB (2267 MB free)

  GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=256,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=65,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=200,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=127,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=64,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=33,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=100,n_seqs=1,v_repeat=1,permuted=0,kda=1,K=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=2,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=2): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=128,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=4): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1,K=4): OK
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1,K=4): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=8,n_seqs=1,v_repeat=1,permuted=0,kda=0,K=3): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=16,n_seqs=2,v_repeat=1,permuted=0,kda=0,K=4): OK
  36/36 tests passed
  Backend SYCL0: OK
Backend 2/2: CPU
  Skipping CPU backend
2/2 backends passed
OK

Additional information

Eval bug: MTP support in SYCL #23149

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, used for root-cause analysis and initial testing. This PR branch was handwritten by me. No real differences other than preserving comments across sycl/cuda versions.

Yuimi062 · 2026-05-17T12:29:16Z

I opened a separate issue with more details:

I also tested this PR locally because it seems directly related to the SYCL GATED_DELTA_NET K>1 / MTP path.

On my setup, I still see excessive memory usage and a severe slowdown when SYCL and draft-mtp are enabled together, even with a 2B MTP model. The important control result is that SYCL without MTP does reach the normal SYCL GDN path: my local instrumentation prints SYCL_ALLOC_DEBUG logs with K=1. However, when draft-mtp is enabled, statistics draft-mtp confirms that MTP is active, but the same SYCL GDN debug output no longer appears.

Vulkan + MTP does not show the same level of memory growth or slowdown on the same model/prompt/settings.

So this may not be a general SYCL issue or a model-size issue, but possibly an interaction between SYCL, draft-mtp, and the GATED_DELTA_NET K>1 / keep_rs path. I am not sure whether this is expected behavior for this PR, but it seems different from the PR description that MTP on SYCL should be similar in speed to MTP on Vulkan.

karavayev · 2026-05-17T12:56:43Z

Thanks @Yuimi062, my comment on Vulkan was based only on R-SITES's testing. I think the poor speed on SYCL with MTP is a separate issue, since for me pre-patch has the same speed (and memory behavior) as post-patch, just with garbled text. Let me see if I can find anything about the issue you raised; I'll respond further there if so.

arthw

Good job!

Thank you!

arthw · 2026-05-19T14:58:46Z

@karavayev
Please fix the EditorConfig issue in CI!

karavayev · 2026-05-19T22:47:49Z

@arthw sorry about that - I added EditorConfig, can you run the CI tests again? Webgpu fails likely related to #23299 (should be fixed now?)

arthw · 2026-05-20T02:47:35Z

> @arthw sorry about that - I added EditorConfig, can you run the CI tests again? Webgpu fails likely related to #23299 (should be fixed now?)

Good!

We don't need to fix the WebGPU issue, since this PR doesn't affect it.

Thank you!

* sycl_gated_delta_net K>1 * editor_config

* origin/master:
  server: only parse empty msg if continuing an assistant msg (ggml-org#23506)
  perplexity : fix integer overflow (ggml-org#23496)
  SYCL: improve MoE prefill throughput (ggml-org#23142)
  sycl : Level Zero detection in ggml_sycl_init (ggml-org#23097)
  SYCL : gated_delta_net K>1 (ggml-org#23174)
  SYCL: add BF16 to DMMV kernel path (~4x tg speedup on Intel Arc) (ggml-org#21580)
  docs: Update documentation with Granite 4.0/4.1 (ggml-org#23404)
  ggml-zendnn : add Q8_0 quantization support (ggml-org#23414)
  cmake : build router app only during standalone builds (ggml-org#23521)
  vocab : fix HybridDNA tokenizer (ggml-org#23466)
  cmake : add install() for impl libraries + fix apple builds (ggml-org#23511)
  CUDA: fix PDL CC check for JIT compilation (ggml-org#23471)
  cmake : remove STATIC from impl libraries, enable LLAMA_BUILD_APP by default (ggml-org#23462)
  Update WebGPU support and add link to blog/demo (ggml-org#23483)
  vulkan: fuse snake activation (mul, sin, sqr, mul, add) (ggml-org#22855)

Extend the OpenCL gated_delta_net kernel to support K>1 input/output state slots, matching the CUDA / Metal / Vulkan / SYCL implementations landed by upstream PR ggml-org#22673 ("llama + spec: MTP Support") and PR ggml-org#23174 (SYCL K>1). MTP draft heads predict K tokens ahead; the verify batch then rolls back any rejected draft tokens by reading from the K snapshot slots the forward pass writes during the n_tokens loop. K==1 is the legacy backwards-compatible single-slot final-state-only behaviour.

Layout
- Input state: (S_v*S_v*H, K, n_seqs); only slot 0 carries the seed.
- Output state: K slots stacked as the outermost dim, each S_v*S_v*H*n_seqs floats. shift = n_tokens - K; the kernel writes this t's state to slot (t - shift) when 0 <= target_slot < K.
- For K>n_tokens (cold spec restart), only the last n_tokens slots are written; earlier slots are caller-owned and left untouched.
- For K==1 the per-t write condition fires once on the last iteration (slot 0 = final state), preserving prior semantics.

Both kernels updated
- kernel_gated_delta_net_f32 (generic, any S_v <= 128): adopts a private working column s_col[GDN_GENERIC_MAX_SV] so the per-t slot write doesn't have to read back from global between tokens. Replaces the previous in-place global s_out modification.
- kernel_gated_delta_net_f32_sv128 (Qwen3-Next / Qwen3.6-A3B fast path): state was already kept in per-lane private s_shard[4]; just added the per-t slot write loop using the same target_slot rule.

Dispatch derives K from src_state->ne[1] and forwards it as the last kernel arg. supports_op needed no change: the existing f32-only gate already accepts both K==1 and K>1 ops.

test-backend-ops -o GATED_DELTA_NET: 36/36 pass (was 28/36; the 8 K∈{2,3,4} cases now green). FLASH_ATTN_EXT regression check: 2564/2564.

Perf: feature-correctness commit; further tuning (cluster-32 ALU optimisations, k_img staging for slot writes, etc.) deferred.
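
The slot rule in the commit message above (shift = n_tokens - K, write token t's state to slot t - shift when that slot lies in [0, K)) can be sketched on the host side as below. This is a minimal illustration of the indexing only, not kernel code from any backend. The function names `gdn_target_slot`, `gdn_slot_offset` and `gdn_written_slots` are hypothetical, and the offset formula assumes the "K slots stacked as the outermost dim" layout described there.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: output slot that receives the state after token t,
// or -1 when no snapshot is written for that token.
int64_t gdn_target_slot(int64_t t, int64_t n_tokens, int64_t K) {
    const int64_t shift = n_tokens - K; // negative when K > n_tokens
    const int64_t slot  = t - shift;
    return (slot >= 0 && slot < K) ? slot : -1;
}

// Hypothetical helper: offset (in floats) of a slot in the output state,
// assuming each slot holds S_v*S_v*H*n_seqs floats and slots are outermost.
std::size_t gdn_slot_offset(int64_t slot, int64_t S_v, int64_t H, int64_t n_seqs) {
    return static_cast<std::size_t>(slot * S_v * S_v * H * n_seqs);
}

// Hypothetical helper: slots written while looping over all tokens, in order.
std::vector<int64_t> gdn_written_slots(int64_t n_tokens, int64_t K) {
    std::vector<int64_t> slots;
    for (int64_t t = 0; t < n_tokens; ++t) {
        const int64_t s = gdn_target_slot(t, n_tokens, K);
        if (s >= 0) {
            slots.push_back(s);
        }
    }
    return slots;
}
```

With n_tokens=8 and K=3, only tokens 5, 6 and 7 produce snapshots, in slots 0, 1 and 2. With K=1 the single write happens on the last token into slot 0. With n_tokens=2 and K=4, only slots 2 and 3 are written, which matches the "only the last n_tokens slots" case.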

sycl_gated_delta_net K>1

19d65bc

karavayev requested a review from a team as a code owner May 17, 2026 01:09

github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 17, 2026

sanmai mentioned this pull request May 17, 2026

[SYCL] Offload MTP to CPU #23187

Closed

Yuimi062 mentioned this pull request May 17, 2026

Misc. bug: SYCL-specific excessive memory usage and slowdown with draft-mtp speculative decoding, even on 2B model #23203

Open

arthw approved these changes May 18, 2026

This was referenced May 18, 2026

Eval bug: MTP support in SYCL #23149

Closed

SYCL MTP produces garbled output on Intel Arc GPU (draft self-prediction incorrect) #23155

Closed

This was referenced May 18, 2026

[SYCL] improve MoE prefill throughput (+70% with Qwen3.6-35B) #23142

Merged

sycl : port multi-column MMVQ from CUDA backend (~45% speculative decoding speedup on Intel Arc) #21845

Merged

editor_config

d8651eb

arthw mentioned this pull request May 20, 2026

Misc. bug: Performance from sycl with server-intel on Qwen 3.6 dropped from 32t/s to 25t/s after update server-intel-b9159 #23160

Open

arthw added the merge ready label (A maintainer can use this label to indicate that they consider the changes final and ready to merge.) May 20, 2026

ggerganov merged commit 56f16f2 into ggml-org:master May 22, 2026
52 checks passed

karavayev deleted the PR branch May 22, 2026 13:21

Alex7MV pushed a commit to Alex7MV/claude_llama.cpp that referenced this pull request May 22, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

9de3af6

* sycl_gated_delta_net K>1 * editor_config

ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 22, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

1c60363

* sycl_gated_delta_net K>1 * editor_config

THEman6989 mentioned this pull request May 22, 2026

Add install() for impl libraries and fix Apple/Android builds THEman6989/llama.cpp-gfx906-turbo-mtp#1

Merged

R-SITES mentioned this pull request May 22, 2026

SYCL MTP on Intel Arc: correct output but no speed gain over baseline #23533

Open

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

cc6d20c

* sycl_gated_delta_net K>1 * editor_config

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

4a5ba70

* sycl_gated_delta_net K>1 * editor_config

kashif pushed a commit to kashif/llama.cpp that referenced this pull request May 23, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

70f0e4d

* sycl_gated_delta_net K>1 * editor_config

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

0e46e1e

* sycl_gated_delta_net K>1 * editor_config

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

SYCL : gated_delta_net K>1 (ggml-org#23174)

2af22bc

* sycl_gated_delta_net K>1 * editor_config

SYCL gated_delta_net K>1 #23174: ggerganov merged 2 commits into ggml-org:master from karavayev:PR
