
ggml-webgpu: Add supports for GGML_OP_REPEAT #20230

Merged
reeselevine merged 3 commits into ggml-org:master from yomaytk:repeat-webgpu on Mar 11, 2026

Conversation

@yomaytk (Contributor) commented Mar 8, 2026

This PR adds support for GGML_OP_REPEAT to the WebGPU backend. The status of REPEAT for WebGPU in docs/ops.md is changed to "partially supported" because WebGPU doesn't seem to support i16.

Also, this PR includes formatting changes (clang-format) for the modified files. Since #20173 touches the same parts, this PR might need to be merged after that one.

@yomaytk yomaytk requested a review from reeselevine as a code owner March 8, 2026 09:51
@github-actions bot added the "documentation" (Improvements or additions to documentation) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Mar 8, 2026
@reeselevine (Contributor)

Potentially could support i16 by just treating the bytes being repeated as opaque, since they're just copied/repeated over to the dst buffer, but I don't think that's required. Curious, are there any particular models/usecases that need repeat? I haven't seen it come up in what I'm working on yet.

@reeselevine (Contributor)

And sounds good on merging after #20173, I should merge that soon.

@CISC (Member) commented Mar 9, 2026

> Curious, are there any particular models/usecases that need repeat? I haven't seen it come up in what I'm working on yet.

There are a couple, but most recently (and notably) Qwen 3.5.

@yomaytk (Contributor, Author) commented Mar 10, 2026

> Curious, are there any particular models/usecases that need repeat? I haven't seen it come up in what I'm working on yet.

I confirmed that DeepSeek-V2 also uses REPEAT, and DeepSeek-V3 may as well.

> Potentially could support i16 by just treating the bytes being repeated as opaque, since they're just copied/repeated over to the dst buffer

I see. As you mentioned, treating the repeated bytes as opaque would work. However, reusing the f16 path is simpler here, and the CPU backend does the same. So I added i16 support in the same way, and the i16 tests pass as well.

        case GGML_TYPE_F16:
        case GGML_TYPE_BF16:
        case GGML_TYPE_I16:
            {
                ggml_compute_forward_repeat_f16(params, dst);
            } break;

@reeselevine (Contributor)

Now that #20173 is merged, we just need to fix some conflicts, then we should be good to merge this!

@reeselevine (Contributor)

Looks good to me! Do you know if there are other operations/changes necessary to get Qwen 3.5 running? I tried this model off this branch and I'm getting a segfault in the WebGPU backend. I can try and debug unless you're already looking into/working on it.

@yomaytk (Contributor, Author) commented Mar 11, 2026

According to my experiments, the following operations required by Qwen 3.5 do not appear to be implemented yet in the WebGPU backend:

L2_NORM, SET, DIAG, TRI, SSM_CONV, FLASH_ATTN_EXT, SOLVE_TRI, GATED_DELTA_NET

> I tried this model off this branch and I'm getting a segfault in the WebGPU backend.

In my program (based on examples/simple-chat), the model also fails with the following errors:

[Screenshot (2026-03-11 19:01:54): error output]

On the other hand, unsloth/Qwen3.5-4B-Q4_0.gguf and unsloth/Qwen3.5-9B-Q4_0.gguf seem to work well with my program.

> I can try and debug unless you're already looking into/working on it.

I haven't been working on Qwen 3.5 support myself, so it would be great if you could work on this!

@reeselevine reeselevine merged commit f2ab047 into ggml-org:master Mar 11, 2026
73 of 79 checks passed
@yomaytk yomaytk deleted the repeat-webgpu branch March 12, 2026 01:45
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
* Add GGML_OP_REPEAT to webgpu backend.

* Add i16 support for GGML_OP_REPEAT.
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
