ggml-webgpu: Add support for GGML_OP_REPEAT (#20230)
reeselevine merged 3 commits into ggml-org:master

Conversation
Potentially could support
Sounds good on merging after #20173; I should merge that one soon.
There are a couple, but most recently (and notably) Qwen 3.5. |
I confirmed that DeepSeek-V2 also uses REPEAT, and DeepSeek-V3 may as well.
I see. As you mentioned, treating the repeated bytes as opaque would work. However, using llama.cpp/ggml/src/ggml-cpu/ops.cpp lines 1787 to 1792 (at c96f608)
Now that #20173 is merged, we just need to fix some conflicts and then we should be good to merge this!
Looks good to me! Do you know if any other operations or changes are necessary to get Qwen 3.5 running? I tried this model off this branch and I'm getting a segfault in the WebGPU backend. I can try to debug it unless you're already looking into it.
According to my experiments, the following operations appear not to be implemented yet for Qwen 3.5 in the WebGPU backend.
In my program (based on
On the other hand, unsloth/Qwen3.5-4B-Q4_0.gguf and unsloth/Qwen3.5-9B-Q4_0.gguf seem to work well with my program.
I haven't been working on Qwen 3.5 support myself, so it would be great if you could work on this!
* Add GGML_OP_REPEAT to webgpu backend.
* Add i16 support for GGML_OP_REPEAT.
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  * convert : better mtp check and fix return [no ci] (ggml-org#20419)
  * vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  * New conversations now auto-select the first loaded model (ggml-org#20403)
  * ggml-virtgpu: Fix some build commands (ggml-org#20341)
  * metal : avoid divisions in bin kernel (ggml-org#20426)
  * ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  * vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  * vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  * vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  * opencl: use larger workgroup size for get_rows (ggml-org#20316)
  * opencl: add cumsum op (ggml-org#18981)
  * hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  * common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  * model : add support for Phi4ForCausalLMV (ggml-org#20168)
  * graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  * common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  * ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  * llama : enable chunked fused GDN path (ggml-org#20340)
  * llama : whitespace cleanup (ggml-org#20422)
  * ggml : add NVFP4 quantization type support (ggml-org#19769)
  * ...

This PR adds support for `GGML_OP_REPEAT` in the WebGPU backend. The status of `REPEAT` for WebGPU in `docs/ops.md` is changed to "partially supported" because WebGPU doesn't seem to support `i16`.

Also, this PR includes formatting changes (clang-format) for the modified files. Since #20173 touches the same parts, this PR might need to be merged after that one.
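For reference, `GGML_OP_REPEAT` tiles a source tensor to fill a larger destination whose extents are integer multiples of the source extents; each destination element reads the source element at the same index modulo the source extent, per dimension. A minimal standalone sketch of this semantics (2D only, i16 elements; the function name and signature are illustrative, not ggml's API):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of REPEAT semantics: tile src of shape (sr, sc) into a dst of
// shape (dr, dc), requiring dr % sr == 0 and dc % sc == 0 (analogous to
// ggml's can-repeat precondition). Row-major storage.
std::vector<int16_t> repeat_2d(const std::vector<int16_t> & src,
                               int sr, int sc, int dr, int dc) {
    assert(dr % sr == 0 && dc % sc == 0);
    std::vector<int16_t> dst(dr * dc);
    for (int r = 0; r < dr; ++r) {
        for (int c = 0; c < dc; ++c) {
            // Wrap each index back into the source extents.
            dst[r * dc + c] = src[(r % sr) * sc + (c % sc)];
        }
    }
    return dst;
}
```

For example, repeating a 2x2 block into a 2x4 destination duplicates each row side by side. A WebGPU shader would express the same per-element index arithmetic, which is where native `i16` storage support becomes relevant.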