vulkan: fix SSM_CONV PP scaling with large ubatch sizes #20379

Merged
0cc4m merged 2 commits into ggml-org:master Mar 12, 2026
Conversation
Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org#18725 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All numbers are up. The big model especially sees a huge improvement with larger u-batch sizes. There is still a noticeable drop in performance after a certain u-batch size. master (e1a3999):
PR (209464006):
Bonus pp8192 run for the 122B model, since I didn't see the drop:
Collaborator
A nice boost on 5090:
Contributor Author
@lemmi Those 122B numbers are solid: ub2048 going from 211 to 252 and holding through pp8192 is exactly what we want. The remaining drop at very large ubatch is likely CONCAT or memory bandwidth, not SSM_CONV anymore. I could confirm with a perf-logger run if you're curious, but that's a separate issue. @jeffbolznv Nice to see it helps on the 5090 too. +5% on already-fast hardware from a dispatch change is free money.
0cc4m approved these changes Mar 12, 2026
Contributor
LGTM, good improvement, thank you.
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026

* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
am17an pushed a commit to am17an/llama.cpp that referenced this pull request Mar 12, 2026

* vulkan: optimize SSM_CONV workgroup dispatch for large ubatch

  Tile tokens into 2D workgroups (32x16) to reduce workgroup launch overhead at large ubatch sizes. Add vec4 fast path for nc=4 (common d_conv size). Fixes PP performance degradation with ubatch > 512. Ref: ggml-org#18725

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* vulkan: remove unused shared memory declaration in SSM_CONV

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
ProgenyAlpha added a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 13, 2026
The 2D tiling (32x16 workgroups) from ggml-org#20379 causes DeviceLost on multi-GPU RADV setups. Revert to 1D dispatch but keep the vec4 dot product fast path for nc=4. Fixes ggml-org#20462
Fixes #18725
The SSM_CONV shader dispatched one token per Y workgroup, each doing only nc (typically 4) multiply-adds. At ubatch=2048 this meant 2048 workgroups in Y with almost no work per launch, so workgroup dispatch overhead dominated.

Changes:

- Tile tokens into 2D workgroups: workgroup size goes from {32,1,1} to {32,16,1}
- vec4 dot product fast path for the common nc=4 (d_conv) case
- 45/45 SSM_CONV backend-ops tests passing
test-backend-ops perf (ne_a=[515,3328], nc=4):
Model bench (Qwen3-Coder-Next REAM Q4_K_M, pp2048, AMD 890M):
Master shows the #18725 pattern — PP drops from 171 at ub512 to 126 at ub2048. With this fix, PP peaks at ub1024 (181) and stays strong at ub2048 (162). The degradation cliff is gone.
Tested on AMD Radeon 890M (RDNA3.5, 8 CUs, Strix Point integrated). Would appreciate testing from @lemmi on the discrete 8060S where the original issue was reported.