vulkan: optimize operations in the IM2COL shader by daniandtheweb · Pull Request #22685 · ggml-org/llama.cpp

daniandtheweb · 2026-05-04T16:22:28Z

Overview

This optimizes the IM2COL shader by moving redundant operations out of the loops, similar to what I did in #11826.

Radeon RX 7800XT

Radeon RX 5700XT

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, Gemini used for planning the possible optimizations and reviewing the final code.

jeffbolznv · 2026-05-04T17:12:25Z

Perf on RTX 5090:

before
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              98280 runs -    10.25 us/run -    10244 kB/run -  952.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              27060 runs -    37.40 us/run -    40964 kB/run - 1044.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1352 runs -   766.24 us/run -   655364 kB/run -  817.25 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8528 runs -   121.87 us/run -   102445 kB/run -  801.93 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2132 runs -   475.39 us/run -   409645 kB/run -  822.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              47058 runs -    21.72 us/run -    23536 kB/run - 1033.39 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               8710 runs -   117.16 us/run -   100208 kB/run -  815.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      560 runs -  1795.14 us/run -  1678448 kB/run -  893.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3289 runs -   306.13 us/run -   235365 kB/run -  733.44 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      850 runs -  1216.93 us/run -  1002085 kB/run -  786.25 GB/s
  
after
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             114660 runs -     8.86 us/run -    10244 kB/run - 1103.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              28700 runs -    35.26 us/run -    40964 kB/run - 1108.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1352 runs -   755.95 us/run -   655364 kB/run -  828.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     9512 runs -   107.82 us/run -   102445 kB/run -  906.44 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2296 runs -   446.94 us/run -   409645 kB/run -  875.15 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              55614 runs -    18.24 us/run -    23536 kB/run - 1230.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               9045 runs -   114.88 us/run -   100208 kB/run -  831.96 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      620 runs -  1647.06 us/run -  1678448 kB/run -  973.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3718 runs -   279.38 us/run -   235365 kB/run -  803.67 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      918 runs -  1116.62 us/run -  1002085 kB/run -  856.88 GB/s
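As a rough way to read the two listings above, the sketch below divides each "before" time by the matching "after" time. The us/run figures are copied from the listings in the same order; the short case labels and the `speedups` helper are illustrative names and are not part of the PR.

```python
# (label, before us/run, after us/run), copied from the RTX 5090 listings above.
# Labels are shorthand for ne_input (first three dims) and the kernel size.
CASES = [
    ("32x32x256, 3x3",     10.25,    8.86),
    ("64x64x256, 3x3",     37.40,   35.26),
    ("256x256x256, 3x3",  766.24,  755.95),
    ("32x32x2560, 3x3",   121.87,  107.82),
    ("64x64x2560, 3x3",   475.39,  446.94),
    ("32x32x256, 5x5",     21.72,   18.24),
    ("64x64x256, 5x5",    117.16,  114.88),
    ("256x256x256, 5x5", 1795.14, 1647.06),
    ("32x32x2560, 5x5",   306.13,  279.38),
    ("64x64x2560, 5x5",  1216.93, 1116.62),
]


def speedups(cases):
    """Return (label, before / after) for each benchmark case."""
    return [(label, before / after) for label, before, after in cases]


if __name__ == "__main__":
    for label, ratio in speedups(CASES):
        print(f"{label:>18}: {ratio:.3f}x")
```

On these numbers every case gets faster, from roughly 1% (256x256x256 input, 3x3 kernel) up to roughly 19% (32x32x256 input, 5x5 kernel).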

jeffbolznv · 2026-05-04T17:19:48Z

+    const uint delta_ic  = BLOCK_SIZE / KHKW;
+    const uint delta_rem = BLOCK_SIZE % KHKW;
+    const uint delta_ky  = delta_rem / p.KW;
+    const uint delta_kx  = delta_rem % p.KW;


I'm not totally following this. In general it seems unsafe to precompute divs/mods and add them, as sometimes you would wrap to the next value and need to do a fixup. Maybe that's what the fixup logic is doing, but it's not clear.

I wonder if it might be better to pass KW as a spec constant and let the compiler transform it into something faster.
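For context on why a compile-time constant KW could help: when a divisor is known at compile time, compilers commonly replace the division with a multiply and a shift. The sketch below illustrates that idea in Python. `make_fast_divider` is a hypothetical helper, and whether a given Vulkan driver's compiler applies exactly this transformation to a spec constant is an assumption, not something stated in this thread.

```python
def make_fast_divider(d, bits=32):
    """Return a function computing n // d for 0 <= n < 2**bits.

    The divisor is "baked in" once as a fixed-point reciprocal, so each
    call needs only a multiply and a shift (no division).
    """
    if not 1 <= d < (1 << bits):
        raise ValueError("divisor must satisfy 1 <= d < 2**bits")
    shift = 2 * bits
    # ceil(2**shift / d): the rounding error is below d, which keeps the
    # result exact for every n below 2**bits.
    magic = ((1 << shift) + d - 1) // d

    def div(n):
        return (n * magic) >> shift

    return div


if __name__ == "__main__":
    div_by_3 = make_fast_divider(3)  # e.g. a kernel width of 3
    for n in (0, 1, 2, 3, 8, 9, 2**32 - 1):
        assert div_by_3(n) == n // 3
    print("multiply-and-shift matches // for the sampled values")
```

The remainder follows as `n - div(n) * d`, so a constant divisor removes both the division and the modulo from the inner loop in this model.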

kx_wrap and ky_wrap should take care of the wrapping. Moreover, it shouldn't be possible for the values to wrap around twice, because:

delta_kx comes from a modulo by p.KW, so it is always less than p.KW, and kx is always less than p.KW for the same reason. Their sum is therefore at most 2 * p.KW - 2, which is always less than 2 * p.KW.

delta_ky is at most p.KH - 1 and ky is at most p.KH - 1, so even with a carry of 1 from kx_wrap the sum is at most 2 * p.KH - 1, which is less than 2 * p.KH.
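The bound argued above can be checked with a small CPU-side model. This is a sketch in Python rather than the shader itself: `decompose` and `incremental_indices` are hypothetical names, `step` stands in for BLOCK_SIZE, and the update order (kx first, then ky with the kx carry, then ic with the ky carry) is an assumption based on the delta and wrap names used in this thread.

```python
def decompose(i, kh, kw):
    """Split a flat index into (ic, ky, kx) directly with div/mod."""
    ic, rem = divmod(i, kh * kw)
    ky, kx = divmod(rem, kw)
    return ic, ky, kx


def incremental_indices(start, step, count, kh, kw):
    """Yield (ic, ky, kx) for start, start + step, ... without per-step div/mod."""
    khkw = kh * kw
    # Loop-invariant deltas, computed once (mirrors the hunk under review).
    delta_ic = step // khkw
    delta_rem = step % khkw
    delta_ky = delta_rem // kw
    delta_kx = delta_rem % kw

    ic, ky, kx = decompose(start, kh, kw)
    out = []
    for _ in range(count):
        out.append((ic, ky, kx))
        # kx + delta_kx <= 2*kw - 2, so one conditional subtraction is enough.
        kx += delta_kx
        kx_wrap = 1 if kx >= kw else 0
        kx -= kx_wrap * kw
        # ky + delta_ky + kx_wrap <= 2*kh - 1, so again one subtraction.
        ky += delta_ky + kx_wrap
        ky_wrap = 1 if ky >= kh else 0
        ky -= ky_wrap * kh
        ic += delta_ic + ky_wrap
    return out


if __name__ == "__main__":
    kh, kw, step = 3, 3, 32  # 32 is only an example step, not the shader's value
    got = incremental_indices(5, step, 100, kh, kw)
    want = [decompose(5 + n * step, kh, kw) for n in range(100)]
    assert got == want
    print("incremental update matches direct div/mod")
```

Comparing against `decompose(start + n * step, kh, kw)` for every n is a cheap way to confirm that a single conditional subtraction per level is sufficient.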

I'm not that experienced with Vulkan shaders, so this is the most I could confidently achieve; it should (in theory) be mathematically correct. If there are better approaches, I'll gladly look into them.

Maybe it's better if I add some comments on the wrap values to make it clearer?

Comments would help, but I think spec constants would keep the code clearer. I'll defer to @0cc4m on what to do.

Got it, I'll start by adding some comments on the most confusing parts for now.

I think a spec constant version might be easier to read on the shader side, but would add complexity on the host side. Either is fine with me, but we already have this now, so I think it is okay to keep it.

@jeffbolznv Any concerns? Otherwise I'll merge it.

I'm OK with it.

0cc4m · 2026-05-17T08:21:21Z

+    const uint delta_ic  = BLOCK_SIZE / KHKW;
+    const uint delta_rem = BLOCK_SIZE % KHKW;
+    const uint delta_ky  = delta_rem / p.KW;
+    const uint delta_kx  = delta_rem % p.KW;


I think a spec constant version might be easier to read on the shader side, but would add complexity on the host side. Either is fine with me, but we already have this now, so I think it is okay to keep it.

JohnLoveJoy · 2026-05-20T13:07:46Z

It's incredible that there's still room for improvement at this level.

* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting

* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting (cherry picked from commit acd604fb277044e07c2bff01f4c169167b45f478)

vulkan: optimize operations in the IM2COL shader

c6860fd

daniandtheweb requested a review from a team as a code owner May 4, 2026 16:22

jeffbolznv reviewed May 4, 2026

github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 4, 2026

Add comments and improve the code formatting

52336b2

0cc4m approved these changes May 17, 2026

jeffbolznv approved these changes May 20, 2026

0cc4m merged commit acd604f into ggml-org:master May 20, 2026
43 of 44 checks passed

ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 20, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

f6ae681

dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

1d0340a

nyo16 mentioned this pull request May 21, 2026

Bump llama.cpp to 52fb93a2b (30 commits) nyo16/llama_cpp_ex#42

Merged

4 tasks

baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

f992987

srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

af9b4da

fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

9e1484c

turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

51a3542

0cc4m merged 2 commits into ggml-org:master from daniandtheweb:im2col

4 participants
