vulkan: optimize operations in the IM2COL shader#22685
Conversation
|
Perf on RTX 5090: |
| const uint delta_ic = BLOCK_SIZE / KHKW; | ||
| const uint delta_rem = BLOCK_SIZE % KHKW; | ||
| const uint delta_ky = delta_rem / p.KW; | ||
| const uint delta_kx = delta_rem % p.KW; |
There was a problem hiding this comment.
I'm not totally following this. In general it seems unsafe to precompute divs/mods and add them, as sometimes you would wrap to the next value and need to do a fixup. Maybe that's what the fixup logic is doing, but it's not clear.
I wonder if it might be better to pass KW as a spec constant and let the compiler transform it into something faster.
There was a problem hiding this comment.
kx_wrap and ky_wrap should take care of the wrapping. Moreover it shouldn't be possible for the values to wrap around twice as:
delta_kxby using a modulo is always less thanp.KW,kxis always less thanp.KWbecause of the modulo and the maximum value that it can reach is2 * p.KW - 2which is always less than2 * p.KW.delta_ky's max value isp.KH - 1so always less thanp.KH,kyis at mostp.KH - 1and the maximum value that it can reach is2 * p.KH - 1so less than2 * p.KH.
There was a problem hiding this comment.
I'm not that experienced on vulkan shaders so the most I could confidently achieve was this as it should (in theory) be mathematically correct. If there are better approaches I'll gladly look into them.
There was a problem hiding this comment.
Maybe it's better if I add some comments on the wrap values to make it more clear?
There was a problem hiding this comment.
Comments would help, but I think spec constants would keep the code more clear. I'll defer to @0cc4m on what to do.
There was a problem hiding this comment.
Got it, I'll start by adding some comments on the most confusing parts for now.
There was a problem hiding this comment.
I think a spec constant version might be easier to read on the shader side, but would add complexity on the host side. Either is fine with me, but we already have this now, so I think it is okay to keep it.
There was a problem hiding this comment.
@jeffbolznv Any concerns? Otherwise I'll merge it.
| const uint delta_ic = BLOCK_SIZE / KHKW; | ||
| const uint delta_rem = BLOCK_SIZE % KHKW; | ||
| const uint delta_ky = delta_rem / p.KW; | ||
| const uint delta_kx = delta_rem % p.KW; |
There was a problem hiding this comment.
I think a spec constant version might be easier to read on the shader side, but would add complexity on the host side. Either is fine with me, but we already have this now, so I think it is okay to keep it.
|
It's incredible that there's still room for improvement at this level. |
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting (cherry picked from commit acd604fb277044e07c2bff01f4c169167b45f478)
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* upstream/HEAD: (38 commits) vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410) doc: fix spec mtp typo (ggml-org#23435) ui: Improve Git Hooks for UI development (ggml-org#23403) ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306) llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131) hexagon: ssm-conv fix for large prompts (ggml-org#23307) app : show version (ggml-org#23426) mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329) ui: Add max image size option (ggml-org#22849) Move to backend sampling for MTP draft path (ggml-org#23287) opencl: refactor backend initilization (ggml-org#23318) common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386) mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (ggml-org#23345) vulkan: optimize operations in the IM2COL shader (ggml-org#22685) feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396) hexagon: HMX quantized matmul rework (ggml-org#23368) Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (ggml-org#22522) app : introduce the llama unified executable (ggml-org#23296) refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406) mtmd: fit_params now take into account mmproj (ggml-org#21489) ...
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting
* upstream/HEAD: mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329) ui: Add max image size option (ggml-org#22849) Move to backend sampling for MTP draft path (ggml-org#23287) opencl: refactor backend initilization (ggml-org#23318) common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386) mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (ggml-org#23345) vulkan: optimize operations in the IM2COL shader (ggml-org#22685) feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396) hexagon: HMX quantized matmul rework (ggml-org#23368) Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (ggml-org#22522) app : introduce the llama unified executable (ggml-org#23296) refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406) mtmd: fit_params now take into account mmproj (ggml-org#21489) docker : copy conversion files (ggml-org#23370) ui: Refactor `isMobile` as reactive value in `viewport` store (ggml-org#23330) fix: Div wrapper no pointer events on hidden (ggml-org#23390)
Overview
This optimizes the IM2COL shader by extracting redundant operations from the loops, similar to how I already did it in this: #11826.
Radeon RX 7800XTRadeon RX 5700XTRequirements