Ggml/cuda col2im 1d #24417
Open
ServeurpersoCom wants to merge 2 commits into
Overview
cuda: add GGML_OP_COL2IM_1D
CUDA backend follow-up to the CPU op (#24206), using the same formulation: a gather kernel with one thread per output element, each reading only the up to ceil(K/s0) columns that scatter into it. Supports F32 / F16 / BF16 with an F32 accumulator.
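For reviewers who want the gather formulation spelled out, here is a minimal host-side C++ sketch, not the kernel from this PR. It assumes a hypothetical layout `col[(c*K + k)*L + l]` and `out[c*T + t]` with `T = (L-1)*s0 + K - 2*p0`, no dilation, and F32 only; the struct and function names are illustrative. Each loop iteration in `col2im_1d_gather` stands in for one CUDA thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter pack: channels, kernel size, column count, stride, padding.
struct Col2Im1DParams { int C, K, L, s0, p0; };

inline int col2im_1d_out_len(const Col2Im1DParams& p) {
    return (p.L - 1) * p.s0 + p.K - 2 * p.p0;
}

// Scatter reference: every column entry adds into the output it maps to.
std::vector<float> col2im_1d_scatter(const std::vector<float>& col, const Col2Im1DParams& p) {
    assert(col.size() == (size_t)p.C * p.K * p.L);
    const int T = col2im_1d_out_len(p);
    std::vector<float> out((size_t)p.C * T, 0.0f);
    for (int c = 0; c < p.C; ++c)
        for (int k = 0; k < p.K; ++k)
            for (int l = 0; l < p.L; ++l) {
                const int t = l * p.s0 + k - p.p0;
                if (t >= 0 && t < T) out[(size_t)c * T + t] += col[((size_t)c * p.K + k) * p.L + l];
            }
    return out;
}

// Gather: one "thread" per output element, visiting at most ceil(K/s0) entries.
std::vector<float> col2im_1d_gather(const std::vector<float>& col, const Col2Im1DParams& p) {
    assert(col.size() == (size_t)p.C * p.K * p.L);
    const int T = col2im_1d_out_len(p);
    std::vector<float> out((size_t)p.C * T);
    for (int idx = 0; idx < p.C * T; ++idx) {
        const int c  = idx / T;   // flat idx -> (channel, time)
        const int t  = idx % T;
        const int tp = t + p.p0;  // position in the unpadded signal
        float acc = 0.0f;         // F32 accumulator
        // Only k congruent to tp modulo s0 can land on this output.
        for (int k = tp % p.s0; k < p.K; k += p.s0) {
            const int l = (tp - k) / p.s0;  // exact division by construction
            if (l < 0) break;               // larger k only makes l smaller
            if (l >= p.L) continue;         // column index past the input
            acc += col[((size_t)c * p.K + k) * p.L + l];
        }
        out[idx] = acc;
    }
    return out;
}
```

The gather form needs no atomics because each output is written by exactly one thread, which is the usual reason to prefer it over a scatter on a GPU.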
The flat idx -> (channel, time) decomposition uses fast_div_modulo, which buys back time on the cache-resident F32 / F16 shapes where the kernel is ALU-exposed; on the DRAM-bound long shape it is a no-op, as expected.
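As background on why a precomputed divisor helps, here is a host-side C++ sketch of the general multiply-and-shift technique for division by a runtime-invariant 32-bit divisor. It is not the source of ggml's `fast_div_modulo`; the names `FastDivU32`, `make_fastdiv_u32` and `fast_div_mod_u32` are hypothetical, and the sum is widened to 64 bits so the result is exact for every 32-bit numerator.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed on the host once per divisor (here: the output length T).
struct FastDivU32 { uint32_t mp; uint32_t shift; uint32_t d; };
struct DivMod     { uint32_t q;  uint32_t r; };

FastDivU32 make_fastdiv_u32(uint32_t d) {
    assert(d != 0);
    uint32_t shift = 0;  // shift = ceil(log2(d))
    while (shift < 32 && (uint32_t(1) << shift) < d) ++shift;
    // Magic multiplier: floor(2^32 * (2^shift - d) / d) + 1, which fits in 32 bits.
    const uint64_t mp = ((uint64_t(1) << 32) * ((uint64_t(1) << shift) - d)) / d + 1;
    return {uint32_t(mp), shift, d};
}

// One multiply-high, one add, one shift; on a GPU the multiply-high
// would typically map to the __umulhi intrinsic.
DivMod fast_div_mod_u32(uint32_t n, const FastDivU32& f) {
    const uint32_t hi = uint32_t((uint64_t(n) * f.mp) >> 32);
    const uint32_t q  = uint32_t((uint64_t(hi) + n) >> f.shift);
    return {q, n - q * f.d};  // remainder reuses the quotient
}
```

In a kernel of this shape, `q` would play the role of the channel index and `r` the time index, with the `FastDivU32` values passed in as kernel arguments.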
Additional information
Validated against the test-backend-ops grid merged with the CPU op, with zero additional test code: 33/33 on CUDA0 across the eight geometries and three types, plus the three perf entries. CMake globs the new .cu, so the only wiring is the dispatch case and the supports_op entry next to conv_transpose_1d.
Optimization (2nd commit):
Same fastdiv pattern as the Snake CUDA fusion (#22667), measured at around 10% faster on F32 and F16 on the cache-resident vocoder stage shapes vs a plain div + mod.
Requirements