ggml : add GGML_OP_COL2IM_1D (CPU + CUDA) #23424
Closed
ServeurpersoCom wants to merge 2 commits into
Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. The CPU path uses a gather formulation parallelized over output channels; the CUDA path mirrors it with one thread per output element. Both support F32, F16 and BF16 with an F32 accumulator.
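The gather formulation can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the code from this PR: the function name `col2im_1d_gather`, the nested-list layout `col[t_in][oc*K + k]` and the row order `oc*K + k` are all hypothetical choices made for the example. Each output element sums only the columns whose kernel window covers it, which is what makes the loop safe to parallelize over output channels or output elements without atomics.

```python
def col2im_1d_gather(col, K, OC, s0, p0):
    """Overlap-add by gathering: col[t_in][oc*K + k] -> out[oc][t].

    Layout and name are assumptions for illustration only.
    """
    T_in = len(col)
    T_out = (T_in - 1) * s0 + K - 2 * p0
    out = [[0.0] * T_out for _ in range(OC)]
    for oc in range(OC):              # independent per channel
        for t in range(T_out):        # independent per output element
            tp = t + p0               # position in the uncropped signal
            # Contributing columns satisfy t_in*s0 + k == tp with 0 <= k < K.
            t_lo = max(0, -((K - 1 - tp) // s0))   # ceil((tp - K + 1) / s0)
            t_hi = min(T_in - 1, tp // s0)
            acc = 0.0                 # plays the role of the F32 accumulator
            for t_in in range(t_lo, t_hi + 1):
                k = tp - t_in * s0
                acc += col[t_in][oc * K + k]
            out[oc][t] = acc
    return out


# kernel = 2*stride: every interior output receives exactly two contributions
ones = [[1.0] * 4 for _ in range(3)]
print(col2im_1d_gather(ones, K=4, OC=1, s0=2, p0=0))
# [[1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0]]
```

With `p0=1` the same input loses one sample at each end, giving an output of length 6.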
Add test_col2im_1d to test-backend-ops, registered next to the conv_transpose_1d cases since col2im is its overlap-add inverse. Each case builds a [K*OC, T_in] column matrix and runs ggml_col2im_1d, covering F32, F16 and BF16 across four geometries: the canonical kernel = 2*stride DAC decoder upsampling shape, a small overlapping case, a stride-1 case with no overlap, and a cropped case (p0 = 1). No ggml_set_param: the op has no backward, so gradient checks stay off. max_nmse_err relaxes to 5e-4 for F16 and BF16; F32 keeps the default.
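To give a feel for why a 5e-4 bound is comfortable for the half-precision types, here is a rough sketch of a normalized mean squared error check. It assumes the harness compares outputs by NMSE, as the name max_nmse_err suggests; the helpers `nmse`, `to_f16` and `to_bf16_trunc` are hypothetical, and the BF16 emulation truncates rather than rounds, so it overstates the real error.

```python
import struct


def nmse(ref, out):
    """Squared error normalized by the energy of the reference."""
    err = sum((r - o) ** 2 for r, o in zip(ref, out))
    norm = sum(r * r for r in ref)
    return err / norm


def to_f16(v):
    # Round-trip through IEEE half precision (struct format 'e').
    return struct.unpack("<e", struct.pack("<e", v))[0]


def to_bf16_trunc(v):
    # Keep the top 16 bits of the float32 encoding (truncation, an assumption).
    bits = struct.unpack("<I", struct.pack("<f", v))[0] & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


ref = [0.37 * i - 5.0 for i in range(1, 33)]
f16 = [to_f16(v) for v in ref]
bf16 = [to_bf16_trunc(v) for v in ref]

print(nmse(ref, f16) < 5e-4, nmse(ref, bf16) < 5e-4)
```

Both storage errors land well under the relaxed bound in this toy case, since the relative error of a single rounding is squared in the metric.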
Member
Let's split the CPU implementation into a separate PR and then work on the backend implementations in follow-ups.
Contributor
Author
Yes.
Overview
ggml : add GGML_OP_COL2IM_1D (CPU + CUDA)
Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake (#22667) is the periodic activation, and both sit on the hot path of every generated frame.
A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add:
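A minimal pure-Python sketch of that factorization, checked against the direct definition of a transposed convolution. The function names, the nested-list layouts `x[ic][t_in]` and `w[ic][oc][k]`, and the column row order `oc*K + k` are assumptions made for illustration; they are not the ggml tensor layout or API.

```python
def conv_transpose_1d_direct(x, w, s0, p0):
    """Reference: y[oc][t] += x[ic][t_in] * w[ic][oc][k], t = t_in*s0 + k - p0."""
    IC, T_in = len(x), len(x[0])
    OC, K = len(w[0]), len(w[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for ic in range(IC):
        for t_in in range(T_in):
            for oc in range(OC):
                for k in range(K):
                    t = t_in * s0 + k - p0
                    if 0 <= t < T_out:
                        y[oc][t] += x[ic][t_in] * w[ic][oc][k]
    return y


def conv_transpose_1d_gemm_col2im(x, w, s0, p0):
    IC, T_in = len(x), len(x[0])
    OC, K = len(w[0]), len(w[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    # Step 1 (GEMM): contract over IC only.
    # col[t_in][oc*K + k] = sum_ic w[ic][oc][k] * x[ic][t_in]
    col = [[sum(w[ic][oc][k] * x[ic][t_in] for ic in range(IC))
            for oc in range(OC) for k in range(K)]
           for t_in in range(T_in)]
    # Step 2 (col2im): overlap-add the columns into the output signal.
    y = [[0.0] * T_out for _ in range(OC)]
    for t_in in range(T_in):
        for oc in range(OC):
            for k in range(K):
                t = t_in * s0 + k - p0
                if 0 <= t < T_out:
                    y[oc][t] += col[t_in][oc * K + k]
    return y


# IC=2, T_in=3, OC=2, K=4; integer values keep the comparison exact.
x = [[1, 2, 3], [0, -1, 2]]
w = [[[1, 0, -1, 2], [2, 1, 0, -1]],
     [[0, 3, 1, 1], [-2, 1, 1, 0]]]
print(conv_transpose_1d_direct(x, w, 2, 0) == conv_transpose_1d_gemm_col2im(x, w, 2, 0))
```

The only arithmetic left in step 2 is the addition, so all multiply-accumulate work sits in the contraction of step 1.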
Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory-bound overlap-add: each output element reads at most ceil(K/s0) columns.
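The ceil(K/s0) bound can be checked by enumerating, for each output position, the input columns whose kernel window covers it. This is a small sketch; `contributing_columns` is a hypothetical helper, not part of ggml.

```python
import math


def contributing_columns(t, T_in, K, s0, p0):
    """Input columns t_in that reach output t, i.e. 0 <= (t + p0) - t_in*s0 < K."""
    tp = t + p0
    return [t_in for t_in in range(T_in) if 0 <= tp - t_in * s0 < K]


def max_reads(T_in, K, s0, p0):
    T_out = (T_in - 1) * s0 + K - 2 * p0
    return max(len(contributing_columns(t, T_in, K, s0, p0)) for t in range(T_out))


# Canonical upsampling shape: kernel = 2*stride, so two reads per output.
K, s0 = 4, 2
print(max_reads(T_in=8, K=K, s0=s0, p0=0), math.ceil(K / s0))
```

The read count per output depends only on the kernel size and stride, not on T_in or IC.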
The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter and rescans the full input per output element, F32 only on CUDA. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.
This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.
Additional information
GGML_OP_COL2IM_1D for CPU and CUDA, F32 / F16 / BF16 with an F32 accumulator. Backend coverage added in test-backend-ops: CUDA matches the CPU reference across all three types and several geometries, including the canonical kernel = 2*stride upsampling shape.
Requirements