ggml : add GGML_OP_COL2IM_1D (CPU + CUDA) #23424

Closed
ServeurpersoCom wants to merge 2 commits into ggml-org:master from ServeurpersoCom:ggml/cpu-cuda-col2im_1d

Conversation

@ServeurpersoCom

Contributor

Overview

ggml : add GGML_OP_COL2IM_1D (CPU + CUDA)

Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake ( #22667 ) is the periodic activation, and both sit on the hot path of every generated frame.

A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add:

    columns = mul_mat(weight[IC, K*OC], input[IC, T_in])  -> [K*OC, T_in]
    signal  = col2im_1d(columns)                          -> [T_out, OC]
    with T_out = (T_in - 1)*s0 + K - 2*p0

Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory-bound overlap-add: each output element reads at most ceil(K/s0) columns.
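To make the overlap-add concrete, here is a minimal F32 reference sketch of the gather form. It is not the code from this PR: the function names and the memory layouts (`col[t_in*(K*OC) + k*OC + oc]` and `dst[oc*T_out + t]`) are assumptions chosen for illustration, and the real op may order the `K*OC` axis differently.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Output length of the transposed convolution: T_out = (T_in - 1)*s0 + K - 2*p0.
static int64_t col2im_1d_out_len(int64_t T_in, int64_t K, int64_t s0, int64_t p0) {
    return (T_in - 1) * s0 + K - 2 * p0;
}

// Gather-style overlap-add (illustrative sketch, F32 only).
// ASSUMED layouts: col[t_in*(K*OC) + k*OC + oc], dst[oc*T_out + t].
static std::vector<float> col2im_1d_ref(const std::vector<float> & col,
                                        int64_t T_in, int64_t OC, int64_t K,
                                        int64_t s0, int64_t p0) {
    assert((int64_t) col.size() == K * OC * T_in);
    const int64_t T_out = col2im_1d_out_len(T_in, K, s0, p0);
    std::vector<float> dst(OC * T_out, 0.0f);
    for (int64_t oc = 0; oc < OC; ++oc) {
        for (int64_t t = 0; t < T_out; ++t) {
            const int64_t tp = t + p0;  // position in the uncropped signal
            float acc = 0.0f;
            // Only taps with k == tp (mod s0) can land on tp,
            // so this loop runs at most ceil(K/s0) times.
            for (int64_t k = tp % s0; k < K; k += s0) {
                const int64_t t_in = (tp - k) / s0;  // exact division by construction
                if (t_in < 0 || t_in >= T_in) {
                    continue;  // tap falls outside the input
                }
                acc += col[t_in * (K * OC) + k * OC + oc];
            }
            dst[oc * T_out + t] = acc;
        }
    }
    return dst;
}
```

With `T_in = 2`, `K = 2`, `s0 = 1`, `p0 = 0` and columns `{1, 2, 3, 4}`, the middle output sums the two overlapping taps, giving `{1, 5, 4}`.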

The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter, rescans the full input per output element, and is F32-only on CUDA. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.
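The equivalence of the two routes can be checked on a small shape. The sketch below is a standalone illustration, not ggml code: `Geom`, the function names, and the layouts (`x[t_in*IC + ic]`, `w[(k*OC + oc)*IC + ic]`, `y[oc*T_out + t]`) are all assumptions. It compares a direct transposed convolution against a plain GEMM followed by a scatter-style overlap-add, using small integer-valued data so the float sums are exact regardless of summation order.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Geom { int64_t IC, OC, K, T_in, s0, p0; };  // hypothetical helper struct

static int64_t out_len(const Geom & g) { return (g.T_in - 1) * g.s0 + g.K - 2 * g.p0; }

// Direct route: the IC contraction is folded into the scatter.
static std::vector<float> conv_transpose_1d_direct(const Geom & g,
        const std::vector<float> & w, const std::vector<float> & x) {
    const int64_t T_out = out_len(g);
    std::vector<float> y(g.OC * T_out, 0.0f);
    for (int64_t t_in = 0; t_in < g.T_in; ++t_in) {
        for (int64_t k = 0; k < g.K; ++k) {
            const int64_t t = t_in * g.s0 + k - g.p0;
            if (t < 0 || t >= T_out) continue;  // cropped by p0
            for (int64_t oc = 0; oc < g.OC; ++oc)
                for (int64_t ic = 0; ic < g.IC; ++ic)
                    y[oc * T_out + t] += w[(k * g.OC + oc) * g.IC + ic] * x[t_in * g.IC + ic];
        }
    }
    return y;
}

// Step 1: plain GEMM. col[t_in*(K*OC) + r] = dot(weight row r, input column t_in).
static std::vector<float> gemm_columns(const Geom & g,
        const std::vector<float> & w, const std::vector<float> & x) {
    const int64_t R = g.K * g.OC;
    std::vector<float> col(R * g.T_in, 0.0f);
    for (int64_t t_in = 0; t_in < g.T_in; ++t_in)
        for (int64_t r = 0; r < R; ++r) {
            float acc = 0.0f;
            for (int64_t ic = 0; ic < g.IC; ++ic) acc += w[r * g.IC + ic] * x[t_in * g.IC + ic];
            col[t_in * R + r] = acc;
        }
    return col;
}

// Step 2: overlap-add only, with no contraction over IC (scatter form).
static std::vector<float> overlap_add(const Geom & g, const std::vector<float> & col) {
    const int64_t T_out = out_len(g);
    std::vector<float> y(g.OC * T_out, 0.0f);
    for (int64_t t_in = 0; t_in < g.T_in; ++t_in)
        for (int64_t k = 0; k < g.K; ++k) {
            const int64_t t = t_in * g.s0 + k - g.p0;
            if (t < 0 || t >= T_out) continue;
            for (int64_t oc = 0; oc < g.OC; ++oc)
                y[oc * T_out + t] += col[t_in * (g.K * g.OC) + k * g.OC + oc];
        }
    return y;
}

// Deterministic integer-valued test data in [-2, 2].
static std::vector<float> fill(size_t n, uint32_t seed) {
    std::vector<float> v(n);
    for (auto & e : v) {
        seed = seed * 1664525u + 1013904223u;
        e = (float) ((seed >> 24) % 5) - 2.0f;
    }
    return v;
}

// Largest absolute difference between the two routes for one geometry.
static float equivalence_gap(const Geom & g) {
    const auto w = fill(g.K * g.OC * g.IC, 1u);
    const auto x = fill(g.IC * g.T_in, 2u);
    const auto a = conv_transpose_1d_direct(g, w, x);
    const auto b = overlap_add(g, gemm_columns(g, w, x));
    assert(a.size() == b.size());
    float gap = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) gap = std::fmax(gap, std::fabs(a[i] - b[i]));
    return gap;
}
```

Calling `equivalence_gap(Geom{4, 3, 4, 5, 2, 0})` exercises a kernel = 2*stride shape. This checks correctness only; it says nothing about the performance gap, which depends on real channel counts and sequence lengths.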

This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.

Additional information

GGML_OP_COL2IM_1D for CPU and CUDA, F32 / F16 / BF16 with an F32 accumulator. Backend coverage added in test-backend-ops: CUDA matches the CPU reference across all three types and several geometries, including the canonical kernel = 2*stride upsampling shape.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES Opus / MCP rootless container with Nvidia GPU

Add the overlap-add (scatter-add) step of a 1D transposed convolution.
A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight
pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input
with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d
scatters those columns back into the [T_out, OC] signal, with
T_out = (T_in - 1)*s0 + K - 2*p0.

Keeping the contraction as a plain mul_mat leaves the heavy work on the
optimized (and quantizable) matmul kernels, so col2im_1d only does the
cheap overlap-add.

CPU uses a gather formulation parallelized over output channels.
CUDA mirrors it one thread per output element. Both support F32, F16
and BF16 with an F32 accumulator.
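One way to parallelize the gather over output channels is a strided split, where worker `ith` of `nth` takes channels `ith, ith + nth, ...`. The sketch below assumes that scheme and the same illustrative layouts as before; `Col2imParams` and both function names are hypothetical, and the PR's actual partitioning may differ. Workers are run sequentially here as a stand-in for a thread pool.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical parameter bundle for this sketch (not a ggml struct).
struct Col2imParams { int64_t T_in, OC, K, s0, p0; };

// Worker `ith` of `nth` handles output channels ith, ith + nth, ...
// Each channel owns a disjoint row of dst, so workers need no synchronization.
// ASSUMED layouts: col[t_in*(K*OC) + k*OC + oc], dst[oc*T_out + t].
static void col2im_1d_worker(const Col2imParams & p, const float * col, float * dst,
                             int ith, int nth) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const int64_t T_out = (p.T_in - 1) * p.s0 + p.K - 2 * p.p0;
    for (int64_t oc = ith; oc < p.OC; oc += nth) {
        for (int64_t t = 0; t < T_out; ++t) {
            const int64_t tp = t + p.p0;
            float acc = 0.0f;  // F32 accumulator, whatever the storage type
            for (int64_t k = tp % p.s0; k < p.K; k += p.s0) {
                const int64_t t_in = (tp - k) / p.s0;
                if (t_in < 0 || t_in >= p.T_in) continue;
                acc += col[t_in * (p.K * p.OC) + k * p.OC + oc];
            }
            dst[oc * T_out + t] = acc;
        }
    }
}

// Sequential stand-in for a thread pool: runs every worker in turn.
static std::vector<float> col2im_1d_run(const Col2imParams & p,
                                        const std::vector<float> & col, int nth) {
    const int64_t T_out = (p.T_in - 1) * p.s0 + p.K - 2 * p.p0;
    std::vector<float> dst(p.OC * T_out, 0.0f);
    for (int ith = 0; ith < nth; ++ith) {
        col2im_1d_worker(p, col.data(), dst.data(), ith, nth);
    }
    return dst;
}
```

Because every output element is written by exactly one worker, the result is independent of `nth`. The one-thread-per-output-element CUDA variant described above follows the same gather, with the two outer loops replaced by the thread index.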

Add test_col2im_1d to test-backend-ops, registered next to the
conv_transpose_1d cases since col2im is its overlap-add inverse.

Each case builds a [K*OC, T_in] column matrix and runs ggml_col2im_1d,
covering F32, F16 and BF16 across four geometries: the canonical
kernel = 2*stride DAC decoder upsampling shape, a small overlapping
case, a stride 1 case with no overlap, and a cropped case (p0 = 1).

No ggml_set_param: the op has no backward, so gradient checks stay off.
max_nmse_err relaxes to 5e-4 for F16 and BF16, F32 keeps the default.
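For a sense of scale on the relaxed tolerance, the sketch below computes a normalized mean squared error and applies it to values rounded to BF16 precision. It assumes the usual definition, sum of squared differences divided by the sum of squared reference values; the helper names are hypothetical, and rounding a reference vector only approximates what a backend comparison measures.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Normalized mean squared error of `out` against the reference `ref`.
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / norm;
}

// Round an F32 value to BF16 precision (round-to-nearest-even) and widen it back.
// Intended for finite inputs only; NaN and infinity are not handled specially.
static float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);  // rounding bias, ties to even
    bits &= 0xFFFF0000u;                    // keep sign, exponent, top 7 mantissa bits
    std::memcpy(&x, &bits, sizeof x);
    return x;
}

// NMSE introduced by storing `ref` in BF16.
static double bf16_roundtrip_nmse(const std::vector<float> & ref) {
    std::vector<float> out(ref.size());
    for (size_t i = 0; i < ref.size(); ++i) {
        out[i] = round_to_bf16(ref[i]);
    }
    return nmse(ref, out);
}
```

BF16 keeps 8 significant bits, so each value carries a relative rounding error of at most about 2^-8, and the squared error stays well below 5e-4.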
@ServeurpersoCom requested review from a team and ggerganov as code owners on May 20, 2026 14:51
@github-actions (bot) added the labels testing, Nvidia GPU and ggml on May 20, 2026
@ggerganov

Member

Let's split the CPU implementation in a separate PR and then work on the backend implementations in follow-ups.

@ServeurpersoCom

Contributor Author

Yes.
I will split out the CPU version first and prove both mathematical equivalence and a real performance gain over the native path on actual vocoder geometries. I think tiny shapes would hide the cost of the naive per-output contraction; the gap only shows at real channel counts and sequence lengths.
Only then will I follow up with CUDA, validated against the CPU reference.

@ServeurpersoCom

ServeurpersoCom commented Jun 12, 2026

Contributor Author

Superseded by #24206 (CPU) and #24417 (CUDA).
