ggml : add GGML_OP_COL2IM_1D (CPU + CUDA) by ServeurpersoCom · Pull Request #23424 · ggml-org/llama.cpp

ServeurpersoCom · 2026-05-20T14:51:00Z

Overview

ggml : add GGML_OP_COL2IM_1D (CPU + CUDA)

Modern neural audio vocoders (the BigVGAN family and its descendants) build their generator from upsampling blocks: a transposed 1D convolution followed by an AMP / Snake stack. The transposed conv is the upsampler, Snake (#22667) is the periodic activation, and both sit on the hot path of every generated frame.

A ConvTranspose1d factorizes exactly as a GEMM followed by an overlap-add:

    columns = mul_mat(weight[IC, K*OC], input[IC, T_in])  -> [K*OC, T_in]
    signal  = col2im_1d(columns)                          -> [T_out, OC]
    with T_out = (T_in - 1)*s0 + K - 2*p0
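
The factorization above can be checked numerically with a minimal pure-Python sketch. It is not the ggml implementation: the function names are hypothetical, nested lists stand in for tensors (slowest dimension first, so `x[t][ic]` plays the role of the `[IC, T_in]` input), and the `k*OC + oc` row order inside the `K*OC` dimension is an assumption made only for this illustration.

```python
def conv_transpose_1d_direct(x, w, s0, p0):
    """Reference: scatter every input sample through the kernel.

    x[t][ic] is the input, w[ic][oc][k] the kernel, result is y[oc][o].
    """
    T_in, IC = len(x), len(x[0])
    OC, K = len(w[0]), len(w[0][0])
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for t in range(T_in):
        for ic in range(IC):
            for oc in range(OC):
                for k in range(K):
                    o = t * s0 + k - p0  # output position after cropping p0
                    if 0 <= o < T_out:
                        y[oc][o] += x[t][ic] * w[ic][oc][k]
    return y


def permute_weight(w):
    """Flatten w[ic][oc][k] into K*OC rows of length IC.

    Row order k*OC + oc is an assumption of this sketch.
    """
    IC, OC, K = len(w), len(w[0]), len(w[0][0])
    return [[w[ic][oc][k] for ic in range(IC)]
            for k in range(K) for oc in range(OC)]


def gemm(w2, x):
    """columns[t][row] = sum over ic of w2[row][ic] * x[t][ic]."""
    return [[sum(wr[ic] * xt[ic] for ic in range(len(xt))) for wr in w2]
            for xt in x]


def col2im_1d(cols, OC, K, s0, p0):
    """Overlap-add: scatter each column into the output signal."""
    T_in = len(cols)
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    for t in range(T_in):
        for k in range(K):
            o = t * s0 + k - p0
            if 0 <= o < T_out:
                for oc in range(OC):
                    y[oc][o] += cols[t][k * OC + oc]
    return y
```

Calling `col2im_1d(gemm(permute_weight(w), x), OC, K, s0, p0)` should reproduce `conv_transpose_1d_direct(x, w, s0, p0)` up to floating-point summation order, since the channel contraction and the overlap-add are two independent sums.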

Keeping the channel contraction inside ggml_mul_mat lets it ride the tuned (and quantizable) matmul kernels and tensor cores, leaving col2im_1d as a thin, memory-bound overlap-add: each output sample reads at most ceil(K/s0) columns.
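
The bound on columns read per output can be illustrated with a gather-style sketch in pure Python. This is an illustration under assumptions, not the kernel from this PR: the function name is hypothetical, `cols[t][row]` is a nested-list stand-in for the column matrix, and the `k*OC + oc` row order is assumed. For an output position `o`, only the columns `t` with `0 <= o + p0 - t*s0 < K` contribute.

```python
def col2im_1d_gather(cols, OC, K, s0, p0):
    """Gather formulation: each output sums only the columns that reach it.

    Returns the signal y[oc][o] and the largest number of columns read
    by any single output element.
    """
    T_in = len(cols)
    T_out = (T_in - 1) * s0 + K - 2 * p0
    y = [[0.0] * T_out for _ in range(OC)]
    max_reads = 0
    for oc in range(OC):          # output channels are independent of each other
        for o in range(T_out):
            j = o + p0            # position in the uncropped signal
            # smallest t with j - t*s0 <= K - 1, i.e. ceil((j - K + 1) / s0)
            t_lo = max(0, -((K - 1 - j) // s0))
            # largest t with j - t*s0 >= 0
            t_hi = min(T_in - 1, j // s0)
            acc = 0.0
            for t in range(t_lo, t_hi + 1):
                k = j - t * s0    # kernel tap that maps column t onto j
                acc += cols[t][k * OC + oc]
            y[oc][o] = acc
            max_reads = max(max_reads, t_hi - t_lo + 1)
    return y, max_reads
```

Because every output element is written exactly once, this form needs no atomics and the loops over `oc` and `o` can be split across threads freely. The window `[t_lo, t_hi]` covers the multiples of `s0` among `K` consecutive positions, so it never holds more than `ceil(K/s0)` columns.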

The existing ggml_conv_transpose_1d takes the naive route: a single direct kernel that folds the IC contraction into the scatter and rescans the full input per output element, F32 only on CUDA. For a vocoder generator running this op many times per second over long sequences, that is the bottleneck. The GEMM + col2im split removes it and unlocks F16 / BF16 / quantized weights.

This is the upsampling primitive used across three downstream GGML projects: acestep.cpp (music generation), omnivoice.cpp (multilingual TTS) and qwentts.cpp (Qwen3-TTS 12Hz DAC decoder). It pairs with GGML_OP_SNAKE (the other half of the AMP block), and as with Snake the implementation has been exercised and validated on all backends by others before upstreaming.

Additional information

Adds GGML_OP_COL2IM_1D for CPU and CUDA, supporting F32 / F16 / BF16 with an F32 accumulator. Backend coverage is added in test-backend-ops: CUDA matches the CPU reference across all three types and several geometries, including the canonical kernel = 2*stride upsampling shape.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES Opus / MCP rootless container with Nvidia GPU

Add the overlap-add (scatter-add) step of a 1D transposed convolution. A ConvTranspose1d factorizes as a GEMM followed by col2im: a weight pre-permuted to [IC, K*OC] is contracted against the [IC, T_in] input with mul_mat to produce a column matrix [K*OC, T_in], and col2im_1d scatters those columns back into the [T_out, OC] signal, with T_out = (T_in - 1)*s0 + K - 2*p0. Keeping the contraction as a plain mul_mat leaves the heavy work on the optimized (and quantizable) matmul kernels, so col2im_1d only does the cheap overlap-add. CPU uses a gather formulation parallelized over output channels. CUDA mirrors it with one thread per output element. Both support F32, F16 and BF16 with an F32 accumulator.

Add test_col2im_1d to test-backend-ops, registered next to the conv_transpose_1d cases since col2im is its overlap-add inverse. Each case builds a [K*OC, T_in] column matrix and runs ggml_col2im_1d, covering F32, F16 and BF16 across four geometries: the canonical kernel = 2*stride DAC decoder upsampling shape, a small overlapping case, a stride 1 case with no overlap, and a cropped case (p0 = 1). No ggml_set_param: the op has no backward, so gradient checks stay off. max_nmse_err relaxes to 5e-4 for F16 and BF16, F32 keeps the default.
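
As a rough illustration of the test parameters, the sketch below computes T_out for one example of each geometry class and shows why a 5e-4 NMSE bound is comfortable for half-precision outputs. The geometry values are hypothetical (the commit message does not list the actual shapes), the NMSE definition is assumed to be the usual sum of squared errors divided by the sum of squared reference values, and all function names are illustrative.

```python
import struct


def t_out(t_in, k, s0, p0):
    """Output length of the overlap-add, as given in the PR description."""
    return (t_in - 1) * s0 + k - 2 * p0


def nmse(ref, out):
    """Normalized mean squared error (assumed definition)."""
    err = sum((a - b) ** 2 for a, b in zip(ref, out))
    norm = sum(a * a for a in ref)
    return err / norm


def round_f16(v):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def f16_roundtrip_nmse(values):
    """NMSE introduced by storing the given values as F16."""
    return nmse(values, [round_f16(v) for v in values])


# Hypothetical (T_in, K, s0, p0) examples, one per geometry class.
GEOMETRIES = {
    "kernel = 2*stride": (8, 16, 8, 0),
    "small overlapping": (4, 3, 2, 0),
    "stride 1":          (5, 1, 1, 0),
    "cropped (p0 = 1)":  (4, 4, 2, 1),
}

for name, (t_in, k, s0, p0) in GEOMETRIES.items():
    print(name, "-> T_out =", t_out(t_in, k, s0, p0))
```

F16 keeps about 11 significant bits, so the relative rounding error per element stays near 2^-11 and its square is far below 5e-4. Differences in accumulation order between backends add to that, which is why a looser bound than the F32 default is reasonable for the reduced-precision types.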

ggerganov · 2026-06-05T05:50:58Z

Let's split the CPU implementation into a separate PR and then work on the backend implementations in follow-ups.

ServeurpersoCom · 2026-06-05T17:40:34Z

Yes.
I will split out the CPU version first and prove both mathematical equivalence and a real performance gain over the native path on actual vocoder geometries. I think tiny shapes would hide the cost of the naive per-output contraction; the gap only shows at real channel counts and sequence lengths.
Only then will I follow up with CUDA, validated against the CPU reference.

ServeurpersoCom · 2026-06-12T13:31:26Z

Superseded by #24206 (CPU) and #24417 (CUDA).

ServeurpersoCom added 2 commits May 20, 2026 16:24

ServeurpersoCom requested review from a team and ggerganov as code owners May 20, 2026 14:51

github-actions bot added the testing, Nvidia GPU and ggml labels May 20, 2026

ServeurpersoCom mentioned this pull request Jun 5, 2026 in "Ggml/cpu col2im 1d" #24206 (Merged)

ServeurpersoCom closed this Jun 12, 2026

ServeurpersoCom wants to merge 2 commits into ggml-org:master from ServeurpersoCom:ggml/cpu-cuda-col2im_1d