cuda: fuse snake activation (mul, sin, sqr, mul, add) by ServeurpersoCom · Pull Request #22667 · ggml-org/llama.cpp

ServeurpersoCom · 2026-05-04T06:42:45Z

Overview

Fuses snake activation y = x + sin(a*x)^2 * inv_b in the CUDA backend via a graph rewrite. The naive 5 op chain (mul, sin, sqr, mul, add) is matched in ggml_cuda_try_fuse and dispatched to a single elementwise kernel. No public op, no API change: frontends keep emitting the standard chain and pick up the fused path automatically.
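To make the rewrite concrete, here is a minimal host-side C++ sketch (not the CUDA code from this PR) that evaluates the formula both ways: as five separate elementwise passes with four temporaries, and as one fused pass. The function names, the operand order inside the chain, and the flat layout (T contiguous samples per channel, one `a` / `inv_b` value per channel) are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused reference: five elementwise passes and four temporaries.
// Assumed layout: index = c * T + t, with a and inv_b holding one value per channel c.
std::vector<float> snake_naive(const std::vector<float>& x,
                               const std::vector<float>& a,
                               const std::vector<float>& inv_b,
                               std::size_t T) {
    const std::size_t n = x.size();
    std::vector<float> t0(n), t1(n), t2(n), t3(n), y(n);
    for (std::size_t i = 0; i < n; ++i) t0[i] = x[i] * a[i / T];       // mul
    for (std::size_t i = 0; i < n; ++i) t1[i] = std::sin(t0[i]);       // sin
    for (std::size_t i = 0; i < n; ++i) t2[i] = t1[i] * t1[i];         // sqr
    for (std::size_t i = 0; i < n; ++i) t3[i] = t2[i] * inv_b[i / T];  // mul
    for (std::size_t i = 0; i < n; ++i) y[i]  = x[i] + t3[i];          // add
    return y;
}

// Fused form: one pass, no temporaries, y = x + sin(a*x)^2 * inv_b.
std::vector<float> snake_fused(const std::vector<float>& x,
                               const std::vector<float>& a,
                               const std::vector<float>& inv_b,
                               std::size_t T) {
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const std::size_t c = i / T;              // channel index
        const float s = std::sin(x[i] * a[c]);
        y[i] = x[i] + s * s * inv_b[c];
    }
    return y;
}
```

Both functions read `x` once per output element in the fused case versus five full passes over memory in the naive case, which is where an elementwise fusion of this kind would be expected to save time.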

The matcher uses ggml_can_fuse_subgraph so the rewrite only fires when the four intermediate nodes have no external consumers, and enforces the broadcast contract a / inv_b shaped as [1, C] over x [T, C].
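The conditions described above can be sketched on a toy graph. Everything in this block is hypothetical: `Node`, `Op`, `n_consumers`, and `match_snake` are simplified stand-ins, not the ggml structures or the code in ggml_cuda_try_fuse, and the real matcher relies on ggml_can_fuse_subgraph for the consumer check. The sketch also assumes a fixed operand order and does not handle commuted operands.

```cpp
#include <array>
#include <cstdint>
#include <initializer_list>

// Hypothetical, simplified graph node; not a ggml type.
enum class Op { Input, Mul, Sin, Sqr, Add };

struct Node {
    Op op = Op::Input;
    const Node* src0 = nullptr;
    const Node* src1 = nullptr;
    std::array<int64_t, 2> ne = {1, 1};  // ne[0] = T (contiguous), ne[1] = C
    int type = 0;                        // element type tag
    int n_consumers = 0;                 // number of nodes that read this result
};

// True when p is a per-channel parameter of shape [1, C] for x of shape [T, C].
bool is_channel_param(const Node* p, const Node* x) {
    return p->ne[0] == 1 && p->ne[1] == x->ne[1];
}

// Returns true when `add` terminates the chain
//   mul(x, a) -> sin -> sqr -> mul(., inv_b) -> add(x, .)
bool match_snake(const Node* add) {
    if (!add || add->op != Op::Add) return false;
    const Node* x    = add->src0;
    const Node* mul1 = add->src1;
    if (!x || !mul1 || mul1->op != Op::Mul) return false;
    const Node* sqr   = mul1->src0;
    const Node* inv_b = mul1->src1;
    if (!sqr || !inv_b || sqr->op != Op::Sqr) return false;
    const Node* sn = sqr->src0;
    if (!sn || sn->op != Op::Sin) return false;
    const Node* mul0 = sn->src0;
    if (!mul0 || mul0->op != Op::Mul) return false;
    const Node* a = mul0->src1;
    if (!a || mul0->src0 != x) return false;   // both ends must read the same x

    // The four intermediates must have no consumer outside the chain.
    for (const Node* n : {mul0, sn, sqr, mul1}) {
        if (n->n_consumers != 1) return false;
    }
    // Output type must match the input type.
    if (add->type != x->type) return false;
    // Broadcast contract: a and inv_b are [1, C] over x [T, C].
    return is_channel_param(a, x) && is_channel_param(inv_b, x);
}
```

Rejecting the match when any intermediate has a second consumer matters because the fused kernel never materializes those tensors, so another reader would see no data.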

Additional information

Used by acestep.cpp (https://github.com/ServeurpersoCom/acestep.cpp) on the SEANet decoder of the VAE, where this fusion alone delivers a 40% end-to-end speedup even though that path is otherwise dominated by transposed 1d convolution. The matcher recognizes the 5 op form found in koboldcpp's ace-step decoder (https://github.com/LostRuins/koboldcpp/blob/concedo/otherarch/acestep/vae.h). koboldcpp's qwen3-tts decoder (https://github.com/LostRuins/koboldcpp/blob/concedo/otherarch/qwen3tts/audio_tokenizer_decoder.cpp) uses a 12 op form with ggml_repeat broadcasts; it would pick up this fusion once a / inv_b are reshaped to [1, C].

Beyond music generation, this is the same activation introduced by Ziyin et al., NeurIPS 2020 (https://arxiv.org/abs/2006.08195) and adopted as the standard nonlinearity in the BigVGAN vocoder family (Lee et al., ICLR 2023, https://arxiv.org/abs/2206.04658), shared by Qwen3-TTS, Qwen3-Omni, OmniVoice and DAC.

CPU, Metal and Vulkan matchers follow the same pattern and land in separate PRs, alongside a GGML_OP_COL2IM_1D PR that targets the transposed 1d convolution path mentioned above.

Validation

test_snake_fuse builds the 5 op chain a frontend emits and compares the CPU naive path against the CUDA fused path via run_whole_graph(), so passing implies the rewrite preserves the math.

On RTX PRO 6000 Blackwell, SNAKE_FUSE passes 15/15 (F32 / F16 / BF16 x 5 shapes) and the full test-backend-ops suite passes 12210/12210, no regression. NMSE tolerance is 5e-3 for BF16 and 5e-5 for F16 to match the roundoff drift between the mixed-precision naive chain and the single F32 compute path of the fused kernel. F32 keeps the 1e-7 default.
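For readers unfamiliar with the metric, the following is a small C++ sketch of a normalized mean squared error and of the per-type tolerances quoted above. The definition used here (sum of squared differences divided by sum of squared reference values) is my understanding of the usual NMSE and should be treated as an assumption about what test-backend-ops computes; `DType` and `snake_max_nmse` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical type tag for this sketch.
enum class DType { F32, F16, BF16 };

// Normalized mean squared error of `out` against the reference `ref`.
// Assumes ref is not all zeros and both vectors have the same length.
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    double err = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(ref[i]) - double(out[i]);
        err  += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    return err / norm;
}

// Tolerances as stated in the PR description.
double snake_max_nmse(DType t) {
    switch (t) {
        case DType::BF16: return 5e-3;
        case DType::F16:  return 5e-5;
        default:          return 1e-7;  // F32 keeps the default
    }
}
```

A test in this style would accept the fused output when `nmse(cpu_result, cuda_result) <= snake_max_nmse(type)`. The looser BF16 and F16 bounds reflect that the naive chain rounds after every op while the fused path rounds once.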

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES Opus 4.7 + rootless pod

* cuda: fuse snake activation (mul, sin, sqr, mul, add)

  Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The matcher recognizes the naive 5 op decomposition emitted by audio decoders (BigVGAN, Vocos) for snake activation y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise kernel. Add test_snake_fuse comparing CPU naive vs CUDA fused across F32 / F16 / BF16.

* cuda: address review feedback from @am17an

  Use ggml_cuda_cast for F32/F16/BF16 conversions and rename kernel_snake to snake_kernel to match upstream conventions.

* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

* Update tests/test-backend-ops.cpp

  Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cuda: snake fusion check add->type matches x->type

  Address review feedback from @am17an

* cuda: snake fusion check add->type matches x->type

  Moved for readability (equivalent) Address review feedback from @am17an

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>


ServeurpersoCom requested review from a team and ggerganov as code owners May 4, 2026 06:42

am17an reviewed May 4, 2026


3 comment threads on ggml/src/ggml-cuda/snake.cu (outdated)

github-actions bot added the labels testing, Nvidia GPU, and ggml on May 4, 2026

cuda: address review feedback from @am17an

732c7a3

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename kernel_snake to snake_kernel to match upstream conventions.
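The pattern this commit refers to (storage type templated, arithmetic done in F32, one conversion on the way out) can be sketched on the host in C++. This is not the kernel from the PR: `bf16`, `to_f32`, `from_f32`, and `snake_host_sketch` are hypothetical stand-ins for the device types and for ggml_cuda_cast, and the sketch assumes `a` and `inv_b` share the element type of `x`.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Minimal BF16 emulation (hypothetical; stands in for a device bfloat16 type).
struct bf16 { uint16_t bits; };

inline float to_f32(float v) { return v; }
inline float to_f32(bf16 v) {
    const uint32_t u = uint32_t(v.bits) << 16;  // BF16 is the top half of an F32
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

template <typename T> T from_f32(float f);
template <> inline float from_f32<float>(float f) { return f; }
template <> inline bf16 from_f32<bf16>(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);            // round to nearest even (NaN not handled)
    return bf16{ uint16_t(u >> 16) };
}

// One templated body for every storage type; all math happens in F32.
template <typename T>
void snake_host_sketch(const T* x, const T* a, const T* inv_b, T* y,
                       std::size_t T_len, std::size_t C) {
    for (std::size_t i = 0; i < T_len * C; ++i) {
        const std::size_t c = i / T_len;        // channel of element i
        const float xi = to_f32(x[i]);
        const float s  = std::sin(xi * to_f32(a[c]));
        y[i] = from_f32<T>(xi + s * s * to_f32(inv_b[c]));  // single rounding on store
    }
}
```

Because the low-precision types round only once, at the store, the fused result can differ slightly from a chain that rounds after each of the five ops, which is consistent with the relaxed F16 / BF16 tolerances mentioned in the Validation section.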

ServeurpersoCom mentioned this pull request May 4, 2026

ggml: add GGML_OP_SNAKE for fused Snake activation #22613

Closed

cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

4f6c55a
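As background for this commit, "fastdiv" generally refers to replacing an integer division by a fixed divisor with a multiply and a shift using constants precomputed on the host. The C++ sketch below shows the well-known multiply-shift construction; it is not the ggml helper, the names `FastDiv`, `make_fastdiv`, and `fastdiv` are hypothetical, and the use for recovering the channel index from a flat element index is my assumption about why T_len is the divisor.

```cpp
#include <cstdint>

// Precomputed constants for dividing by a fixed d (hypothetical struct).
struct FastDiv {
    uint32_t mp;     // magic multiplier
    uint32_t shift;  // L = ceil(log2(d))
};

// Host-side setup, done once per divisor. Requires d >= 1.
inline FastDiv make_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) ++L;
    const uint64_t mp = ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d)) / d + 1;
    return { uint32_t(mp), L };
}

// n / d without a division instruction: high half of a 32x32 multiply,
// one add, one shift. The add is done in 64 bits to avoid overflow.
inline uint32_t fastdiv(uint32_t n, FastDiv fd) {
    const uint64_t hi = (uint64_t(n) * fd.mp) >> 32;
    return uint32_t((hi + n) >> fd.shift);
}

// Assumed use: split a flat index into (channel, time) for x laid out as [T, C].
inline void split_index(uint32_t i, uint32_t T_len, FastDiv fd,
                        uint32_t& c, uint32_t& t) {
    c = fastdiv(i, fd);
    t = i - c * T_len;
}
```

The per-element division by T_len is the only integer division an elementwise kernel with per-channel parameters would need, so precomputing these constants once per launch is a plausible reason for the change.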

am17an approved these changes May 4, 2026


Comment thread ggml/src/ggml-cuda/ggml-cuda.cu

Comment thread tests/test-backend-ops.cpp Outdated

ServeurpersoCom and others added 2 commits May 4, 2026 16:12

Update tests/test-backend-ops.cpp

34c2d95

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

cuda: snake fusion check add->type matches x->type

9c0d1ba

Address review feedback from @am17an

JohannesGaessler approved these changes May 4, 2026


cuda: snake fusion check add->type matches x->type

a79debb

Moved for readability (equivalent) Address review feedback from @am17an

am17an approved these changes May 8, 2026


am17an merged commit 58e68df into ggml-org:master May 8, 2026
77 of 79 checks passed

ServeurpersoCom mentioned this pull request May 8, 2026

vulkan: fuse snake activation (mul, sin, sqr, mul, add) #22855

Merged

ServeurpersoCom mentioned this pull request May 18, 2026

(Planning) Support audio output in mtmd #21956

Open

ServeurpersoCom mentioned this pull request May 20, 2026

ggml : add GGML_OP_COL2IM_1D (CPU + CUDA) #23424

Closed

This was referenced Jun 5, 2026

Ggml/cpu col2im 1d #24206

Merged

Ggml/cuda col2im 1d #24417

Open

am17an merged 6 commits into ggml-org:master from ServeurpersoCom:ggml/cuda-snake-fusion
