cuda: fuse snake activation (mul, sin, sqr, mul, add) #22667

Merged
am17an merged 6 commits into ggml-org:master from ServeurpersoCom:ggml/cuda-snake-fusion on May 8, 2026

Conversation

@ServeurpersoCom (Contributor)

Overview

Fuses the snake activation y = x + sin(a*x)^2 * inv_b in the CUDA backend via a graph rewrite. The naive 5-op chain (mul, sin, sqr, mul, add) is matched in ggml_cuda_try_fuse and dispatched to a single elementwise kernel. There is no new public op and no API change: frontends keep emitting the standard chain and pick up the fused path automatically.
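
As a rough CPU-side illustration of what the rewrite collapses (this is not the CUDA kernel from the PR), the sketch below evaluates the 5-op chain with one pass and one temporary per op, then the same expression in a single pass. The names snake_naive and snake_fused are hypothetical, and the memory layout is an assumption: x is stored contiguously along T, so the channel of flat index i is i / T.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive chain: five elementwise passes, each writing a full-size temporary.
// a and inv_b hold one value per channel (C entries) and broadcast over T.
std::vector<float> snake_naive(const std::vector<float>& x, const std::vector<float>& a,
                               const std::vector<float>& inv_b, size_t T) {
    assert(T > 0 && a.size() == inv_b.size() && x.size() == T * a.size());
    const size_t n = x.size();
    std::vector<float> t0(n), t1(n), t2(n), t3(n), y(n);
    for (size_t i = 0; i < n; ++i) t0[i] = x[i] * a[i / T];       // mul
    for (size_t i = 0; i < n; ++i) t1[i] = std::sin(t0[i]);       // sin
    for (size_t i = 0; i < n; ++i) t2[i] = t1[i] * t1[i];         // sqr
    for (size_t i = 0; i < n; ++i) t3[i] = t2[i] * inv_b[i / T];  // mul
    for (size_t i = 0; i < n; ++i) y[i]  = x[i] + t3[i];          // add
    return y;
}

// Fused form: one pass, one read of x and one write of y per element.
std::vector<float> snake_fused(const std::vector<float>& x, const std::vector<float>& a,
                               const std::vector<float>& inv_b, size_t T) {
    assert(T > 0 && a.size() == inv_b.size() && x.size() == T * a.size());
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        const size_t c = i / T;                 // channel index under the assumed layout
        const float  s = std::sin(a[c] * x[i]);
        y[i] = x[i] + s * s * inv_b[c];
    }
    return y;
}
```

The fused loop touches each element once instead of reading and writing four intermediate buffers, which is where an elementwise fusion of this kind typically saves memory traffic.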

The matcher uses ggml_can_fuse_subgraph, so the rewrite only fires when the four intermediate nodes have no external consumers. It also enforces the broadcast contract: a and inv_b are shaped [1, C] and broadcast over x of shape [T, C].
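
Both conditions can be pictured on a toy graph. This is a simplified sketch, not ggml_can_fuse_subgraph or the matcher in this PR: the Node struct, the Op enum and can_fuse_snake are hypothetical, shapes are reduced to two dimensions, and the operand order (x as the first source of the first mul and of the add) is assumed to be fixed.

```cpp
#include <cstdint>
#include <vector>

enum class Op { LEAF, MUL, SIN, SQR, ADD };

// Toy graph node (hypothetical): operands are indices into the node list.
struct Node {
    Op      op;
    int     src0;   // first operand, -1 if unused
    int     src1;   // second operand, -1 if unused
    int64_t ne[2];  // {ne0, ne1}; x is {T, C}
};

// True if g[start .. start+4] is mul, sin, sqr, mul, add wired as
// x + sin(a*x)^2 * inv_b, with a and inv_b shaped {1, C}, and none of the
// four intermediate results is read by a node outside the chain.
bool can_fuse_snake(const std::vector<Node>& g, int start) {
    const int n = (int) g.size();
    if (start < 0 || start + 5 > n) return false;

    const Op expected[5] = {Op::MUL, Op::SIN, Op::SQR, Op::MUL, Op::ADD};
    for (int i = 0; i < 5; ++i) {
        if (g[start + i].op != expected[i]) return false;
    }

    const Node& mul0 = g[start];
    const Node& sin0 = g[start + 1];
    const Node& sqr0 = g[start + 2];
    const Node& mul1 = g[start + 3];
    const Node& add0 = g[start + 4];

    // Each op must consume the previous op's output.
    if (sin0.src0 != start || sqr0.src0 != start + 1 || mul1.src0 != start + 2) return false;

    // The three inputs must be nodes that precede the chain.
    const int x = mul0.src0, a = mul0.src1, inv_b = mul1.src1;
    if (x < 0 || a < 0 || inv_b < 0 || x >= start || a >= start || inv_b >= start) return false;
    if (add0.src0 != x || add0.src1 != start + 3) return false;

    // Broadcast contract: a and inv_b are {1, C} over x {T, C}.
    const int64_t C = g[x].ne[1];
    if (g[a].ne[0] != 1 || g[a].ne[1] != C) return false;
    if (g[inv_b].ne[0] != 1 || g[inv_b].ne[1] != C) return false;

    // No node outside the chain may read an intermediate result.
    for (int j = 0; j < n; ++j) {
        if (j >= start && j < start + 5) continue;
        const int s0 = g[j].src0, s1 = g[j].src1;
        if ((s0 >= start && s0 < start + 4) || (s1 >= start && s1 < start + 4)) return false;
    }
    return true;
}
```

The external-consumer check matters because the fused kernel never materializes the intermediates; if another node still needs, say, the sin output, the chain has to run unfused.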

Additional information

Used by acestep.cpp (https://github.com/ServeurpersoCom/acestep.cpp) on the SEANet decoder of the VAE, where this fusion alone delivers a 40% end-to-end speedup even though that path is otherwise dominated by transposed 1d convolution. It matches the 5-op form found in koboldcpp's ace-step decoder (https://github.com/LostRuins/koboldcpp/blob/concedo/otherarch/acestep/vae.h). koboldcpp's qwen3-tts decoder (https://github.com/LostRuins/koboldcpp/blob/concedo/otherarch/qwen3tts/audio_tokenizer_decoder.cpp) uses a 12-op form with ggml_repeat broadcasts and would pick up this fusion once a and inv_b are reshaped to [1, C].

Beyond music generation, this is the same activation introduced by Ziyin et al., NeurIPS 2020 (https://arxiv.org/abs/2006.08195) and adopted as the standard nonlinearity in the BigVGAN vocoder family (Lee et al., ICLR 2023, https://arxiv.org/abs/2206.04658), shared by Qwen3-TTS, Qwen3-Omni, OmniVoice and DAC.

CPU, Metal and Vulkan matchers follow the same pattern and land in separate PRs, alongside a GGML_OP_COL2IM_1D PR that targets the transposed 1d convolution path mentioned above.

Validation

test_snake_fuse builds the 5 op chain a frontend emits and compares the CPU naive path against the CUDA fused path via run_whole_graph(), so passing implies the rewrite preserves the math.

On an RTX PRO 6000 Blackwell, SNAKE_FUSE passes 15/15 (F32 / F16 / BF16 x 5 shapes) and the full test-backend-ops suite passes 12210/12210 with no regressions. The NMSE tolerance is 5e-3 for BF16 and 5e-5 for F16, which accommodates the roundoff drift between the mixed-precision naive chain and the single F32 compute path of the fused kernel. F32 keeps the 1e-7 default.
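
The drift comes from where rounding happens. The sketch below is a CPU illustration, not code from the PR: round_bf16 emulates BF16 storage by rounding an F32 value to the nearest BF16-representable value, snake_naive_bf16 rounds after each of the five ops, snake_fused_bf16 computes in F32 and rounds once, and nmse uses the usual definition sum((ref - out)^2) / sum(ref^2). All four names are hypothetical, and the NMSE formula used by test-backend-ops is assumed to match this one.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Round an F32 value to the nearest BF16-representable value (ties to even).
// NaN handling is omitted for brevity.
float round_bf16(float v) {
    uint32_t u;
    std::memcpy(&u, &v, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);  // round the low 16 bits into the high 16
    u &= 0xFFFF0000u;                 // keep sign, exponent and 7 mantissa bits
    std::memcpy(&v, &u, sizeof v);
    return v;
}

// Naive chain: every op stores its result in BF16, so rounding happens five times.
float snake_naive_bf16(float x, float a, float inv_b) {
    const float t0 = round_bf16(x * a);
    const float t1 = round_bf16(std::sin(t0));
    const float t2 = round_bf16(t1 * t1);
    const float t3 = round_bf16(t2 * inv_b);
    return round_bf16(x + t3);
}

// Fused path: compute in F32 and round once when the result is stored.
float snake_fused_bf16(float x, float a, float inv_b) {
    const float s = std::sin(a * x);
    return round_bf16(x + s * s * inv_b);
}

// Normalized mean squared error, accumulated in double.
double nmse(const std::vector<float>& ref, const std::vector<float>& out) {
    double err = 0.0, norm = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) ref[i] - (double) out[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    if (norm == 0.0) {
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}
```

Feeding the same inputs through both functions and comparing with nmse gives a small but nonzero error, which is why a looser tolerance for the reduced-precision types is reasonable while F32 can keep the default.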

Requirements

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.
@ServeurpersoCom requested review from a team and ggerganov as code owners May 4, 2026 06:42
Comment thread ggml/src/ggml-cuda/snake.cu Outdated
Comment thread ggml/src/ggml-cuda/snake.cu Outdated
Comment thread ggml/src/ggml-cuda/snake.cu Outdated
@github-actions (bot) added the labels testing (everything test related), Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) May 4, 2026
Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.
Comment thread ggml/src/ggml-cuda/ggml-cuda.cu
Comment thread tests/test-backend-ops.cpp Outdated
ServeurpersoCom and others added 2 commits May 4, 2026 16:12
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Moved for readability (equivalent)
Address review feedback from @am17an
@am17an am17an merged commit 58e68df into ggml-org:master May 8, 2026
77 of 79 checks passed
cetarthoriphros pushed a commit to cetarthoriphros/llama.cpp that referenced this pull request May 9, 2026
* cuda: fuse snake activation (mul, sin, sqr, mul, add)

Add ggml_cuda_op_snake_fused with F32 / F16 / BF16 templates. The
matcher recognizes the naive 5 op decomposition emitted by audio
decoders (BigVGAN, Vocos) for snake activation
y = x + sin(a*x)^2 * inv_b and rewrites it to a single elementwise
kernel.

Add test_snake_fuse comparing CPU naive vs CUDA fused across
F32 / F16 / BF16.

* cuda: address review feedback from @am17an

Use ggml_cuda_cast for F32/F16/BF16 conversions and rename
kernel_snake to snake_kernel to match upstream conventions.

* cuda: snake fusion fastdiv on T_len, Suggested-by: @am17an

* Update tests/test-backend-ops.cpp

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cuda: snake fusion check add->type matches x->type

Address review feedback from @am17an

* cuda: snake fusion check add->type matches x->type

Moved for readability (equivalent)
Address review feedback from @am17an

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
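
The fastdiv commit above concerns index math: with x stored as [T, C], each element needs its channel c = i / T_len, and a per-element integer division is comparatively expensive on GPUs. A common replacement, sketched here in host C++ under the assumption that the kernel uses a scheme of this kind, precomputes a magic multiplier and a shift for the divisor so that each division becomes a multiply-high, an add and a shift. The names FastDiv, make_fastdiv and fastdiv_u32 are hypothetical and do not claim to mirror the helpers in ggml-cuda.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed constants for dividing 32-bit values by a fixed divisor d.
struct FastDiv {
    uint32_t mp;     // magic multiplier
    uint32_t shift;  // ceil(log2(d))
};

FastDiv make_fastdiv(uint32_t d) {
    assert(d != 0);
    uint32_t L = 0;
    while (L < 32 && (uint64_t{1} << L) < d) {
        ++L;  // smallest L with 2^L >= d
    }
    const uint32_t mp =
        (uint32_t) ((uint64_t{1} << 32) * ((uint64_t{1} << L) - d) / d + 1);
    return {mp, L};
}

// Computes n / d without a division instruction. The sum is done in 64 bits
// so that hi + n cannot overflow for any 32-bit n.
uint32_t fastdiv_u32(uint32_t n, FastDiv fd) {
    const uint64_t hi = ((uint64_t) n * fd.mp) >> 32;  // high half of n * mp
    return (uint32_t) ((hi + n) >> fd.shift);
}
```

In an elementwise kernel the constants would be computed once on the host for T_len and passed in, so that every thread derives its channel index with fastdiv_u32(i, fd) instead of i / T_len.
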
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 19, 2026
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
winstonma pushed a commit to winstonma/llama.cpp that referenced this pull request May 27, 2026
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
This was referenced Jun 5, 2026
Labels

ggml (changes relating to the ggml tensor library for machine learning), Nvidia GPU (issues specific to Nvidia GPUs), testing (everything test related)

3 participants