CUDA: General GEMV fusion by am17an · Pull Request #16715 · ggml-org/llama.cpp

am17an · 2025-10-22T07:59:08Z

This is a follow-up to #16630. This PR adds the ability to fuse the following common GEMV operations:

- GLU
- Bias + GLU
- Bias
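
To make the three fused operations concrete, here is a minimal CPU-side sketch of what "Bias + GLU" fusion means for a GEMV. It is not the CUDA code from this PR: the function names, the row-major layout, and the choice of SiLU as the GLU activation are assumptions made for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU, used here as the GLU activation (an assumption; other variants exist).
static float silu(float v) {
    return v / (1.0f + std::exp(-v));
}

// Dot product of row r of a row-major matrix W with the vector x.
static float row_dot(const std::vector<float> & W, const std::vector<float> & x, size_t r) {
    float sum = 0.0f;
    for (size_t c = 0; c < x.size(); ++c) {
        sum += W[r*x.size() + c] * x[c];
    }
    return sum;
}

// Unfused reference: every step writes a full intermediate vector to memory.
static std::vector<float> glu_unfused(const std::vector<float> & W_up,   const std::vector<float> & b_up,
                                      const std::vector<float> & W_gate, const std::vector<float> & b_gate,
                                      const std::vector<float> & x) {
    const size_t nrows = b_up.size();
    std::vector<float> up(nrows), gate(nrows), out(nrows);
    for (size_t r = 0; r < nrows; ++r) up[r]   = row_dot(W_up,   x, r); // GEMV 1
    for (size_t r = 0; r < nrows; ++r) gate[r] = row_dot(W_gate, x, r); // GEMV 2
    for (size_t r = 0; r < nrows; ++r) up[r]   += b_up[r];              // bias add 1
    for (size_t r = 0; r < nrows; ++r) gate[r] += b_gate[r];            // bias add 2
    for (size_t r = 0; r < nrows; ++r) out[r]  = silu(gate[r]) * up[r]; // GLU
    return out;
}

// Fused: one pass per output row; the intermediates never leave local variables.
static std::vector<float> glu_fused(const std::vector<float> & W_up,   const std::vector<float> & b_up,
                                    const std::vector<float> & W_gate, const std::vector<float> & b_gate,
                                    const std::vector<float> & x) {
    const size_t nrows = b_up.size();
    std::vector<float> out(nrows);
    for (size_t r = 0; r < nrows; ++r) {
        const float up   = row_dot(W_up,   x, r) + b_up[r];
        const float gate = row_dot(W_gate, x, r) + b_gate[r];
        out[r] = silu(gate) * up;
    }
    return out;
}
```

Both functions compute the same values; the fused one simply avoids the round trips through intermediate vectors, which is where a memory-bound workload would be expected to gain.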

It uses a template bool to determine whether we are in the fusion path, then does runtime checks for which fusion path to take. This PR also splits up mmvq (by type) and mmvf (by ncols-dst), as their compile times were becoming large after this change. This change helps TG (which is IO bound) for almost all classes of models. Apart from adding tests to test-backend-ops, I also spot-checked perplexity on a couple of models, and it is unchanged by this change.
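
The "template bool plus runtime checks" pattern described above can be sketched on the CPU as follows. This is only an illustration under assumptions: `fusion_args`, `gemv_rows`, and `gemv` are hypothetical names, and the real kernels operate on CUDA threads and quantized types rather than a plain float loop.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical bundle of optional fusion inputs; a null pointer means "absent".
struct fusion_args {
    const float * bias      = nullptr; // added to the main dot product
    const float * gate      = nullptr; // second weight matrix, same shape as W
    const float * gate_bias = nullptr; // added to the gate dot product
};

static float silu(float v) {
    return v / (1.0f + std::exp(-v));
}

// has_fusion is a compile-time switch: the <false> instantiation contains no
// fusion code at all. Which fusion is applied is decided at runtime.
template <bool has_fusion>
static void gemv_rows(const float * W, const float * x, float * dst,
                      size_t nrows, size_t ncols, const fusion_args & f) {
    bool use_bias = false, use_gate = false, use_gate_bias = false;
    if constexpr (has_fusion) {
        use_bias      = f.bias != nullptr;
        use_gate      = f.gate != nullptr;
        use_gate_bias = use_gate && f.gate_bias != nullptr;
    }
    for (size_t r = 0; r < nrows; ++r) {
        float sum      = 0.0f;
        float gate_sum = 0.0f;
        for (size_t c = 0; c < ncols; ++c) {
            sum += W[r*ncols + c] * x[c];
            if constexpr (has_fusion) {
                if (use_gate) gate_sum += f.gate[r*ncols + c] * x[c];
            }
        }
        if constexpr (has_fusion) {
            if (use_bias)      sum      += f.bias[r];
            if (use_gate_bias) gate_sum += f.gate_bias[r];
            if (use_gate)      sum      *= silu(gate_sum); // GLU with SiLU (assumed variant)
        }
        dst[r] = sum;
    }
}

// Host-side dispatch: pick the instantiation once, outside the hot loop.
static void gemv(const float * W, const float * x, float * dst,
                 size_t nrows, size_t ncols, const fusion_args & f = fusion_args()) {
    if (f.bias != nullptr || f.gate != nullptr) {
        gemv_rows<true>(W, x, dst, nrows, ncols, f);
    } else {
        gemv_rows<false>(W, x, dst, nrows, ncols, f);
    }
}
```

One plausible reason for this split is that it needs only two instantiations per kernel configuration, instead of one per combination of bias and gate.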

Tested on 6x 4090

Model	Test	t/s master	t/s cuda_fuse_gate	Speedup
gpt-oss 120B MXFP4 MoE	tg32	118.72	125.14	1.05
gpt-oss 120B MXFP4 MoE	tg64	116.91	123.09	1.05
gpt-oss 120B MXFP4 MoE	tg128	115.72	121.74	1.05
gpt-oss 20B MXFP4 MoE	tg32	171.60	180.07	1.05
gpt-oss 20B MXFP4 MoE	tg64	169.46	177.63	1.05
gpt-oss 20B MXFP4 MoE	tg128	167.58	175.59	1.05
qwen3moe 30B.A3B Q4_0	tg32	154.72	162.06	1.05
qwen3moe 30B.A3B Q4_0	tg64	151.37	158.40	1.05
qwen3moe 30B.A3B Q4_0	tg128	149.25	156.00	1.05
qwen3 0.6B F16	tg32	310.61	333.92	1.08
qwen3 0.6B F16	tg64	306.26	325.99	1.06
qwen3 0.6B F16	tg128	303.14	322.62	1.06
glm4moe 106B.A12B IQ4_XS - 4.25 bpw	tg32	68.99	72.30	1.05
glm4moe 106B.A12B IQ4_XS - 4.25 bpw	tg64	68.24	71.44	1.05
glm4moe 106B.A12B IQ4_XS - 4.25 bpw	tg128	67.53	70.71	1.05
llama 8B Q4_0	tg32	133.00	137.42	1.03
llama 8B Q4_0	tg64	131.89	136.47	1.03
llama 8B Q4_0	tg128	130.78	135.35	1.03
gemma 7B Q4_0	tg32	123.23	126.88	1.03
gemma 7B Q4_0	tg64	122.28	125.76	1.03
gemma 7B Q4_0	tg128	121.45	124.74	1.03

am17an · 2025-10-22T08:04:47Z

@ggerganov after #16649 and this PR, tg for gpt-oss models should increase by ~9-10%

ORippler · 2025-10-22T08:31:39Z

Curious, but how much does this increase the binary size for the CUDA backend?

am17an · 2025-10-22T08:47:15Z

> Curious, but how much does this increase the binary size for the CUDA backend?

It increases by about 20% (from 30M to 36M on my machine)

JohannesGaessler

I'll do performance testing on either Friday or Saturday when (hopefully) I'll finally be able to get the RTX 5090 that NVIDIA sent me to work.

ggml/src/ggml-cuda/ggml-cuda.cu

JohannesGaessler · 2025-10-23T11:06:54Z

Regarding binary size: when I compile the CUDA backend with GGML_NATIVE=OFF the size of libggml-cuda.so increases from 106 MiB to 145 MiB. This seems disproportionate to the amount of added template instances. Did you check for register spilling as ncols increases? That would result in disproportionate compilation times and binary sizes and the performance would be bad anyways.

In any case, for MMVF we can shave off a bit independently of this PR by only compiling it for cases not covered by MMF.

am17an · 2025-10-23T11:13:41Z

Since the main use is ncols=1, I am also okay with doing fusion only for that case.

JohannesGaessler · 2025-10-23T11:17:32Z

I think that would also be fine. Matrix multiplications with small batch sizes > 1 are relevant for batched inference throughput and speculative decoding, but we can always revisit those cases later.

am17an · 2025-10-23T16:46:09Z

Simplified the code to fuse only on ncols_dst = 1. Binary size and compilation time should now be mostly unaffected by this change.
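
A rough sketch of why restricting fusion to ncols_dst = 1 keeps the instance count down: the fused variant is instantiated only once, and every other column count falls back to the unfused variant. The names `launch_variant` and `dispatch`, the returned ids, and the limit of four column counts are hypothetical stand-ins, not the PR's actual launch code.

```cpp
// Stand-in for a kernel launch: returns an id identifying the chosen variant
// (ncols_dst * 10, plus 1 if the fused variant was selected).
template <int ncols_dst, bool has_fusion>
static int launch_variant() {
    static_assert(!has_fusion || ncols_dst == 1,
                  "the fused variant is only instantiated for ncols_dst == 1");
    return ncols_dst * 10 + (has_fusion ? 1 : 0);
}

// Hypothetical dispatch: fusion is honored only when ncols_dst == 1.
// Returns -1 for column counts this sketch does not cover.
static int dispatch(int ncols_dst, bool want_fusion) {
    switch (ncols_dst) {
        case 1:  return want_fusion ? launch_variant<1, true>() : launch_variant<1, false>();
        case 2:  return launch_variant<2, false>();
        case 3:  return launch_variant<3, false>();
        case 4:  return launch_variant<4, false>();
        default: return -1;
    }
}
```

With N supported column counts this compiles N + 1 variants rather than 2N. In this sketch a caller that requested fusion for ncols_dst > 1 would need to run the bias and GLU steps separately.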

ggml/src/ggml-cuda/mmvf.cu

ggml/src/ggml-cuda/mmvq.cu

JohannesGaessler · 2025-10-26T09:43:40Z

When I tested performance:

GPU	Model	Microbatch size	Test	t/s `5cca254`	t/s `65a098f`	Speedup
MI50	gpt-oss 20B MXFP4 MoE	512	tg128	102.93	119.86	1.16
MI50	llama 1B BF16	512	tg128	153.86	148.99	0.97
MI50	llama 1B F16	512	tg128	152.85	149.61	0.98
MI50	llama 1B Q4_0	512	tg128	298.27	326.99	1.10
MI50	llama 1B all F32	512	tg128	94.44	97.73	1.03
MI50	llama 8B Q4_0	512	tg128	84.99	91.74	1.08
P40	gemma3 4B Q4_0	512	tg128	73.25	74.34	1.01
P40	gpt-oss 20B MXFP4 MoE	512	tg128	72.42	62.37	0.86
P40	llama 1B BF16	512	tg128	110.78	110.91	1.00
P40	llama 1B F16	512	tg128	110.45	109.99	1.00
P40	llama 1B Q4_0	512	tg128	217.79	229.01	1.05
P40	llama 1B all F32	512	tg128	59.47	59.61	1.00
P40	llama 8B F16	512	tg128	19.70	19.71	1.00
P40	llama 8B Q4_0	512	tg128	54.59	51.99	0.95
P40	qwen3 0.6B Q4_0	512	tg128	207.40	215.85	1.04
P40	qwen3moe 30B.A3B Q4_0	512	tg128	64.97	66.54	1.02
RX 6800	gpt-oss 20B MXFP4 MoE	512	tg128	82.41	90.99	1.10
RX 6800	llama 1B BF16	512	tg128	97.26	104.51	1.07
RX 6800	llama 1B F16	512	tg128	97.41	104.21	1.07
RX 6800	llama 1B Q4_0	512	tg128	218.19	236.98	1.09
RX 6800	llama 1B all F32	512	tg128	79.92	82.37	1.03
RX 6800	llama 8B Q4_0	512	tg128	67.14	70.79	1.05
RX 9060 XT	gpt-oss 20B MXFP4 MoE	512	tg128	73.86	80.03	1.08
RX 9060 XT	llama 1B BF16	512	tg128	89.00	94.90	1.07
RX 9060 XT	llama 1B F16	512	tg128	90.06	94.72	1.05
RX 9060 XT	llama 1B Q4_0	512	tg128	183.53	195.18	1.06
RX 9060 XT	llama 1B all F32	512	tg128	57.42	58.57	1.02
RX 9060 XT	llama 8B Q4_0	512	tg128	52.11	54.06	1.04
RTX 3090	gpt-oss 20B MXFP4 MoE	512	tg128	187.76	191.65	1.02
RTX 3090	llama 1B BF16	512	tg128	273.53	276.84	1.01
RTX 3090	llama 1B F16	512	tg128	273.68	277.20	1.01
RTX 3090	llama 1B Q4_0	512	tg128	526.23	561.10	1.07
RTX 3090	llama 1B all F32	512	tg128	153.19	155.44	1.01
RTX 3090	llama 8B Q4_0	512	tg128	142.84	146.64	1.03
RTX 4090	gpt-oss 20B MXFP4 MoE	512	tg128	232.12	245.32	1.06
RTX 4090	llama 1B BF16	512	tg128	317.22	324.40	1.02
RTX 4090	llama 1B F16	512	tg128	317.76	325.11	1.02
RTX 4090	llama 1B Q4_0	512	tg128	690.64	723.50	1.05
RTX 4090	llama 1B all F32	512	tg128	174.58	176.87	1.01
RTX 4090	llama 8B Q4_0	512	tg128	170.52	175.33	1.03

On the P40 the fused MMVQ kernel does not seem to be consistently faster so I would suggest enabling fusion of that kernel only for Volta and newer.
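
The suggested gate could look roughly like the sketch below. The constant names, the helper `should_fuse_mmvq`, and the encoding of compute capability as 100*major + 10*minor are assumptions for illustration; the sketch covers NVIDIA compute capabilities only and does not model the AMD devices in the table.

```cpp
// Hypothetical constants: compute capability encoded as 100*major + 10*minor.
constexpr int CC_PASCAL = 600; // e.g. the P40 is compute capability 6.1 -> 610
constexpr int CC_VOLTA  = 700;

// Enable the fused MMVQ path only on Volta and newer; older devices keep the
// unfused kernels.
static bool should_fuse_mmvq(int cc) {
    return cc >= CC_VOLTA;
}
```

Under this encoding a P40 (610) would take the unfused path, while an RTX 3090 (860) or RTX 4090 (890) would take the fused one.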

am17an · 2025-10-26T10:19:24Z

Thanks for testing!

TinyServal · 2025-10-28T12:12:47Z

This might need to be disabled for compute capability 8.7 specifically, in addition to Pascal and older devices. Right now I'm seeing a 10% performance loss on a Jetson AGX Orin. Benchmark results: #16815

…is resolved revert ggml-org#16715 (+2 squashed commit) Squashed commit: [289af2ee2] Revert "Hide latency of bias and gate-loading (ggml-org#16847)" This reverts commit 8b11dee. [a3e5c1e95] Revert "CUDA: add unused vars to mmvf and mmvq (ggml-org#16807)" This reverts commit 463bbf2.

am17an requested review from CISC, JohannesGaessler and slaren as code owners October 22, 2025 07:59

github-actions bot added labels Oct 22, 2025: testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), python (python script changes), ggml (changes relating to the ggml tensor library for machine learning)

am17an force-pushed the cuda_fuse_gate_bias branch from c0a69df to 22ee634 Compare October 22, 2025 08:04

JohannesGaessler reviewed Oct 23, 2025

am17an force-pushed the cuda_fuse_gate_bias branch 4 times, most recently from a6e0d34 to 9b95697 Compare October 23, 2025 16:41

am17an requested a review from JohannesGaessler October 23, 2025 16:57

JohannesGaessler reviewed Oct 24, 2025

am17an added 7 commits October 25, 2025 12:41

CUDA: fuse ffn gate for mmvf

26fa8d0

fix hip build

8366599

fix musa build

bf349cb

only fuse ncols_dst=1

010a23a

add missing header

e212c85

check fusion=false for ncols_dst!=1

d67fcb8

add back comments

65a098f

am17an force-pushed the cuda_fuse_gate_bias branch from 6614a9b to 65a098f Compare October 25, 2025 04:49

am17an requested a review from JohannesGaessler October 26, 2025 07:17

don't use mmvq in pascal and lower

975ef38

JohannesGaessler approved these changes Oct 26, 2025

am17an merged commit f77c13b into ggml-org:master Oct 26, 2025
72 checks passed

am17an deleted the cuda_fuse_gate_bias branch October 26, 2025 11:28

TinyServal mentioned this pull request Oct 28, 2025

CUDA Performance Regression on Jetson AGX Orin #16815

Closed

This was referenced Oct 28, 2025

CUDA: Fix bug in topk-moe for gpt-oss #16821

Merged

Eval bug: When offloading to CPU after f77c13b commit using CUDA (MultiGPU), PP performance seems to be reduced by ~75% (CUDA: General GEMV fusion) #16912

Closed

radiskulldevildoll mentioned this pull request Nov 2, 2025

Eval bug: ROCm illegal memory access with -sm row #16799

Closed

Green-Sky mentioned this pull request Nov 6, 2025

Heads up - upstream GGML produces black images on CUDA leejet/stable-diffusion.cpp#945

Closed

LostRuins added a commit to LostRuins/koboldcpp that referenced this pull request Nov 8, 2025

Revert "Kcpp triage for rowsplit: revert ggml-org#16715 until ggml-or…

7e787c2

…g#16799 is resolved" This reverts commit 3aec5ed.

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

CUDA: General GEMV fusion (ggml-org#16715)

c7a9907

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

CUDA: General GEMV fusion (#16715)

c2cd971
