
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full#19042

Merged
ggerganov merged 4 commits into ggml-org:master from gaugarg-nv:pp_perf_improve
Jan 27, 2026

Conversation

@gaugarg-nv
Contributor

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. As a result, not enough work is submitted to the GPU, leaving bubbles in the GPU timeline. This PR fixes that by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.
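The mechanism can be sketched as follows (a minimal standalone sketch, not the exact PR code; the helper name is made up). The key point is to set the variable before the CUDA driver reads it, without clobbering a value the user already exported:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Minimal sketch (hypothetical helper name): raise the CUDA launch-queue
// scale before the first CUDA call. The variable is read by the CUDA
// driver when the context is created, so this must run early.
static void scale_cuda_launch_queues() {
#ifdef _WIN32
    // _putenv_s always overwrites, so check for an existing value first.
    if (std::getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
        _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
    }
#else
    // The final 0 tells setenv not to overwrite an existing value.
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0);
#endif
}
```

A user who exports CUDA_SCALE_LAUNCH_QUEUES themselves keeps their own setting, since neither branch overwrites an existing value.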

The NSight profile below shows the issue in more detail:

[NSight profile screenshot]

After setting the environment variable, this is how the new profile looks:

[NSight profile screenshot]

GPU 0 is busy for the most part, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes more time than batch n due to causal attention. That's why GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This could be fixed by setting a non-uniform tensor split between the GPUs.
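A toy model of that imbalance (illustrative only; the cost function and all numbers are assumptions, not llama.cpp code): with causal attention, batch b attends to all tokens of the b earlier batches, so later batches cost more, and balancing two pipeline stages that work on adjacent batches gives stage 0 slightly less than half the layers:

```cpp
#include <cassert>
#include <cmath>

// Toy per-layer cost of processing batch index b of size `ubatch` with
// causal attention: attention work grows with the tokens already in
// context, plus a constant `mlp` term for the non-attention work.
static double batch_cost(int b, int ubatch, double mlp) {
    return ubatch * (b * (double) ubatch + ubatch / 2.0) + mlp;
}

// Fraction of layers for stage 0 so that stage 0 (on batch b+1) and
// stage 1 (on batch b) finish together: f0 * c(b+1) = (1 - f0) * c(b).
static double stage0_fraction(int b, int ubatch, double mlp) {
    const double c0 = batch_cost(b + 1, ubatch, mlp);
    const double c1 = batch_cost(b,     ubatch, mlp);
    return c1 / (c0 + c1);
}
```

Since c(b+1) > c(b), the balanced fraction for stage 0 is always below 0.5, which is the intuition behind a non-uniform tensor split.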

Performance Gains

  • Significant perf improvement of ~25% in PP throughput for larger models with pipeline parallelism.
  • Smaller but measurable perf improvement on a single GPU for larger models.
  • The smaller the GPU and the larger the model, the greater the expected benefit from this environment variable.
  • No change in performance for smaller models.
  • No change in decode-phase throughput.
  • No change in VRAM usage.
  • ~120 MB higher system RAM usage per GPU; for two GPUs, system RAM usage increases by ~240 MB.

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs.

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 1682.79 | 1678.15 | 1.00 |
| 1024 | 1884.01 | 2064.24 | 1.10 |
| 2048 | 1948.14 | 2289.02 | 1.17 |
| 4096 | 1841.07 | 2266.42 | 1.23 |
| 8192 | 1563.33 | 1959.12 | 1.25 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 11467.79 | 11597.3 | 1.01 |
| 1024 | 14371.86 | 14381.97 | 1.00 |
| 2048 | 15551.36 | 15537.17 | 1.00 |
| 4096 | 14545.61 | 14522.35 | 1.00 |
| 8192 | 11896.39 | 11874.36 | 1.00 |

Single GPU: RTX Pro 6000 Blackwell

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 1656.24 | 1686.89 | 1.02 |
| 1024 | 1567.71 | 1597.92 | 1.02 |
| 2048 | 1455.61 | 1484.2 | 1.02 |
| 4096 | 1235.89 | 1314.54 | 1.06 |
| 8192 | 953.8 | 976.03 | 1.02 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 12109.8 | 12031.88 | 0.99 |
| 1024 | 11426.8 | 11426.9 | 1.00 |
| 2048 | 10589.84 | 10594.17 | 1.00 |
| 4096 | 8902.05 | 8909.34 | 1.00 |
| 8192 | 6874.66 | 6872.47 | 1.00 |

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jan 23, 2026

@JohannesGaessler JohannesGaessler left a comment


Rather than a message to the warning log I think it's more appropriate to explain the usage of the variable and how we're changing the default in the documentation.

#ifdef _WIN32
    _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
#else
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
#endif

Suggested change:
- #endif
+ #endif // _WIN32

@ggerganov
Member

> Rather than a message to the warning log I think it's more appropriate to explain the usage of the variable and how we're changing the default in the documentation.

Agreed, this would be the better way to do it. Let's remove the warning logs.

@am17an
Contributor

am17an commented Jan 24, 2026

Maybe it should also be a part of ggml_backend_cuda_get_features

@gaugarg-nv
Contributor Author

@JohannesGaessler @ggerganov Thanks for the feedback. Removed the warning log and updated the documentation.

@am17an my understanding of ggml_backend_cuda_get_features is that it reports features that can be enabled/disabled based on compile-time flags or features that depend on underlying GPU architecture.

Given that we are changing the default value of the env variable irrespective of underlying GPU or compile time flags, do you still want to add it in ggml_backend_cuda_get_features?

@am17an
Contributor

am17an commented Jan 26, 2026

I think it will be useful for debugging issues in case they come up. Currently there is nowhere in the logs where this is printed, and it would be good to have it somewhere, if not here.

@gaugarg-nv
Contributor Author

> I think it will be useful for debugging issues in case they come up. Currently there is nowhere in the logs where this is printed, and it would be good to have it somewhere, if not here.

How about adding an info or debug log at the point where I set the env variable? This is similar to the warning log I had earlier, but with the log type changed to debug/info.
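A runnable sketch of that suggestion (the logging macro is a printf stand-in so the sketch compiles on its own; in ggml it would be something like GGML_LOG_DEBUG or GGML_LOG_INFO, and the function name here is hypothetical):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-in for ggml's logging macro, so this sketch is self-contained.
#define LOG_DEBUG(...) std::fprintf(stderr, __VA_ARGS__)

// Returns true if we applied our default and logged it, false if the user
// had already set the variable (then we leave it untouched and stay quiet).
static bool default_cuda_launch_queues() {
    if (std::getenv("CUDA_SCALE_LAUNCH_QUEUES") != nullptr) {
        return false;
    }
#ifdef _WIN32
    _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
#else
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0);
#endif
    LOG_DEBUG("%s: defaulting CUDA_SCALE_LAUNCH_QUEUES to 4x\n", __func__);
    return true;
}
```

Logging only when the default is actually applied keeps the output quiet for users who manage the variable themselves, while still leaving a trace for debugging.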

@am17an
Contributor

am17an commented Jan 26, 2026

I don't see a problem in adding it to ggml_backend_cuda_get_features tbh; its definition can be expanded to include flags we set inside the binary (which is a pretty rare case).

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 26, 2026

@JohannesGaessler JohannesGaessler left a comment


I think this PR would be fine as-is; reporting the launch queue size in the properties would be nice to have, but in my opinion it is not critical, since it (to my knowledge) does not affect correctness.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jan 26, 2026
@gaugarg-nv
Contributor Author

Hi @ggerganov, the CI failures seem to be unrelated to my change. Could you please take a look? Thanks.

@ggerganov ggerganov merged commit a83c73a into ggml-org:master Jan 27, 2026
142 of 149 checks passed
@ggerganov
Member

Regarding the log - I'm not sure what the best way is. On one hand, we don't want to print this on every run, as this information does not seem that important. But it's also nice to be aware that we are hijacking an environment variable.

ggml_backend_cuda_get_features doesn't seem to fit very well for this information either.

Hopefully we will find some better solution in the future and not need to modify the environment variable.

gaugarg-nv added a commit to gaugarg-nv/llama.cpp that referenced this pull request Jan 31, 2026
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (ggml-org#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
@gaugarg-nv gaugarg-nv mentioned this pull request Jan 31, 2026
ggerganov pushed a commit that referenced this pull request Feb 3, 2026
…19227)

Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.