
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full#19042

Merged
ggerganov merged 4 commits into ggml-org:master from gaugarg-nv:pp_perf_improve
Jan 27, 2026

Conversation

@gaugarg-nv
Contributor

With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. As a result, not enough work is submitted to the GPU, leaving bubbles in the GPU timeline. This PR fixes that by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.
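The mechanism can be sketched as follows (a minimal standalone sketch, not the exact PR code; the helper name is made up). The key point is to set the variable before the CUDA driver reads it, without clobbering a value the user already exported:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Minimal sketch (hypothetical helper name): raise the CUDA launch-queue
// scale before the first CUDA call. The variable is read by the CUDA
// driver when the context is created, so this must run early.
static void scale_cuda_launch_queues() {
#ifdef _WIN32
    // _putenv_s always overwrites, so check for an existing value first.
    if (std::getenv("CUDA_SCALE_LAUNCH_QUEUES") == nullptr) {
        _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
    }
#else
    // The final 0 tells setenv not to overwrite an existing value.
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0);
#endif
}
```

A user who exports CUDA_SCALE_LAUNCH_QUEUES themselves keeps their own setting, since neither branch overwrites an existing value.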

The NSight profile below shows the issue in more detail:

[NSight profile screenshot]

After setting the environment variable, this is how the new profile looks:

[NSight profile screenshot]

GPU 0 is busy for the most part, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes more time than batch n due to causal attention. That's why GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This could be fixed by setting a non-uniform tensor split between the GPUs.
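A toy model of that imbalance (illustrative only; the cost function and all numbers are assumptions, not llama.cpp code): with causal attention, batch b attends to all tokens of the b earlier batches, so later batches cost more, and balancing two pipeline stages that work on adjacent batches gives stage 0 slightly less than half the layers:

```cpp
#include <cassert>
#include <cmath>

// Toy per-layer cost of processing batch index b of size `ubatch` with
// causal attention: attention work grows with the tokens already in
// context, plus a constant `mlp` term for the non-attention work.
static double batch_cost(int b, int ubatch, double mlp) {
    return ubatch * (b * (double) ubatch + ubatch / 2.0) + mlp;
}

// Fraction of layers for stage 0 so that stage 0 (on batch b+1) and
// stage 1 (on batch b) finish together: f0 * c(b+1) = (1 - f0) * c(b).
static double stage0_fraction(int b, int ubatch, double mlp) {
    const double c0 = batch_cost(b + 1, ubatch, mlp);
    const double c1 = batch_cost(b,     ubatch, mlp);
    return c1 / (c0 + c1);
}
```

Since c(b+1) > c(b), the balanced fraction for stage 0 is always below 0.5, which is the intuition behind a non-uniform tensor split.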

Performance Gains

  • Significant perf improvement of ~25% in PP throughput for larger models with pipeline parallelism.
  • Smaller but measurable perf improvement on a single GPU for larger models.
  • The smaller the GPU and the larger the model, the greater the expected benefit from this environment variable.
  • No change in performance for smaller models.
  • No change in decode-phase throughput.
  • No change in VRAM usage.
  • ~120 MB higher system RAM usage per GPU; for two GPUs, system RAM usage increases by ~240 MB.

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs.

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 1682.79 | 1678.15 | 1.00 |
| 1024 | 1884.01 | 2064.24 | 1.10 |
| 2048 | 1948.14 | 2289.02 | 1.17 |
| 4096 | 1841.07 | 2266.42 | 1.23 |
| 8192 | 1563.33 | 1959.12 | 1.25 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 11467.79 | 11597.3 | 1.01 |
| 1024 | 14371.86 | 14381.97 | 1.00 |
| 2048 | 15551.36 | 15537.17 | 1.00 |
| 4096 | 14545.61 | 14522.35 | 1.00 |
| 8192 | 11896.39 | 11874.36 | 1.00 |

Single GPU: RTX Pro 6000 Blackwell

Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 1656.24 | 1686.89 | 1.02 |
| 1024 | 1567.71 | 1597.92 | 1.02 |
| 2048 | 1455.61 | 1484.2 | 1.02 |
| 4096 | 1235.89 | 1314.54 | 1.06 |
| 8192 | 953.8 | 976.03 | 1.02 |

Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

| ISL | Env disabled (t/s) | Env enabled (t/s) | Speed-up |
| ---: | ---: | ---: | ---: |
| 512 | 12109.8 | 12031.88 | 0.99 |
| 1024 | 11426.8 | 11426.9 | 1.00 |
| 2048 | 10589.84 | 10594.17 | 1.00 |
| 4096 | 8902.05 | 8909.34 | 1.00 |
| 8192 | 6874.66 | 6872.47 | 1.00 |

@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Jan 23, 2026

@JohannesGaessler JohannesGaessler left a comment


Rather than a message to the warning log I think it's more appropriate to explain the usage of the variable and how we're changing the default in the documentation.

#ifdef _WIN32
    _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
#else
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
#endif

Suggested change:
- #endif
+ #endif // _WIN32

@ggerganov
Member

> Rather than a message to the warning log I think it's more appropriate to explain the usage of the variable and how we're changing the default in the documentation.

Agreed, this would be the better way to do it. Let's remove the warning logs.

@am17an
Contributor

am17an commented Jan 24, 2026

Maybe it should also be a part of ggml_backend_cuda_get_features

@gaugarg-nv
Contributor Author

@JohannesGaessler @ggerganov Thanks for the feedback. Removed the warning log and updated the documentation.

@am17an my understanding of ggml_backend_cuda_get_features is that it reports features that can be enabled/disabled based on compile-time flags or features that depend on underlying GPU architecture.

Given that we are changing the default value of the env variable irrespective of underlying GPU or compile time flags, do you still want to add it in ggml_backend_cuda_get_features?

@am17an
Contributor

am17an commented Jan 26, 2026

I think it will be useful for debugging issues in case they come up. Currently there is nowhere in the logs where this is printed, and it would be good to have it somewhere, if not here.

@gaugarg-nv
Contributor Author

> I think it will be useful for debugging issues in case they come up. Currently there is nowhere in the logs where this is printed, and it would be good to have it somewhere, if not here.

How about adding an info or debug log at the point where I set the env variable? This is similar to the warning log I had earlier, but with the log type changed to debug/info.
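A runnable sketch of that suggestion (the logging macro is a printf stand-in so the sketch compiles on its own; in ggml it would be something like GGML_LOG_DEBUG or GGML_LOG_INFO, and the function name here is hypothetical):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Stand-in for ggml's logging macro, so this sketch is self-contained.
#define LOG_DEBUG(...) std::fprintf(stderr, __VA_ARGS__)

// Returns true if we applied our default and logged it, false if the user
// had already set the variable (then we leave it untouched and stay quiet).
static bool default_cuda_launch_queues() {
    if (std::getenv("CUDA_SCALE_LAUNCH_QUEUES") != nullptr) {
        return false;
    }
#ifdef _WIN32
    _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
#else
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0);
#endif
    LOG_DEBUG("%s: defaulting CUDA_SCALE_LAUNCH_QUEUES to 4x\n", __func__);
    return true;
}
```

Logging only when the default is actually applied keeps the output quiet for users who manage the variable themselves, while still leaving a trace for debugging.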

@am17an
Contributor

am17an commented Jan 26, 2026

I don't see a problem in adding it to ggml_backend_cuda_get_features tbh; its definition can be expanded to include flags we set inside the binary (which is a pretty rare case).

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 26, 2026

@JohannesGaessler JohannesGaessler left a comment


I think this PR would be fine as-is; reporting the launch queue size in the properties would be nice to have, but in my opinion it is not critical, since it (to my knowledge) does not affect correctness.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Jan 26, 2026
@gaugarg-nv
Contributor Author

Hi @ggerganov, the CI failures seem to be unrelated to my change. Could you please take a look? Thanks.

@ggerganov ggerganov merged commit a83c73a into ggml-org:master Jan 27, 2026
142 of 149 checks passed
@ggerganov
Member

Regarding the log - I'm not sure what the best way is. On one hand, we don't want to print this on every run, as this information does not seem that important. But it's also nice to be aware that we are hijacking an environment variable.

ggml_backend_cuda_get_features doesn't seem to fit very well for this information either.

Hopefully we will find some better solution in the future and not need to modify the environment variable.

gaugarg-nv added a commit to gaugarg-nv/llama.cpp that referenced this pull request Jan 31, 2026
Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (ggml-org#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
@gaugarg-nv gaugarg-nv mentioned this pull request Jan 31, 2026
ggerganov pushed a commit that referenced this pull request Feb 3, 2026
…19227)

Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.