[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full #19042
ggerganov merged 4 commits into ggml-org:master
Conversation
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work gets submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.
JohannesGaessler
left a comment
Rather than a message to the warning log I think it's more appropriate to explain the usage of the variable and how we're changing the default in the documentation.
ggml/src/ggml-cuda/ggml-cuda.cu
```cpp
#ifdef _WIN32
    _putenv_s("CUDA_SCALE_LAUNCH_QUEUES", "4x");
#else
    setenv("CUDA_SCALE_LAUNCH_QUEUES", "4x", 0); // don't overwrite if already set
#endif
```
Suggested change:
```diff
-#endif
+#endif // _WIN32
```
Agreed, this would be the better way to do it. Let's remove the warning logs.
Maybe it should also be a part of …
@JohannesGaessler @ggerganov Thanks for the feedback. Removed the warning log and updated the documentation. @am17an my understanding of … Given that we are changing the default value of the env variable irrespective of the underlying GPU or compile-time flags, do you still want to add it in …
I think it will be useful for debugging issues in case they come up. Currently there is nowhere in the logs where this is printed, and it would be good to have it somewhere, if not here.
How about adding an info or debug log where I set the env variable? This is similar to the warning log I had earlier, but with the log type changed to debug/info.
I don't see a problem in adding it into …
JohannesGaessler
left a comment
There was a problem hiding this comment.
I think this PR would be fine as-is; reporting the launch queue size in the properties would be nice to have, but in my opinion not critical, since it (to my knowledge) does not affect correctness.
Hi @ggerganov, the CI failures seem to be unrelated to my change. Could you please take a look? Thanks.
Regarding the log - I'm not sure what the best way is. On one hand, we don't want to print this on every run, as the information does not seem that important. But it's also nice to be aware that we are hijacking an environment variable. The … Hopefully we will find a better solution in the future and will not need to modify the environment variable.
Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x was set. The previous PR (ggml-org#19042) was reverted, and the documentation was updated to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (ggml-org#19042)

* [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full

  With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size.
* Set the env variable in the CUDA backend registry allocation
* Add link to PR in code comment
* Remove warning logs and update documentation
With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer fills up, stalling the CPU. As a result, not enough work gets submitted to the GPU, resulting in bubbles in the GPU timeline. This PR fixes the issue by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x, which increases the command buffer size.
The Nsight Systems profile below shows the issue in more detail:
After setting the environment variable, this is how the new profile looks:
GPU 0 is busy for the most part, but there are small bubbles on GPU 1. I think the reason is that, for a constant batch size, batch n+1 takes more time than batch n due to causal attention, so GPU 0 working on batch n+1 has more to do than GPU 1 working on batch n. This can be fixed by setting a non-uniform tensor split between the GPUs.
Performance Gains

Pipeline parallelism with 2x RTX Pro 6000 Blackwell GPUs:
* Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
* Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

Single GPU: RTX Pro 6000 Blackwell:
* Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf
* Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf