CUDA Copy Kernel for Contiguous Tensors for GGML CPY OP #16471
anavp-nvidia wants to merge 3 commits into ggml-org:master
Conversation
ggml/src/ggml-cuda/cpy.cu (outdated)

```cpp
const int elements_per_thread = 4;
const int threads_needed = (ne_elements + elements_per_thread - 1) / elements_per_thread;
const int num_blocks = max(1, min(65535, (threads_needed + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE));
```
I think this won't work if ne_elements is larger than a certain amount (I think 16726016). We could add an assert here, or check whether the num_blocks limit can be higher than 65535.
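One hedged alternative that sidesteps the limit entirely (an illustrative sketch, not the reviewer's required fix) is a grid-stride loop, which stays correct no matter how far num_blocks is clamped:

```cpp
static __global__ void cpy_flt_contiguous(const float * __restrict__ cx,
                                          float * __restrict__ cdst,
                                          const int64_t ne_elements) {
    // Grid-stride loop: each thread walks the array in steps of the total
    // thread count, so any num_blocks value still covers all elements.
    const int64_t stride = (int64_t)gridDim.x * blockDim.x;
    for (int64_t i = (int64_t)blockIdx.x*blockDim.x + threadIdx.x; i < ne_elements; i += stride) {
        cdst[i] = cx[i];
    }
}
```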
ggml/src/ggml-cuda/cpy.cu (outdated)

```cpp
const int64_t remaining = ne_elements - base_idx;

if (remaining >= elements_per_thread) {
    if (base_idx % 4 == 0) {
```
From what I understand, base_idx is always a multiple of elements_per_thread (which is 4), so this check is not necessary?
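For reference, the presumed indexing that makes the check redundant (a hypothetical reconstruction of the kernel's prologue, not the PR's exact code):

```cpp
const int elements_per_thread = 4;
const int64_t tid      = (int64_t)blockIdx.x*blockDim.x + threadIdx.x;
const int64_t base_idx = tid * elements_per_thread; // a multiple of 4 by construction,
                                                    // so base_idx % 4 == 0 always holds
```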
ggml/src/ggml-cuda/cpy.cu (outdated)

```cpp
T * cdst = (cdst_indirect != nullptr) ? cdst_indirect[graph_cpynode_index] : cdst_direct;

const int elements_per_thread = 4;
```
This is declared both while launching the kernel and here; perhaps make it a constant like CUDA_CPY_BLOCK_SIZE.
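For instance (CUDA_CPY_ELEMENTS_PER_THREAD is a hypothetical name; CUDA_CPY_BLOCK_SIZE is the existing constant, its value shown only for illustration):

```cpp
#define CUDA_CPY_BLOCK_SIZE          64 // existing constant in cpy.cuh
#define CUDA_CPY_ELEMENTS_PER_THREAD  4 // hypothetical: shared by host launch code and kernel
```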
ggml/src/ggml-cuda/cpy.cu (outdated)

```cpp
        }
    }
} else {
    for (int j = 0; j < remaining; ++j) {
```
Since remaining is < 4 here, we could do an unroll like below, but I doubt it will have any effect on performance:

```cpp
#pragma unroll
for (int j = 0; j < 4; ++j) {
    size_t i = base + (size_t)j;
    if (i < ne_elements) cdst[i] = cx[i];
}
```
JohannesGaessler left a comment
As it is, this kernel does not properly check for memory alignment. When you copy a float4 this is done as a single, 16-byte transfer. However, if the pointer is not aligned to 16 bytes this will result in a crash.
I would suggest you look at:

llama.cpp/ggml/src/ggml-cuda/common.cuh, lines 301 to 312 in 2c0d875
llama.cpp/ggml/src/ggml-cuda/common.cuh, lines 573 to 597 in 2c0d875
- Add a template parameter for the alignment of the copy.
- At runtime, check the alignment of the tensors and run the template specialization with the maximum memory alignment supported by the hardware (I think this will need a utility function to fetch this property in host code).
- In the kernel, use ggml_cuda_memcpy_1 exactly once per thread to get optimal memory alignment. Avoid using the function with an alignment smaller than the copy size, since this will result in suboptimal memory bandwidth. (A sketch of this structure follows the list.)
- (Maybe change the kernel to use char * to make it more generally applicable.)
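A minimal sketch of the suggested structure, assuming ggml_cuda_memcpy_1<nbytes>(dst, src) from common.cuh performs a single nbytes-wide copy; the names, tail handling, and dispatch are illustrative rather than the PR's final code, and the dispatch checks pointer alignment as a simplification of querying the hardware's maximum supported alignment:

```cpp
template <int alignment>
static __global__ void cpy_contiguous(const char * __restrict__ cx, char * __restrict__ cdst,
                                      const int64_t nbytes) {
    const int64_t i0 = ((int64_t)blockIdx.x*blockDim.x + threadIdx.x) * alignment;
    if (i0 + alignment <= nbytes) {
        ggml_cuda_memcpy_1<alignment>(cdst + i0, cx + i0); // one aligned, wide copy per thread
    } else if (i0 < nbytes) {
        for (int64_t i = i0; i < nbytes; ++i) { // scalar tail for the last partial chunk
            cdst[i] = cx[i];
        }
    }
}

template <int alignment>
static void launch_cpy_contiguous(const char * cx, char * cdst, const int64_t nbytes, cudaStream_t stream) {
    const int64_t nthreads = (nbytes + alignment - 1) / alignment;
    const int64_t nblocks  = (nthreads + CUDA_CPY_BLOCK_SIZE - 1) / CUDA_CPY_BLOCK_SIZE;
    cpy_contiguous<alignment><<<nblocks, CUDA_CPY_BLOCK_SIZE, 0, stream>>>(cx, cdst, nbytes);
}

static void cpy_contiguous_dispatch(const char * cx, char * cdst, const int64_t nbytes, cudaStream_t stream) {
    // Pick the widest alignment both pointers satisfy; the kernel's tail loop
    // handles a total size that is not a multiple of the alignment.
    const uintptr_t align = (uintptr_t)cx | (uintptr_t)cdst;
    if      (align % 16 == 0) { launch_cpy_contiguous<16>(cx, cdst, nbytes, stream); }
    else if (align %  8 == 0) { launch_cpy_contiguous< 8>(cx, cdst, nbytes, stream); }
    else if (align %  4 == 0) { launch_cpy_contiguous< 4>(cx, cdst, nbytes, stream); }
    else                      { launch_cpy_contiguous< 1>(cx, cdst, nbytes, stream); }
}
```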
Is this kernel faster than the previous one?
It's still used: llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, lines 2698 to 2704 in 2c0d875
The return value is completely ignored. Even if it wasn't, the reason it was necessary in the first place is because we used …
Right, you meant removing the function call...
If the in-code comment regarding CUDA graph support is outdated, then my opinion is that we should simply use cudaMemcpyAsync.
The pointers in mamba should be the same on every token, so I don't think the indirection is necessary.
Thanks! I wasn't aware that pointer indirection wasn't required here; appreciate the insight. I tested this locally by deleting the following section:

llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu, lines 2691 to 2704 in 2c0d875

and modifying these lines to always use the direct destination pointer:

llama.cpp/ggml/src/ggml-cuda/cpy.cu, lines 332 to 336 in 2c0d875

With those changes, CUDA Graph execution ran without any issues, and performance (for Nemotron Nano v2) was as follows: [benchmark table not captured]. Testing as per the contribution guidelines also didn't raise any new issues. Based on this, it seems safe to remove the copy op pointer indirection code and revert to using cudaMemcpyAsync.
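For context, a minimal sketch of what such a revert could look like (assuming the usual ggml-cuda helpers CUDA_CHECK, ggml_nbytes, ggml_is_contiguous, and ctx.stream(); this is not the PR's actual diff):

```cpp
// Contiguous, same-type copy: a plain async device-to-device memcpy suffices,
// and cudaMemcpyAsync calls can be captured into CUDA graphs.
if (ggml_is_contiguous(src0) && ggml_is_contiguous(src1) && src0->type == src1->type) {
    CUDA_CHECK(cudaMemcpyAsync(src1->data, src0->data, ggml_nbytes(src0),
                               cudaMemcpyDeviceToDevice, ctx.stream()));
}
```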

In PR 16328, CUDA Graph support for the Nemotron Nano v2 (NemotronH) model was enabled by replacing the use of cudaMemcpyAsync with an existing CUDA copy kernel for copying contiguous tensors. However, that kernel is optimized for non-contiguous tensors.
This PR introduces a CUDA copy kernel for contiguous GGML tensors, which provides a performance improvement of ~3.7% for Nemotron Nano v2 on an RTX 5090.
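As a rough illustration of the idea (a hedged sketch, not the PR's actual kernel): for contiguous tensors the copy degenerates to a flat 1D copy, which can be vectorized into 16-byte transfers, subject to the alignment caveats raised in the review above.

```cpp
static __global__ void cpy_contiguous_f32(const float * __restrict__ cx,
                                          float * __restrict__ cdst,
                                          const int64_t ne) {
    // Each thread copies 4 consecutive floats as one 16-byte transfer,
    // assuming both pointers are 16-byte aligned.
    const int64_t i = ((int64_t)blockIdx.x*blockDim.x + threadIdx.x) * 4;
    if (i + 4 <= ne) {
        *(float4 *)(cdst + i) = *(const float4 *)(cx + i);
    } else {
        for (int64_t j = i; j < ne; ++j) { // scalar tail
            cdst[j] = cx[j];
        }
    }
}
```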
Results (RTX 5090):

- Weights: bartowski/nvidia_NVIDIA-Nemotron-Nano-9B-v2-GGUF
- Quantization: Q4_K_M
- Performance before: [benchmark table not captured]
- Performance after: [benchmark table not captured]