musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy#13647
Conversation
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
JohannesGaessler left a comment:
You are replacing only the case where the memory of one tensor is copied to another tensor as one contiguous block. I would have intuitively assumed that a plain memcpy would perform quite well in that scenario; how much faster is the mudnn implementation?
In my local tests on the MTT S80, I observed nearly a 70% […]

I also have a question regarding how […]
@ggerganov can you make @yeahdongcn a collaborator so that he can merge approved PRs at his own discretion?

Yes, invite sent.

Thanks @JohannesGaessler @ggerganov Just accepted the invitation.
…CHECK Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
I've removed […]
…ITY op to accelerate D2D memory copy (ggml-org#13647)

* musa: fix build warning (unused parameter)
* musa: upgrade MUSA SDK version to rc4.0.1
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
* Update ggml/src/ggml-cuda/cpy.cu
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Testing Done
- `test-backend-ops -o CPY` passed
- `docker run -it -v ~/models:/models local/llama.cpp:light-musa -m /models/deepseek-r1_7b_q4_0.gguf -ngl 999`
- `docker run -p 8080:8080 -it -v ~/models:/models local/llama.cpp:server-musa -m /models/deepseek-r1_7b_q4_0.gguf -ngl 999`
- `docker run -it -v ~/models:/models local/llama.cpp:full-musa --run -m /models/deepseek-r1_7b_q4_0.gguf -ngl 999`

Logs: