Conversation
Reproducible with `while ./bin/test-backend-ops -o ROPE -b CUDA0 ; do date ; done` (example failure omitted). I think it is related to CUDA graphs, because the failure does not reproduce when they are disabled. @ORippler Can you confirm this failure on your end?
On my side (Ada 6000), I am sporadically seeing failures in different tests (unary ops like relu) when CUDA graphs are enabled.
While there seems to be an issue with CUDA graph reuse/stability, I don't think it's specific to the ROPE operator (and thus this PR). @JohannesGaessler @am17an Mind pinging me the latest changes that were made to CUDA graphs (I faintly recall us introducing and reverting leaf-node checks for the top-k operator, but I'm not sure if there have been changes since) so I can run a bisect?
This code doesn't match properties when params change; I think ROPE should also be here, and perhaps GGML_OP_UNARY too (ggml/src/ggml-cuda/ggml-cuda.cu, lines 2982 to 2985 in eaba92c). The CUDA graph mistakenly thinks that the graph can be re-used. It probably doesn't show up when you run all tests because the consecutive-updates disable kicks in (the …)
@am17an Yes, this patch seems to fix the ROPE failures on my end:

```diff
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 08383edb4..3eaf7763c 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2973,8 +2973,7 @@ static bool ggml_cuda_graph_node_properties_match(ggml_tensor * node, ggml_cuda_
         }
     }

-    if ((node->op == GGML_OP_SCALE || node->op == GGML_OP_GLU) &&
-        memcmp(props->op_params, node->op_params, GGML_MAX_OP_PARAMS) != 0) {
+    if (memcmp(props->op_params, node->op_params, GGML_MAX_OP_PARAMS) != 0) {
        return false;
    }
```

I'm not sure what the reasoning was to do this check only for SCALE and GLU. I think it is safe to always do it?
It would be better to always do it, I think. I can't think of an operator where the …
Should that be added into this PR, or is it handled separately?
Tested this PR; shifting appears to work fine on GLM 4.7 Flash for me now. Closes #19292
It seems the memory layout is shared with Vulkan, so we can port the fix from ggml-org#19299.
JohannesGaessler left a comment:
It may make sense to start adding fastdiv to RoPE while we're at it, but it's also fine to merge this PR without it. (To my understanding, size_t strides are not relevant for Blackwell performance since there is no loop where pointer aliasing could occur.)
ggml/src/ggml-cuda/rope.cu (Outdated)

```cpp
const size_t nb01 = src0->nb[1] / ggml_type_size(src0->type);
const size_t nb02 = src0->nb[2] / ggml_type_size(src0->type);
const size_t nb03 = src0->nb[3] / ggml_type_size(src0->type);

const size_t nb11 = dst->nb[1] / ggml_type_size(dst->type);
const size_t nb12 = dst->nb[2] / ggml_type_size(dst->type);
const size_t nb13 = dst->nb[3] / ggml_type_size(dst->type);
```
To avoid confusion, please use e.g. nb01 only for the strides between elements in bytes. For the logical strides in units of the element size I've been using e.g. s01 in the CUDA backend.
This happens on the host side, and the conversion to …
* Rename variables + fix rope_neox

  Seems memory layout is shared with Vulkan so we can port fix from ggml-org#19299

* Fix rope_multi
* Fix rope_vision
* Fix rope_norm
* Rename ne* to ne0* for consistent variable naming
* cont : consistent stride names

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a port of #19299 to the CUDA backend, which should fix the broken logic revealed by the tests added in #19296.
Thanks @jeffbolznv for the work in #19299.