vulkan: Update topk_moe fusion to handle gpt's late softmax #16656
0cc4m merged 5 commits into ggml-org:master from
Conversation
CC @am17an I've included the ggml_check_edges change in this PR.
0cc4m left a comment
I understand what this change is doing, but how do I test it? The topk_moe tests pass before and after this change. Which model architectures correspond to the three modes?
Usually I put a debug statement printing the number of nodes fused. We'll need to come up with a better way to assert that the nodes were actually fused.
I've added some logging in the latest commit that I use to verify fusion and the effects of graph_optimize. You can see the whole sequence of ops without a sync in between, which implies the fusion is working. Early softmax w/ norm: qwen3
I've rebased this and updated it to handle the clamp added in #16655.
Are the non-Vulkan changes fine, @slaren?
vulkan: Update topk_moe fusion to handle gpt's late softmax (#16656)

* vulkan: Update topk_moe fusion to handle gpt's late softmax

  Based on #16649.

* Add ggml_check_edges
* Add sync logging to show fusion effects
* handle clamp added in #16655
* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <slarengh@gmail.com>