Add deepseek_v3 fused gate #3191
Conversation
```python
# Your module under test
output, indices_my = deepseekv3_fused_gate(tensor, bias, seq_length)

###### Reference Implementation ######
```
Please refactor this code into a standalone function, which can be directly used from https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/topk.py#L111-L147.
Do you mean I should separate the reference implementation into a standalone function?
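For reference, a standalone version of the PyTorch gate (in the spirit of the `biased_grouped_topk` path in `topk.py`) could look roughly like the sketch below. The function name and the DeepSeek V3 defaults (256 experts, 8 groups, top-2 scores per group, `topk_group=4`, `top_k=8`) are illustrative assumptions, not code from this PR:

```python
import torch

def reference_deepseekv3_gate(scores, bias, top_k=8, num_groups=8, topk_group=4):
    # Group-limited top-k routing with a per-expert bias, as in DeepSeek V3.
    num_tokens, num_experts = scores.shape
    scores = scores.sigmoid()
    scores_with_bias = scores + bias  # bias steers selection only, not the weights

    # Score each group by the sum of its top-2 biased scores; keep the best groups.
    group_scores = (
        scores_with_bias.view(num_tokens, num_groups, -1).topk(2, dim=-1)[0].sum(dim=-1)
    )
    group_idx = group_scores.topk(topk_group, dim=-1, sorted=False)[1]
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_groups, num_experts // num_groups)
        .reshape(num_tokens, -1)
    )

    # Top-k over experts in the surviving groups; weights come from unbiased scores.
    masked = scores_with_bias.masked_fill(expert_mask == 0, float("-inf"))
    indices = masked.topk(top_k, dim=-1, sorted=False)[1]
    weights = scores.gather(1, indices)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    # A routed scaling factor (cf. the kernel's route_scale argument) could be applied here.
    return weights, indices
```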
```python
output_ref = weights.type_as(scores)

# Assertions
output_check = torch.allclose(output_ref.sort()[0], output.sort()[0], rtol=1e-04, atol=1e-05)
```
Why not directly compare output and output_ref instead of sorting them?
This is weird: the kernel sometimes produces exactly the same outputs, just in a different order. I checked the downstream steps and the output order does not matter, so I compared the sorted outputs in the unit test. Is this OK?
We need to determine at which specific step of the fused kernel this inconsistency in order occurs. Additionally, we need to clarify whether running the PyTorch implementation twice with the same input would result in inconsistent output orders. Finally, if you believe that the current order inconsistency does not affect the fused MoE accuracy, you need to provide an end-to-end result, such as running the GSM8K test with the DeepSeek V3 model.
I see, I will check the inconsistency inside the kernel. I cannot run the e2e test on my server; Yineng will help me run it.
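If the order difference turns out to be benign, one way to make the assertion stricter than comparing independently sorted tensors (which could pass even when weights and indices are paired differently) is to sort each row by expert index and compare weights expert-by-expert. A minimal sketch; the helper name and tolerances are illustrative:

```python
import torch

def assert_topk_equal(weights, indices, weights_ref, indices_ref, rtol=1e-4, atol=1e-5):
    # Order-insensitive per-row comparison: sort both results by expert index
    # so each weight is checked against the weight for the same expert.
    order = indices.argsort(dim=-1)
    order_ref = indices_ref.argsort(dim=-1)
    assert torch.equal(indices.gather(-1, order), indices_ref.gather(-1, order_ref))
    torch.testing.assert_close(
        weights.gather(-1, order), weights_ref.gather(-1, order_ref),
        rtol=rtol, atol=atol,
    )
```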
```python
from sgl_kernel import deepseekv3_fused_gate

@pytest.mark.parametrize("seq_length", range(1, 20000))
```
Can you add a benchmark script? Maybe refer to https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark
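A minimal benchmark in the style of the scripts in sgl-kernel/benchmark might look like the sketch below. It assumes the `deepseekv3_fused_gate(x, bias, seq_length)` signature from the test above; the shapes and dtype (256 experts, fp32) are illustrative:

```python
import torch
import triton
from sgl_kernel import deepseekv3_fused_gate  # assumes the extension is built

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["seq_length"],
        x_vals=[64, 128, 256, 512, 1024, 2048],
        line_arg="provider",
        line_vals=["sgl_kernel"],
        line_names=["deepseekv3_fused_gate"],
        ylabel="latency (ms)",
        plot_name="deepseekv3-fused-gate",
        args={},
    )
)
def benchmark(seq_length, provider):
    x = torch.randn(seq_length, 256, device="cuda", dtype=torch.float32)
    bias = torch.randn(256, device="cuda", dtype=torch.float32)
    # do_bench handles warmup and repetition and reports latency in ms.
    return triton.testing.do_bench(lambda: deepseekv3_fused_gate(x, bias, seq_length))

if __name__ == "__main__":
    benchmark.run(print_data=True)
```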
```python
bmm_fp8,
custom_dispose,
custom_reduce,
deepseekv3_fused_gate,
```
It seems more appropriate to name it deepseek_fused_gate here, as models from the deepseek series can all go through this gate function.
This is not a generalized kernel; it only works for the DeepSeek V3 671B model.
I think it also works for DeepSeek V2 VL
```cpp
input.data_ptr(), bias.data_ptr(), output.data_ptr(), indices.data_ptr<int64_t>(), num_rows, k, route_scale
);

CHECK_CUDA_SUCCESS(cudaDeviceSynchronize());
```
Synchronization is not allowed in the kernel's host-side code, as it will cause CUDA graph capture to crash. Can you remove it?
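For background on why the sync breaks CUDA graphs: `cudaDeviceSynchronize()` is illegal during stream capture, so any attempt to capture the op fails. A quick graph-safety check, sketched with assumed shapes and dtype:

```python
import torch
from sgl_kernel import deepseekv3_fused_gate

x = torch.randn(8, 256, device="cuda", dtype=torch.float32)
bias = torch.randn(256, device="cuda", dtype=torch.float32)

# Warm up on a side stream so lazy initialization happens before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    deepseekv3_fused_gate(x, bias, 8)
torch.cuda.current_stream().wait_stream(s)

# Capture raises a CUDA error if the op calls cudaDeviceSynchronize().
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    deepseekv3_fused_gate(x, bias, 8)
g.replay()
```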
```cpp
@@ -0,0 +1,219 @@
#include <cfloat>
```
Please add a comment noting this was adapted from https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L231.
Note that this code was open source up to v0.16.0 but has been moved into a closed-source static library as of v0.17.0.
In TensorRT-LLM, the fused MoE module, in addition to the …
sounds good @BBuf |
Yeah, I can have a try. |
@NovTi, are you still working on this PR? Have you referenced any other open-sourced implementation?
Add deepseek v3 fused gate module