Checklist
Motivation
Currently, the dsv3_router_gemm kernel produces its output in fp32 precision. When it is applied to deepseek-r1 fp4, the output is converted to bf16 to avoid a bug. However, these extra conversions cause a performance drop when bs >= 4.
We should modify the router_gemm kernel so that it outputs a bf16 tensor directly. The dsv3_router_gemm kernel could then be applied when bs >= 4.
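To illustrate the idea, here is a minimal pure-Python sketch of accumulating in higher precision and rounding to bf16 once at store time, rather than writing an fp32 tensor and converting it in a separate step. This is not the actual kernel: `fp32_to_bf16_bits` and `router_logits_bf16` are hypothetical names, Python floats stand in for the fp32 accumulator, and only finite values are handled.

```python
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Round a finite value to bf16 (round-to-nearest-even); return the 16-bit pattern."""
    # Reinterpret the value as its fp32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Bias of 0x7FFF plus the lowest kept bit implements round-to-nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def router_logits_bf16(hidden, weight):
    """Hypothetical reference: one token's router logits, stored as bf16 bit patterns.

    hidden: list of K floats; weight: list of E rows, each with K floats.
    """
    out = []
    for row in weight:
        acc = 0.0  # higher-precision accumulator (stand-in for fp32)
        for h, w in zip(hidden, row):
            acc += h * w
        # Single rounding at store time; no intermediate fp32 output tensor.
        out.append(fp32_to_bf16_bits(acc))
    return out


logits = router_logits_bf16([1.0, 2.0], [[0.5, 0.25], [1.0, 1.0]])
print([hex(b) for b in logits])
```

In a real kernel, the equivalent change would presumably live in the epilogue, where the accumulator is written to global memory.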
Ref: #7627, #7677
Related resources
No response