Checklist
Motivation
The w8a8 GEMM kernel and the fused MoE kernel underperform on B200, and there is room for tuning both.
Related resources
Reproduction on 8*B200:
python3 -m sglang.bench_one_batch --model-path /dev/shm/DeepSeek-V3 --tp 8 --batch 16 --input-len 1024 --output-len 128 --attention-backend triton --profile