Support w8a8 fp8 kernel with CUTLASS #3047
Conversation
We have fixed the review issues and resolved the conflicts. We also tried to optimize performance on sm90, but it still can't beat vLLM in all cases. The final results show that our kernel and vLLM's each have advantages in different cases.
Why does it fail when I run `pip install .` in the sgl-kernel dir?
@ll2088 Please run
@ll2088 The build-wheels CI works well, so I think the issue is caused by your local environment.
Which version of FlashInfer are you using?
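For version questions like this, a quick dependency-free check with the standard library can help (the package names in the loop are just illustrative — the actual pip distribution names for these projects may differ):

```python
from importlib.metadata import version, PackageNotFoundError

def pkg_version(name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"

# Illustrative names for packages relevant to a local sgl-kernel build.
for pkg in ("flashinfer", "torch", "sgl-kernel"):
    print(pkg, pkg_version(pkg))
```

Pasting this output alongside the build error usually narrows down environment issues faster than the traceback alone.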
@HandH1998 Please paste the latest benchmark results. Thanks! |
Co-authored-by: yych0745 <1398089567@qq.com>


Support sm89 and sm90 fp8 GEMM implementations with CUTLASS for w8a8 fp8 quantization. Co-authors: @yych0745 @b0urnee
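For context, the rescaling epilogue that a w8a8 fp8 GEMM of this kind fuses into the kernel can be sketched in NumPy. This is a simplified emulation — float32 arrays stand in for the fp8 (e4m3) operands, and the per-token activation scales / per-channel weight scales are an assumption based on the usual w8a8 scheme, not read from this PR's code:

```python
import numpy as np

# Shapes: A_q is (M, K) quantized activations, B_q is (K, N) quantized weights.
M, K, N = 4, 8, 5
rng = np.random.default_rng(0)

# float32 stand-ins for fp8-quantized operands (the real kernel stores e4m3).
A_q = rng.standard_normal((M, K)).astype(np.float32)
B_q = rng.standard_normal((K, N)).astype(np.float32)

# Assumed scheme: per-token scales for activations, per-channel for weights.
a_scale = rng.uniform(0.5, 2.0, size=(M,)).astype(np.float32)
b_scale = rng.uniform(0.5, 2.0, size=(N,)).astype(np.float32)

# Fused epilogue: accumulate the low-precision matmul, then rescale the output.
C = (A_q @ B_q) * a_scale[:, None] * b_scale[None, :]

# Reference: dequantize first, then matmul -- mathematically identical.
C_ref = (A_q * a_scale[:, None]) @ (B_q * b_scale[None, :])
assert np.allclose(C, C_ref)
```

Folding the scales into the epilogue is what lets the kernel keep the inner loop in fp8 while still producing correctly dequantized outputs.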
Benchmark
GPU: sm89-L40 (benchmark charts omitted; each model below was measured at TP=1, TP=4, and TP=8)
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct
mistralai/Mistral-Large-Instruct-2407
Qwen/Qwen2.5-7B-Instruct
Qwen/Qwen2.5-32B-Instruct
Qwen/Qwen2.5-72B-Instruct
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
GPU: sm90-H100 (benchmark charts omitted; each model below was measured at TP=1, TP=4, and TP=8)
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct
mistralai/Mistral-Large-Instruct-2407
Qwen/Qwen2.5-7B-Instruct
Qwen/Qwen2.5-32B-Instruct
Qwen/Qwen2.5-72B-Instruct
deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct