Apply sgl w8a8 fp8 kernel #3148
Conversation
Let me bump a new sgl-kernel version to unblock this PR.
@HandH1998 What is the status of this PR? Please let me know when it is ready.
@zhyncs It should be ready in two days.
@merrymercy @zhyncs I also added a quantization config.
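For context, here is a minimal sketch of what such a w8a8 fp8 quantization config could look like. The class and field names are hypothetical, chosen for illustration; they are not the exact names added in this PR.

```python
from dataclasses import dataclass

@dataclass
class W8A8Fp8Config:
    """Hypothetical shape of a w8a8 fp8 quantization config
    (illustrative field names, not the exact ones in this PR)."""
    quant_method: str = "w8a8_fp8"
    is_checkpoint_fp8_serialized: bool = True  # weights already stored as fp8
    activation_scheme: str = "dynamic"         # per-token scales computed at runtime
```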
#3493 @HandH1998 This has been merged.
@zhyncs Please update this to match Line 47 in 96263f2. Do we need to upload to PyPI?
The two failed CIs seem to be related to DSv3. I tried to reproduce them locally, but I can't find …
@HandH1998 You can give me your HF user name, or use DeepSeek V3/R1 for testing. I have also updated this, so if you wish to upgrade, please update this as well: sglang/scripts/ci_install_dependency.sh (Line 29 in 70866b6).
@HandH1998 Do you think we should support a similar API like …
The cutlass w8a8 fp8 kernel only supports per-token activation scales, so I only apply per_token_quant. The …
My HF user name is HandH1998. |
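For readers unfamiliar with per_token_quant, here is a minimal PyTorch sketch of what per-token fp8 activation quantization computes: one dynamic scale per row (token) of the activation matrix. This is only an illustration of the math, not the fused CUDA implementation in sgl-kernel.

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def per_token_quant_fp8(x: torch.Tensor):
    """Quantize a [num_tokens, hidden_dim] activation tensor to fp8,
    producing one dynamic scale per token (row). Illustrative sketch,
    not the sgl-kernel CUDA kernel."""
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax.to(torch.float32) / FP8_E4M3_MAX       # [num_tokens, 1]
    x_q = (x.to(torch.float32) / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return x_q.to(torch.float8_e4m3fn), scale
```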


Following #3047, we replace the w8a8 fp8 vLLM kernel with the sgl-kernel implementation. Generally, the w8a8 fp8 sgl-kernel yields higher accuracy on gsm8k. On sm89 (L40), the w8a8 fp8 sgl-kernel delivers 14% higher throughput than the vLLM kernel; on sm90 (H100), both kernels perform similarly.
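As a rough illustration of the data flow this PR switches over (quantize activations per token, then run an fp8 GEMM with per-token activation scales and per-channel weight scales), here is a sketch that uses torch._scaled_mm as a stand-in for the cutlass GEMM in sgl-kernel. The actual sgl-kernel entry point and signature may differ, and row-wise scaling in torch._scaled_mm requires a recent PyTorch on sm89/sm90.

```python
import torch

def w8a8_fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor):
    """Sketch of the w8a8 fp8 matmul path. `w_fp8` is a pre-quantized
    [out_features, in_features] fp8 weight and `w_scale` its [1, out_features]
    per-channel scales. torch._scaled_mm stands in here for the
    sgl-kernel cutlass GEMM."""
    x_q, x_scale = per_token_quant_fp8(x)  # per-token scales, as sketched earlier
    return torch._scaled_mm(
        x_q,                 # [M, K] fp8 activations
        w_fp8.t(),           # [K, N] fp8 weights, column-major after transpose
        scale_a=x_scale,     # [M, 1] per-token activation scales
        scale_b=w_scale,     # [1, N] per-channel weight scales
        out_dtype=torch.bfloat16,
    )
```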
Benchmark

model: neuralmagic/Meta-Llama-3-8B-Instruct-FP8

sm89-L40: [figure: gsm8k accuracy] [figure: throughput (tok/s) under various request rates]
sm90-H100: [figure: gsm8k accuracy] [figure: throughput (tok/s) under various request rates]

model: neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8-dynamic (activation dynamic quantization)

sm89-L40: [figure: gsm8k accuracy] [figure: throughput (tok/s) under various request rates]
sm90-H100: [figure: gsm8k accuracy] [figure: throughput (tok/s) under various request rates]