Support w8a8 int8 quantization config #2881
Conversation
zhyncs
left a comment
LGTM with minor issues; I left some comments.
BTW, please provide the results for bench_serving at request rates 8/16/32/inf.
We should ensure that we have an advantage not only with large batches but also with small batches, whether in terms of throughput or latency.
Also, when will support for SM90 be released?
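(For reproducibility, a sweep along these lines could collect those numbers. This is a minimal sketch; the exact bench_serving flags are assumptions and may differ across sglang versions.)

```bash
# Sketch of the requested sweep; flag names are assumed from sglang's
# bench_serving module and may differ by version.
for rate in 8 16 32 inf; do
  python3 -m sglang.bench_serving --backend sglang \
    --num-prompts 1000 --request-rate "$rate"
done
```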
| "bitsandbytes", | ||
| "gguf", | ||
| "modelopt", | ||
| "w8a8_int8", |
Is w8a8_int8 for int8 and w8a8_fp8 for fp8?
Maybe fp8 for w8a8_fp8?
I use w8a8_int8 since it's clearer, and there is already an fp8 config for w8a8_fp8.
cc: @HandH1998
Finally, when this feature is stable, we should default to using our implementation instead of the compressed-tensors implementation when using w8a8 int8.
Note: per-channel symmetric int8 is sufficient for most cases, so asymmetric quantization can remain unsupported for now.
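(As a concrete reference for what per-channel symmetric int8 means here, a minimal sketch; this is illustrative only, not the PR's kernel, and a `[out_features, in_features]` weight layout is assumed.)

```python
import torch

def quant_weight_per_channel_symmetric(w: torch.Tensor):
    """Illustrative per-output-channel symmetric int8 quantization.

    Symmetric: the zero-point is 0, so only one scale per channel is stored.
    """
    # One scale per output channel, mapping max |w| to the int8 limit 127.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return w_q, scale  # dequantize as w_q.float() * scale
```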
Output throughput of bench_serving at 8/16/32/inf request rates:
Qwen2-7B-Instruct W8A8:
Looks good. How about the latency (TTFT and ITL)?
TTFT, TPOT, and ITL are all reduced with the w8a8_int8 config, and the acceleration is larger at higher QPS. The benchmark results are attached here for reference: benchmark results
Motivation
Add a quantization config for w8a8 int8, using the int8 GEMM in sgl-kernel and an int8 quantization kernel.
w8a8_int8 achieves ~10% higher output throughput with no accuracy loss compared to the original compressed-tensors config (tested on A100).
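(The runtime path can be summarized by the following reference sketch. Assumptions: per-token dynamic activation quantization and per-channel weight scales; the actual implementation is the CUDA int8 GEMM in sgl-kernel, emulated here with an int32 matmul.)

```python
import torch

def w8a8_int8_linear_ref(x: torch.Tensor, w_q: torch.Tensor,
                         w_scale: torch.Tensor) -> torch.Tensor:
    """Reference of a w8a8 int8 linear layer (a sketch, not sgl-kernel).

    x:       [tokens, in_features] activations (fp16/fp32)
    w_q:     [out_features, in_features] int8 weights
    w_scale: [out_features, 1] per-channel weight scales
    """
    # Dynamic per-token symmetric int8 quantization of activations.
    x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    x_q = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
    # int32 matmul emulates the int8 GEMM's int32 accumulation (CPU path).
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32).t()
    # Dequantize with per-token and per-channel scales.
    return acc.to(x.dtype) * x_scale * w_scale.squeeze(-1)
```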
Meta-Llama-3-8B-Instruct W8A8:
Qwen2-7B-Instruct W8A8:
Usage: add `--quantization w8a8_int8` to the server args, as in the launch sketch below. Compatible with HF checkpoints with per-channel symmetric int8 quantization.

cc: @merrymercy @zhyncs @HandH1998
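(A launch could look like the following; the model path is a placeholder and must point to a checkpoint already quantized with per-channel symmetric int8.)

```bash
# Placeholder model path; only --quantization w8a8_int8 is the new flag.
python3 -m sglang.launch_server \
  --model-path /path/to/w8a8-int8-checkpoint \
  --quantization w8a8_int8
```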