Checklist
Motivation
This issue is opened to track multiple KVCache optimizations:
- Add support for FP4 KVCache quantization (a value-grid sketch is given after this list).
- Support the high-performance TRTLLM operators for the FP8 KVCache:
  We already support --kv-cache-dtype fp8_e4m3, which quantizes the KV Cache into FP8. However, in most cases the feature quantizes the KV Cache from BF16/FP16 to FP8 at store time and dequantizes it back to BF16/FP16 at read time for the attention calculation. This is suboptimal because time is wasted on quant/dequant (a sketch of this unfused path is given after this list). The optimal solution is to fuse the dequantization into the attention kernel, so no explicit dequantization is needed.
  TRTLLM GEMM has SOTA performance as of now and offers a variety of operators. The work is to pick the right operator that is compatible with the FP8 KV Cache.
- Support the high-performance TRTLLM operators for the FP4 KVCache:
  Similar to (2) above, do the integration to fuse the FP4 KVCache into the TRTLLM operators.
- Support MLA-based models, e.g. DeepSeek V3/R1. (Planning)
- Support MHA-based models, e.g. GPT-OSS. (Planning)
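The snippet below is a minimal sketch, in plain PyTorch, of the unfused FP8 path described in item (2): quantize at store time, dequantize at read time. The function names and the per-tensor scale are hypothetical simplifications; the real code operates on paged KV caches, and the point of this issue is to remove the explicit read-time dequantization by fusing it into the attention kernel.

```python
# Minimal sketch of the unfused FP8 KV cache path; names are hypothetical,
# real kernels live in SGLang / TRTLLM and operate on paged caches.
import torch

FP8 = torch.float8_e4m3fn  # matches --kv-cache-dtype fp8_e4m3

def store_kv_fp8(k: torch.Tensor, v: torch.Tensor, scale: float):
    # Quantize at store time: BF16 -> FP8 with a (simplified) per-tensor scale.
    return (k / scale).to(FP8), (v / scale).to(FP8)

def read_kv_bf16(k_fp8: torch.Tensor, v_fp8: torch.Tensor, scale: float):
    # Dequantize at read time: FP8 -> BF16 before calling the attention kernel.
    # This extra pass is the overhead that fusing dequant into attention removes.
    return (k_fp8.to(torch.bfloat16) * scale,
            v_fp8.to(torch.bfloat16) * scale)

if __name__ == "__main__":
    k = torch.randn(128, 8, 128, dtype=torch.bfloat16)
    v = torch.randn(128, 8, 128, dtype=torch.bfloat16)
    k_fp8, v_fp8 = store_kv_fp8(k, v, scale=1.0)
    k_deq, v_deq = read_kv_bf16(k_fp8, v_fp8, scale=1.0)
    print(k_deq.dtype, (k - k_deq).abs().max())
```

In the fused version, the FP8 blocks would be handed directly to an attention operator that understands the FP8 layout and scale, so `read_kv_bf16` above would not exist.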
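For item (1), the sketch below only illustrates the FP4 (e2m1) value grid that the quantization targets, assuming a simple per-block max scale. Packing two 4-bit values per byte, the block-scale layout, and the actual kernels are out of scope, and all names here are hypothetical.

```python
# Hypothetical illustration of FP4 (e2m1) quantization of one KV block.
import torch

# The 8 non-negative magnitudes representable in e2m1; negatives mirror them.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x: torch.Tensor):
    # Per-block scale so the largest magnitude maps to 6.0 (max e2m1 value).
    scale = x.abs().max() / 6.0
    scaled = x / scale
    # Round each element to the nearest representable e2m1 magnitude, keep sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * scaled.sign()
    return q, scale  # dequantization is simply q * scale

if __name__ == "__main__":
    block = torch.randn(16, dtype=torch.float32)
    q, scale = quantize_fp4_block(block)
    print((block - q * scale).abs().max())
```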
Related resources
No response