Checklist
Motivation
This issue is opened to track multiple KVCache optimizations:
- Add support for FP4 KVCache quantization (a value-grid sketch is given after this list).
- Support the high-performance TRTLLM operators for the FP8 KVCache:
  We already support --kv-cache-dtype fp8_e4m3, which quantizes the KV Cache into FP8. However, in most cases the feature quantizes the KV Cache from BF16/FP16 to FP8 at store time and dequantizes it back to BF16/FP16 at read time for the attention calculation. This is suboptimal because time is wasted on quant/dequant (a sketch of this unfused path is given after this list). The optimal solution is to fuse the dequantization into the attention kernel, so no explicit dequantization is needed.
  TRTLLM GEMM has SOTA performance as of now and offers a variety of operators. The work is to pick the right operator that is compatible with the FP8 KV Cache.
- Support the high-performance TRTLLM operators for the FP4 KVCache:
  Similar to (2) above, do the integration to fuse the FP4 KVCache into the TRTLLM operators.
- Support MLA-based models, e.g. DeepSeek V3/R1. (Planning)
- Support MHA-based models, e.g. GPT-OSS. (Planning)
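The snippet below is a minimal sketch, in plain PyTorch, of the unfused FP8 path described in item (2): quantize at store time, dequantize at read time. The function names and the per-tensor scale are hypothetical simplifications; the real code operates on paged KV caches, and the point of this issue is to remove the explicit read-time dequantization by fusing it into the attention kernel.

```python
# Minimal sketch of the unfused FP8 KV cache path; names are hypothetical,
# real kernels live in SGLang / TRTLLM and operate on paged caches.
import torch

FP8 = torch.float8_e4m3fn  # matches --kv-cache-dtype fp8_e4m3

def store_kv_fp8(k: torch.Tensor, v: torch.Tensor, scale: float):
    # Quantize at store time: BF16 -> FP8 with a (simplified) per-tensor scale.
    return (k / scale).to(FP8), (v / scale).to(FP8)

def read_kv_bf16(k_fp8: torch.Tensor, v_fp8: torch.Tensor, scale: float):
    # Dequantize at read time: FP8 -> BF16 before calling the attention kernel.
    # This extra pass is the overhead that fusing dequant into attention removes.
    return (k_fp8.to(torch.bfloat16) * scale,
            v_fp8.to(torch.bfloat16) * scale)

if __name__ == "__main__":
    k = torch.randn(128, 8, 128, dtype=torch.bfloat16)
    v = torch.randn(128, 8, 128, dtype=torch.bfloat16)
    k_fp8, v_fp8 = store_kv_fp8(k, v, scale=1.0)
    k_deq, v_deq = read_kv_bf16(k_fp8, v_fp8, scale=1.0)
    print(k_deq.dtype, (k - k_deq).abs().max())
```

In the fused version, the FP8 blocks would be handed directly to an attention operator that understands the FP8 layout and scale, so `read_kv_bf16` above would not exist.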
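For item (1), the sketch below only illustrates the FP4 (e2m1) value grid that the quantization targets, assuming a simple per-block max scale. Packing two 4-bit values per byte, the block-scale layout, and the actual kernels are out of scope, and all names here are hypothetical.

```python
# Hypothetical illustration of FP4 (e2m1) quantization of one KV block.
import torch

# The 8 non-negative magnitudes representable in e2m1; negatives mirror them.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x: torch.Tensor):
    # Per-block scale so the largest magnitude maps to 6.0 (max e2m1 value).
    scale = x.abs().max() / 6.0
    scaled = x / scale
    # Round each element to the nearest representable e2m1 magnitude, keep sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * scaled.sign()
    return q, scale  # dequantization is simply q * scale

if __name__ == "__main__":
    block = torch.randn(16, dtype=torch.float32)
    q, scale = quantize_fp4_block(block)
    print((block - q * scale).abs().max())
```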
Related resources
No response