
feat: Add FP4 (E2M1) KV Cache Support for MHA#12612

Merged
Fridge003 merged 1 commit into sgl-project:main from bytedance-iaas:horenc/kv4_mha_on_main_release
Nov 15, 2025

Conversation

@JackChuang
Contributor

Summary

This PR introduces support for FP4 (float4_e2m1fn_x2) KV caching in Multi-Headed Attention (MHA) models, e.g., Qwen and GPT-OSS. See #10083, points 1-2, for more context.

Co-authored-by: @yicwang (Yichen Wang <yichen.wang@bytedance.com>)

Usage

$ python3 -m sglang.launch_server --kv-cache-dtype fp4_e2m1 ... 

Motivation and Benefits

Large models often face GPU memory constraints when storing KV cache.
By introducing FP4 quantization with scale buffers, this PR significantly reduces KV memory usage and improves efficiency:

  • Supports significantly more tokens than KV8 (≈1.78×) and KV16 (≈3.56×), thanks to FP4 quantization with block_size = 16.
  • Improves scalability for longer context windows and throughput for large-batch requests.
  • Enables inference of larger models or longer context windows on memory-limited GPUs.
  • Integrates seamlessly with existing inference pipelines without breaking KV16/KV8 workflows.
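The ≈1.78× and ≈3.56× figures follow from simple bytes-per-value accounting. The sketch below reproduces them under one assumption not spelled out in the PR: each 16-element block carries a single 1-byte scale.

```python
# Back-of-envelope KV cache capacity ratios.
# Assumption (illustrative): one 1-byte scale per 16-element block.
BLOCK_SIZE = 16
bytes_kv16 = 2.0                         # fp16: 2 bytes per value
bytes_kv8 = 1.0                          # fp8_e4m3: 1 byte per value
bytes_kv4 = 0.5 + 1.0 / BLOCK_SIZE       # packed E2M1 nibble + amortized scale

gain_vs_kv8 = bytes_kv8 / bytes_kv4      # ≈ 1.78
gain_vs_kv16 = bytes_kv16 / bytes_kv4    # ≈ 3.56
print(round(gain_vs_kv8, 2), round(gain_vs_kv16, 2))
```

Note that the scale overhead is why the gain over KV8 is 1.78× rather than a clean 2×.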

Key Changes

  • MHATokenToKVPool
    • Added FP4 KV cache support with uint8 storage format.
    • Introduced k_scale_buffer and v_scale_buffer for per-block scaling factors.
    • Integrated batched quantization (on update) and dequantization (on access) using KVFP4QuantizeUtil.
  • ModelRunner
    • Updated GPU memory estimation logic to account for FP4 cache and scale buffers.
  • Compatibility
    • Preserves existing FP16/FP8 KV cache behavior without changes.
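To make the update/access path concrete, here is a minimal pure-Python sketch of per-block E2M1 quantization with an absmax scale, mirroring the idea behind KVFP4QuantizeUtil. Names and details are illustrative, not the actual SGLang implementation (which stores packed uint8 codes and runs batched on GPU).

```python
# E2M1 has 8 representable magnitudes; a sign bit completes the 4-bit code.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16

def quantize_block(values):
    """Quantize one block to signed grid indices plus one shared scale."""
    absmax = max(abs(v) for v in values) or 1.0
    scale = absmax / 6.0  # map the block's range onto the grid (max magnitude 6)
    codes = []
    for v in values:
        mag = abs(v) / scale
        idx = min(range(len(E2M1_GRID)), key=lambda i: abs(E2M1_GRID[i] - mag))
        codes.append((idx, v < 0))
    return codes, scale

def dequantize_block(codes, scale):
    """Reconstruct approximate values from codes and the block scale."""
    return [(-1.0 if neg else 1.0) * E2M1_GRID[idx] * scale
            for idx, neg in codes]

block = [0.1 * i - 0.8 for i in range(BLOCK_SIZE)]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
```

The real pool stores the codes in uint8 (two per byte) and the scales in k_scale_buffer / v_scale_buffer; quantization happens on cache update and dequantization on access.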

Accuracy tests for KV4 MHA

  • FP4 KV cache is well-suited for large-scale models, providing memory savings with minimal accuracy impact.
  • For smaller models, careful evaluation is needed to balance memory efficiency and accuracy.

Qwen3-235B-A22B

| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
| --- | --- | --- | --- | --- | --- | --- |
| KV4 (fp4_e2m1) | gsm8k | mean_acc | main | 6595 | 0.9186 | default |
| KV4 | aime25 | mean_acc | OVERALL | 150 | 0.6 | - |
| KV4 | gpqa_diamond | mean_acc | default | 990 | 0.6778 | default |
| KV8 (fp8_e4m3) | gsm8k | mean_acc | main | 6595 | 0.9181 | default |
| KV8 | aime25 | mean_acc | OVERALL | 150 | 0.7333 | - |
| KV8 | gpqa_diamond | mean_acc | default | 990 | 0.6899 | default |
| KV16 | gsm8k | mean_acc | main | 6595 | 0.9168 | default |
| KV16 | aime25 | mean_acc | OVERALL | 150 | 0.7733 | - |
| KV16 | gpqa_diamond | mean_acc | default | 990 | 0.701 | default |

gpt-oss-120b

| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
| --- | --- | --- | --- | --- | --- | --- |
| KV4 (fp4_e2m1) | aime25 | mean_acc | OVERALL | 150 | 0.3533 | - |
| KV4 | gsm8k | mean_acc | main | 6595 | 0.9152 | default |
| KV4 | gpqa_diamond | mean_acc | default | 990 | 0.3202 | default |
| KV8 (fp8_e4m3) | aime25 | mean_acc | OVERALL | 150 | 0.7667 | - |
| KV8 | gsm8k | mean_acc | main | 6595 | 0.9163 | default |
| KV8 | gpqa_diamond | mean_acc | default | 990 | 0.5434 | default |
| KV16 | aime25 | mean_acc | OVERALL | 150 | 0.7533 | - |
| KV16 | gsm8k | mean_acc | main | 6595 | 0.9161 | default |
| KV16 | gpqa_diamond | mean_acc | default | 990 | 0.5081 | default |

Observation:  

  • On large models (Qwen3-235B-A22B), FP4 maintains accuracy close to FP8/FP16.  
  • On smaller models (gpt-oss-120b), FP4 shows more pronounced accuracy drops on difficult datasets.  
  • Trend: Accuracy degradation is more significant in smaller models.

Performance Results

Although speed is not the main goal of this PR (it will be addressed in #10083, point 3-2), we ran throughput tests using torch_native as a reference:

  • Reason for torch_native:
      - Other backends (e.g., trtllm_mha, Triton attention) have fused kernels for FP8 only, which makes FP8 faster there.
      - KV8 lacks a fused kernel on torch_native, so KV4 and KV8 are measured on an equal footing on the same backend.

    Note: KV8 previously could not run with the torch_native attention backend; this was fixed in #12596 (Support kv8 (FP8) with torch_native attention backend).

  • Test configuration:  
      - --num-prompts: 100–400  
      - --max-concurrency: 50–200  
      - Unit: Output token throughput (tok/s)
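The configuration above maps onto sglang's serving benchmark. The invocation below is an illustrative sketch, not a command taken from the PR: the model path is assumed, and flags should be adapted to your sglang version.

```shell
# Start a server with the FP4 KV cache on the torch_native backend
# (model path is illustrative).
python3 -m sglang.launch_server --model-path Qwen/Qwen3-235B-A22B \
    --kv-cache-dtype fp4_e2m1 --attention-backend torch_native &

# Measure output token throughput at one of the tested operating points.
python3 -m sglang.bench_serving --backend sglang \
    --num-prompts 400 --max-concurrency 200
```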

| Num Prompts | Concurrency | KV8 (tok/s) | KV4 (tok/s) | Gain | TTFT (ms) | TPOT (ms) |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | 50 | 62.43 | 60.35 | -3.33% | 5323 | 798 |
| 200 | 100 | 67.34 | 68.02 | +1.0% | 9378 | 1480 |
| 300 | 150 | 68.81 | 71.63 | +4.1% | 13500 | 2172 |
| 400 | 200 | 69.75 | 74.19 | +6.36% | 19595 | 2685 |

Observation:

  • KV4 starts slightly behind KV8 at low concurrency (-3.33% at 100 prompts / 50 concurrency) but pulls ahead as load grows, reaching +6.36% at 400 prompts / 200 concurrency.

Checklist

Based on PR sgl-project#10078, this patch
- introduces FP4 KV cache support in MHATokenToKVPool with uint8 storage.
- adds k_scale_buffer and v_scale_buffer to store FP4 scaling factors.
- implements batched quantization on cache update and dequantization on access.
- updates ModelRunner memory estimation to account for FP4 scale buffers.
- maintains backward compatibility with FP16/FP8 KV cache.

Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Co-authored-by: Yichen Wang <yichen.wang@bytedance.com>

@JackChuang
Contributor Author

Hi @Fridge003 @AniZpZ @zhyncs,
Thank you very much for helping review and merge the PR for MLA KV4 (#10078).
Could you please help review this PR for MHA KV4? Thank you!

@rainj-me rainj-me added the run-ci label Nov 7, 2025
@JackChuang
Contributor Author

Hi @zhyncs @AniZpZ @Fridge003,
Would really appreciate it if someone could take a quick look at this PR when you have a moment. Thanks!

Comment thread on python/sglang/srt/model_executor/model_runner.py:

```python
    )
    for _ in range(self.layer_num)
]
if is_float4_e2m1fn_x2(self.dtype):
```
Collaborator


Is it possible to overload the MHATokenToKVPool with a MHATokenToKVPoolFP4?
There are too many if-else branches here.
I feel the same change needs to be applied to MLA FP4 pool

Contributor Author


@Fridge003 It’s feasible. How about this — since I have to update the MLA and submit a new PR anyway, let me fix this issue in that PR as well. If you agree with this, I’ll start working on the PR.

Collaborator


Yes, let's open a new PR for it

Contributor Author


@Fridge003 Great! I'll create a new PR for the code refactoring of both the MLA & MHA FP4 token pools.
Meanwhile, could you please help me merge this PR first? It covers the functionality, and I think it's better to separate the code refactoring from the new features. Thanks~

Contributor Author


New PR #13547 has been submitted to refactor the FP4 token pools for both MLA and MHA.

@JackChuang
Contributor Author

Hi @Fridge003,
Per our discussion above, could you please help approve and merge this PR so that I can work directly on the main branch for the code refactoring of both MHA and MLA token pool fp4? Thank you~

@Fridge003
Collaborator

Fridge003 commented Nov 15, 2025

NV tests all passed
https://github.com/sgl-project/sglang/actions/runs/19177871109/job/55448358397?pr=12612

@Fridge003 Fridge003 merged commit 6d5e16f into sgl-project:main Nov 15, 2025
141 of 161 checks passed
@HanHan009527 HanHan009527 deleted the horenc/kv4_mha_on_main_release branch December 16, 2025 16:19
