Check KV4 compatibility with attention backends and add KV4 support to the attention_backend doc#14467
Merged — Fridge003 merged 2 commits into sgl-project:main on Dec 12, 2025
Conversation
Fridge003 reviewed these changes on Dec 5, 2025
Force-pushed from 0f97cf2 to b2a2c7f
Fridge003 approved these changes on Dec 9, 2025
Collaborator: /tag-and-rerun-ci keep rerunning
Force-pushed from b2a2c7f to fe51449
Contributor (Author): @Fridge003 I've fixed the code according to your comments, rebased onto origin/main, and resolved the conflicts. Please check again. Thank you~
Force-pushed from fe51449 to b580da2
Contributor (Author): I've fixed the failing stage-a-test-1 test. Please check again. Thanks.
Commits:
- … FP4 note: introduce FP4 KV cache support in the backend matrix; add a note on the FA4 + KV4 scenario. Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
- Add _handle_kv4_compatibility() to validate backend choices for KV4 scenarios: warn on potential edge-case incompatibilities, assert the correct decode_attention_backend for FA4 + MLA/MHA and non-FA4 + MLA/MHA setups, and raise an error if KV4 is used on non-CUDA platforms. Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Force-pushed from b580da2 to 03eacf3
Prozac614 pushed a commit to Prozac614/sglang referencing this pull request on Dec 17, 2025:
…o the attention_backend doc (sgl-project#14467) Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
YChange01 pushed a commit to YChange01/sglang referencing this pull request on Jan 13, 2026:
…o the attention_backend doc (sgl-project#14467) Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
Motivation
Prevent users from running KV4 with incompatible attention backends by clearly documenting the supported backends and enforcing runtime checks. This improves reliability and reduces runtime errors, since users can no longer accidentally run KV4 with an unsupported attention backend.
Modifications
Ensure the check runs after default settings are applied by placing it in server_args.py.
Description / Changes:
1. Backend documentation updates
• Added an FP4 KV cache column to the MLA/MHA backend table.
• Clarified which backend combinations are supported with FP4 KV caches.
2. ServerArgs updates
• Added _handle_kv4_compatibility() to check KV4 compatibility with attention backends at runtime.
• Logs warnings for potential edge-case incompatibilities.
• Asserts the correct decode_attention_backend for FA4 + MLA/MHA and non-FA4 + MLA/MHA setups.
• Raises an error if KV4 is used on non-CUDA platforms.
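The check described above can be sketched as follows. This is a minimal illustration, not SGLang's actual implementation: the function signature, the `KV4_DECODE_BACKENDS` set, and the string values for dtypes and backends are all assumptions made for the example; the real supported combinations are the ones listed in the attention_backend doc table.

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative placeholder set; the authoritative list of backends that
# support an FP4 KV cache lives in the attention_backend doc table.
KV4_DECODE_BACKENDS = {"fa4", "flashinfer", "triton"}


def handle_kv4_compatibility(kv_cache_dtype: str,
                             decode_attention_backend: str,
                             device: str) -> None:
    """Sketch of a KV4 compatibility check, run after defaults are resolved."""
    if kv_cache_dtype != "fp4":
        return  # nothing to validate for non-KV4 configurations
    if device != "cuda":
        # KV4 is CUDA-only in this sketch, mirroring the PR's hard error.
        raise ValueError("KV4 (FP4 KV cache) is only supported on CUDA platforms.")
    # Hard assertion for clearly unsupported backend choices.
    assert decode_attention_backend in KV4_DECODE_BACKENDS, (
        f"decode_attention_backend={decode_attention_backend!r} is not supported "
        f"with an FP4 KV cache; choose one of {sorted(KV4_DECODE_BACKENDS)}"
    )
    # Soft warning for combinations that may hit edge cases.
    if decode_attention_backend != "fa4":
        logger.warning(
            "KV4 with decode_attention_backend=%s may hit edge-case "
            "incompatibilities; see the attention_backend doc.",
            decode_attention_backend,
        )
```

A design point worth noting: running this after defaults are filled in means the check sees the backend the server will actually use, rather than a possibly-unset CLI value.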
Testing
The compatibility results were tested on B200 (sm100), using Qwen3-235B-A22B (MHA) and DeepSeek-R1-0528-FP4 (MLA).
Next (WIP)
Test KV4 with the FA3 and FlashMLA backends on sm90 to complete the table; I will send a follow-up PR for this.
Checklist