[Feat] QWen-1M context support[2/2]: Update block sparse attention backend#5949

Merged
merrymercy merged 25 commits into sgl-project:main from FlamingoPg:blocksparse-attention-backend-2
Aug 7, 2025
Conversation

@FlamingoPg
Collaborator

Motivation

Stack PR:
[1/2]: #5847 (comment)

TODO: support CUDA graphs for the block sparse attention backend

Modifications

Checklist

@renjie0

renjie0 commented May 28, 2025

XAttention seems to provide better performance and lower overhead than MInference. Are you guys also looking into using something like XAttention?

@yizhang2077
Collaborator

Please resolve conflicts. Currently the DCA backend has some restrictions: 1. CUDA graph, 2. radix cache, 3. DP attention. We need to disable CUDA graph and radix cache automatically, and add an assertion that DP attention is disabled.
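The restriction handling described above could be sketched roughly as follows. This is a hypothetical, simplified sketch, not sglang's actual server-args code: `ServerArgs`, the field names, and the backend string `"dual_chunk_flash_attn"` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ServerArgs:
    # Hypothetical subset of server arguments for illustration only.
    attention_backend: str = "flashinfer"
    disable_cuda_graph: bool = False
    disable_radix_cache: bool = False
    enable_dp_attention: bool = False


def apply_backend_restrictions(args: ServerArgs) -> ServerArgs:
    """Auto-disable features the DCA backend does not support,
    and assert that DP attention is off."""
    if args.attention_backend == "dual_chunk_flash_attn":
        # CUDA graph and radix cache are silently disabled.
        args.disable_cuda_graph = True
        args.disable_radix_cache = True
        # DP attention cannot be auto-disabled safely, so fail loudly.
        assert not args.enable_dp_attention, (
            "DP attention is not supported with the dual chunk attention backend"
        )
    return args
```

The asymmetry (auto-disable vs. assert) mirrors the comment: features that can be safely turned off are, while an incompatible parallelism choice is surfaced as a hard error.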

@yizhang2077 yizhang2077 force-pushed the blocksparse-attention-backend-2 branch from 8d17d49 to 7fe283d Compare August 4, 2025 06:50
@yizhang2077 yizhang2077 changed the title [Feat][WIP] QWen-1M context support[2/2]: Update block sparse attention backend [Feat] QWen-1M context support[2/2]: Update block sparse attention backend Aug 4, 2025
@yizhang2077 yizhang2077 marked this pull request as ready for review August 4, 2025 08:22
@yizhang2077 yizhang2077 requested a review from kushanam as a code owner August 4, 2025 08:22
@yizhang2077 yizhang2077 force-pushed the blocksparse-attention-backend-2 branch 2 times, most recently from edf680f to 14295a8 Compare August 5, 2025 13:49
@yizhang2077 yizhang2077 force-pushed the blocksparse-attention-backend-2 branch from 14295a8 to 7c08c28 Compare August 5, 2025 13:53
@yizhang2077 yizhang2077 force-pushed the blocksparse-attention-backend-2 branch from 92aead9 to 86e188a Compare August 6, 2025 09:20
@merrymercy merrymercy merged commit b7cd743 into sgl-project:main Aug 7, 2025
9 of 85 checks passed

@yuan-luo
Collaborator

yuan-luo commented Aug 23, 2025

> XAttention seems to provide better performance and lower overhead than MInference. Are you guys also looking into using something like XAttention?

@renjie0, this is a great proposal. However, while planning to integrate XAttention into sglang, I discovered that MInference already supports it (see https://github.com/microsoft/MInference/blob/main/minference/modules/xattention.py). Even so, I believe it’s still worthwhile, so I’ll proceed.

@yuan-luo
Collaborator

@FlamingoPg MInference is best suited for extremely long inputs. Per our benchmark with Qwen2.5, performance gains only begin once the input exceeds 200k tokens, and they become pronounced past 600k tokens. Do we have any benchmark results on Qwen3-1M for this backend?

@FlamingoPg
Collaborator Author

@yuan-luo To be honest, the current MInference integration is still in a very preliminary stage. It doesn't support CUDA graph or prefix caching. Other methods might achieve better efficiency.

@yuan-luo
Collaborator

yuan-luo commented Aug 23, 2025

> @yuan-luo To be honest, the current minference is still in a very preliminary stage. It doesn't support cudagraph and prefix cache. Perhaps other methods could achieve better efficiency.

@FlamingoPg Got it. Thanks for the information.

MahmoudAshraf97 pushed a commit to MahmoudAshraf97/sglang that referenced this pull request Sep 8, 2025