[Feat] QWen-1M context support[2/2]: Update block sparse attention backend#5949
Conversation
XAttention seems to provide better performance and lower overhead than MInference. Are you guys also looking into using something like XAttention?
Please resolve conflicts. Currently there are some restrictions for the DCA backend: 1. CUDA graph, 2. radix cache, 3. DP attention. We need to disable CUDA graph and radix cache automatically, and add an assertion that DP attention is disabled.
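For reference, a minimal sketch of what such a guard could look like during server-argument post-processing. The flag names mirror sglang's existing `--disable-cuda-graph`, `--disable-radix-cache`, and `--enable-dp-attention` options, but the helper function and the backend-name string are hypothetical, not this PR's actual code:

```python
def adjust_for_dual_chunk_attention(server_args):
    """Force-disable features the DCA backend cannot support yet.

    Hypothetical helper: attribute names follow sglang's ServerArgs flags,
    but this is a sketch of the requested behavior, not the PR's code.
    """
    if server_args.attention_backend != "dual_chunk_flash_attn":
        return
    if not server_args.disable_cuda_graph:
        print("DCA backend does not support CUDA graph; disabling it.")
        server_args.disable_cuda_graph = True
    if not server_args.disable_radix_cache:
        print("DCA backend does not support radix cache; disabling it.")
        server_args.disable_radix_cache = True
    # DP attention cannot be transparently turned off, so fail fast instead.
    assert not server_args.enable_dp_attention, (
        "The DCA backend is incompatible with --enable-dp-attention; "
        "please remove the flag."
    )
```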
@renjie0, this is a great proposal. However, while planning to integrate XAttention into sglang, I discovered that MInference already supports it (see https://github.com/microsoft/MInference/blob/main/minference/modules/xattention.py). Even so, I believe it's still worthwhile, so I'll proceed.
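For context, XAttention's core idea is to score each (query-block, key-block) tile of the attention map by the sum of its strided antidiagonal entries, then keep only the highest-scoring tiles. The toy sketch below illustrates just the scoring criterion; it materializes the full attention matrix for clarity, which the actual kernels (e.g., MInference's `xattention.py`) avoid, and all names here are illustrative:

```python
import torch

def antidiagonal_block_scores(q, k, block_size=64, stride=8):
    """Score attention blocks by strided antidiagonal sums (XAttention-style).

    q, k: [seq_len, head_dim] for a single head; seq_len is assumed to be
    divisible by block_size. Illustrative sketch only.
    """
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # [S, S]
    n = q.shape[0] // block_size
    blocks = attn.reshape(n, block_size, n, block_size).permute(0, 2, 1, 3)
    # Keep every `stride`-th antidiagonal (entries with (i + j) % stride == 0)
    # inside each block and sum them as that block's importance score.
    idx = torch.arange(block_size)
    mask = (idx[:, None] + idx[None, :]) % stride == 0
    return (blocks * mask).sum(dim=(-2, -1))  # [n_q_blocks, n_k_blocks]
```

Blocks would then be selected in descending score order until a cumulative-probability threshold is reached, and attention is computed only over the selected blocks, which is where the low overhead comes from.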
@FlamingoPg MInference is best suited for extremely long inputs. Per our benchmark, with Qwen2.5, performance gains only begin once the input exceeds 200k tokens, and those gains become pronounced past 600k tokens. Do we have any benchmark results for Qwen3-1M with this backend?
@yuan-luo To be honest, the current MInference backend is still at a very preliminary stage. It supports neither CUDA graph nor prefix cache. Perhaps other methods could achieve better efficiency.
@FlamingoPg Got it. Thanks for the information.
Motivation
Stacked PR:
[1/2]: #5847 (comment)
TODO: support CUDA graphs for the block sparse attention backend
Modifications
Checklist