[Feature] Implement sparse dual-chunk flash attention for Qwen-1M models. #5358

Closed
sighingnow wants to merge 1 commit into sgl-project:main from sighingnow:dev/dual-chunk-attn

Conversation

@sighingnow
Contributor

Motivation

Modifications

Checklist

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
m.impl("cutlass_mla_decode", torch::kCUDA, &cutlass_mla_decode);
m.def("cutlass_mla_get_workspace_size", &cutlass_mla_get_workspace_size);

m.def(
Collaborator

Can you please submit another PR for the sgl-kernel change separately? Thanks.

@zhyncs
Collaborator

zhyncs commented Apr 13, 2025

ref #5329

@FlamingoPg
Collaborator

Need a unit test for the convert_vertical_slash_indexes kernel.
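For reference, such a test might look roughly like the sketch below. This is a minimal smoke test only, assuming the kernel is exposed as sgl_kernel.convert_vertical_slash_indexes with an MInference-style signature (query/KV sequence lengths, sorted vertical and slash index tensors, the context size, and the two tile sizes) returning (block_count, block_offset, column_count, column_index); the argument order, names, and return values are assumptions, not the confirmed sgl-kernel API, and a real test would also compare against a pure-PyTorch reference of the index conversion.

    # Hypothetical smoke test; adjust the import and signature to the actual
    # sgl_kernel binding once the op lands.
    import pytest
    import torch

    sgl_kernel = pytest.importorskip("sgl_kernel")


    @pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
    def test_convert_vertical_slash_indexes_smoke():
        device = "cuda"
        batch, heads, seq_len = 1, 2, 256
        nnz_vertical, nnz_slash = 32, 16
        block_m, block_n = 64, 64

        # Per-request query/KV sequence lengths.
        q_seqlens = torch.full((batch,), seq_len, dtype=torch.int32, device=device)
        kv_seqlens = torch.full((batch,), seq_len, dtype=torch.int32, device=device)

        # Sorted vertical column indexes and slash (diagonal) offsets per head.
        vertical_indexes = torch.sort(
            torch.randint(0, seq_len, (batch, heads, nnz_vertical), device=device), dim=-1
        ).values.int()
        slash_indexes = torch.sort(
            torch.randint(0, seq_len, (batch, heads, nnz_slash), device=device), dim=-1
        ).values.int()

        block_count, block_offset, column_count, column_index = (
            sgl_kernel.convert_vertical_slash_indexes(
                q_seqlens, kv_seqlens, vertical_indexes, slash_indexes,
                seq_len, block_m, block_n,
            )
        )

        # Shape and range sanity checks on the per-query-block outputs.
        num_q_blocks = (seq_len + block_m - 1) // block_m
        assert block_count.shape == (batch, heads, num_q_blocks)
        assert column_count.shape == (batch, heads, num_q_blocks)
        assert block_offset.shape[:3] == (batch, heads, num_q_blocks)
        assert column_index.shape[:3] == (batch, heads, num_q_blocks)
        assert (block_count >= 0).all() and (column_count >= 0).all()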

@FlamingoPg
Collaborator

I will make another PR for sgl_kernel, and will merge dual-chunk flash attention in the next few days.
