[Feature] Implement sparse dual-chunk flash attention for Qwen-1M models. #5358

Closed
sighingnow wants to merge 1 commit into sgl-project:main from sighingnow:dev/dual-chunk-attn

Conversation

@sighingnow
Contributor

Motivation

Modifications

Checklist

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
m.impl("cutlass_mla_decode", torch::kCUDA, &cutlass_mla_decode);
m.def("cutlass_mla_get_workspace_size", &cutlass_mla_get_workspace_size);

m.def(
Collaborator

Can you please submit another PR for the sgl-kernel change separately? Thanks.

@zhyncs
Collaborator

zhyncs commented Apr 13, 2025

ref #5329

@FlamingoPg
Collaborator

Need a unit test for the convert_vertical_slash_indexes kernel.
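For reference, such a test might look roughly like the sketch below. This is a minimal smoke test only, assuming the kernel is exposed as sgl_kernel.convert_vertical_slash_indexes with an MInference-style signature (query/KV sequence lengths, sorted vertical and slash index tensors, the context size, and the two tile sizes) returning (block_count, block_offset, column_count, column_index); the argument order, names, and return values are assumptions, not the confirmed sgl-kernel API, and a real test would also compare against a pure-PyTorch reference of the index conversion.

    # Hypothetical smoke test; adjust the import and signature to the actual
    # sgl_kernel binding once the op lands.
    import pytest
    import torch

    sgl_kernel = pytest.importorskip("sgl_kernel")


    @pytest.mark.skipif(not torch.cuda.is_available(), reason="requires CUDA")
    def test_convert_vertical_slash_indexes_smoke():
        device = "cuda"
        batch, heads, seq_len = 1, 2, 256
        nnz_vertical, nnz_slash = 32, 16
        block_m, block_n = 64, 64

        # Per-request query/KV sequence lengths.
        q_seqlens = torch.full((batch,), seq_len, dtype=torch.int32, device=device)
        kv_seqlens = torch.full((batch,), seq_len, dtype=torch.int32, device=device)

        # Sorted vertical column indexes and slash (diagonal) offsets per head.
        vertical_indexes = torch.sort(
            torch.randint(0, seq_len, (batch, heads, nnz_vertical), device=device), dim=-1
        ).values.int()
        slash_indexes = torch.sort(
            torch.randint(0, seq_len, (batch, heads, nnz_slash), device=device), dim=-1
        ).values.int()

        block_count, block_offset, column_count, column_index = (
            sgl_kernel.convert_vertical_slash_indexes(
                q_seqlens, kv_seqlens, vertical_indexes, slash_indexes,
                seq_len, block_m, block_n,
            )
        )

        # Shape and range sanity checks on the per-query-block outputs.
        num_q_blocks = (seq_len + block_m - 1) // block_m
        assert block_count.shape == (batch, heads, num_q_blocks)
        assert column_count.shape == (batch, heads, num_q_blocks)
        assert block_offset.shape[:3] == (batch, heads, num_q_blocks)
        assert column_index.shape[:3] == (batch, heads, num_q_blocks)
        assert (block_count >= 0).all() and (column_count >= 0).all()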

@FlamingoPg
Collaborator

I will make another PR for sgl_kernel, and will merge dual-chunk flash attention in the next few days.
