
[CPU] optimize flash_attn_varlen_func#15708

Merged
Kangyan-Zhou merged 25 commits into sgl-project:main from
mingfeima:pr_flash_attn_varlen_func
Jan 30, 2026

Conversation

@mingfeima
Collaborator

@mingfeima mingfeima commented Dec 24, 2025

Motivation

Provide an apples-to-apples counterpart of flash_attn_varlen_func on CPU, optimized with Intel AMX. Newly enabled models such as omni and sglang diffusion call flash_attn_varlen_func directly on CUDA devices, so a counterpart is needed on other devices.

Modifications

Add flash_attn.cpp to sgl-kernel; the kernel itself is largely the same as the second stage of extend attention.
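For reviewers unfamiliar with the varlen layout: the semantics the kernel implements can be illustrated with a small pure-Python reference (a hypothetical helper, single head, no causal mask, not the actual C++ code). Queries, keys, and values from all sequences are flattened along the token dimension, and `cu_seqlens` holds the cumulative per-sequence boundaries:

```python
import math

def attn_varlen_ref(q, k, v, cu_seqlens, scale):
    """Reference varlen attention for one head.

    q, k, v: lists of head-dim vectors, flattened over all sequences.
    cu_seqlens: cumulative sequence lengths, e.g. [0, 3, 5] for lengths 3 and 2.
    """
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        ks, vs = k[start:end], v[start:end]
        for qi in q[start:end]:
            # Scaled dot-product scores against this sequence's keys only.
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in ks]
            m = max(scores)  # subtract the max for numerical stability
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            out.append([sum(wi * vj[d] for wi, vj in zip(w, vs)) / z
                        for d in range(len(vs[0]))])
    return out
```

With uniform scores the weights are equal, so each output row is the mean of that sequence's values, which makes the boundary handling easy to spot-check.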

Accuracy Tests

Add test_flash_attn.py to the test suite.

Benchmarking and Profiling

The existing scaled_dot_product_attention from torch doesn't support varlen functionality, so we compare against flash_attn_varlen_func at BS=1, tested with the qwen3 omni shapes on a single socket of an Intel(R) Xeon(R) 6979P with 40 cores:

The result can be reproduced with test_flash_attn_varlen.py:

### T = 8160, H = 6, Hkv = 6, K = 72; torch.sdpa time 23.993 ms, flash_attn time 19.787 ms

The sglang kernel is slightly faster than the version we implemented in torch. It should still be possible to further optimize this kernel by reusing the packed key and value results across adjacent query blocks that belong to the same head; this is left as a TODO.
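A back-of-the-envelope sketch of that TODO (hypothetical scheduling code, not the actual C++ kernel): if K/V packing is hoisted out of the query-block loop, the number of pack operations per layer drops from `num_heads * num_q_blocks` to `num_heads`.

```python
def count_packs(num_heads, num_q_blocks, reuse_across_q_blocks):
    """Count K/V pack operations for a blocked attention schedule.

    reuse_across_q_blocks=False models the current kernel, which packs
    once per (head, query-block) pair; True models the proposed reuse,
    where adjacent query blocks of the same head share one packed K/V.
    """
    packs = 0
    for _head in range(num_heads):
        packed = False
        for _q_block in range(num_q_blocks):
            if reuse_across_q_blocks and packed:
                continue  # reuse the K/V already packed for this head
            packs += 1
            packed = True
    return packs
```

For the benchmarked shape (H = 6) with, say, 8 query blocks, this would cut packing work by a factor of the query-block count, at the cost of keeping one head's packed K/V resident across iterations.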

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@mingfeima mingfeima marked this pull request as draft December 24, 2025 02:10
@mingfeima mingfeima changed the title from "Pr flash attn varlen func" to "[CPU] optimize flash_attn_varlen_func" Dec 24, 2025
@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch from c4f7b38 to 3ed12d0 on December 24, 2025 02:30
@mingfeima mingfeima marked this pull request as ready for review December 24, 2025 02:31
@mingfeima mingfeima added intel cpu cpu backend performance optimization labels Dec 24, 2025
@mingfeima mingfeima marked this pull request as draft January 6, 2026 07:21
@mingfeima mingfeima marked this pull request as ready for review January 20, 2026 06:56

@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch 2 times, most recently from fcce838 to 1d4bebf on January 26, 2026 02:50
@mingfeima mingfeima mentioned this pull request Jan 27, 2026
@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch from 1d4bebf to 786910c on January 27, 2026 05:49
@Kangyan-Zhou Kangyan-Zhou merged commit 88f7759 into sgl-project:main Jan 30, 2026
89 of 100 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
blzheng pushed a commit to blzheng/sglang that referenced this pull request Feb 9, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

cpu cpu backend performance optimization intel run-ci sgl-kernel
