
[CPU] optimize flash_attn_varlen_func#15708

Merged
Kangyan-Zhou merged 25 commits into sgl-project:main from
mingfeima:pr_flash_attn_varlen_func
Jan 30, 2026

Conversation

@mingfeima
Collaborator

@mingfeima mingfeima commented Dec 24, 2025

Motivation

Provide an apples-to-apples counterpart of flash_attn_varlen_func on CPU, optimized with Intel AMX. Newly enabled models such as omni and sglang diffusion call flash_attn_varlen_func directly on CUDA devices, so a counterpart is needed on other devices.

Modifications

Add flash_attn.cpp to sgl-kernel; the kernel itself is largely the same as the second stage of extend attention.
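For reviewers unfamiliar with the varlen layout: the semantics the kernel implements can be illustrated with a small pure-Python reference (a hypothetical helper, single head, no causal mask, not the actual C++ code). Queries, keys, and values from all sequences are flattened along the token dimension, and `cu_seqlens` holds the cumulative per-sequence boundaries:

```python
import math

def attn_varlen_ref(q, k, v, cu_seqlens, scale):
    """Reference varlen attention for one head.

    q, k, v: lists of head-dim vectors, flattened over all sequences.
    cu_seqlens: cumulative sequence lengths, e.g. [0, 3, 5] for lengths 3 and 2.
    """
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        ks, vs = k[start:end], v[start:end]
        for qi in q[start:end]:
            # Scaled dot-product scores against this sequence's keys only.
            scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in ks]
            m = max(scores)  # subtract the max for numerical stability
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            out.append([sum(wi * vj[d] for wi, vj in zip(w, vs)) / z
                        for d in range(len(vs[0]))])
    return out
```

With uniform scores the weights are equal, so each output row is the mean of that sequence's values, which makes the boundary handling easy to spot-check.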

Accuracy Tests

Add test_flash_attn.py to the test suite.

Benchmarking and Profiling

The existing scaled_dot_product_attention from torch doesn't support varlen functionality, so we compare against flash_attn_varlen_func at BS=1, tested with the qwen3 omni shapes on a single socket of an Intel(R) Xeon(R) 6979P with 40 cores:

The result can be reproduced with test_flash_attn_varlen.py:

### T = 8160, H = 6, Hkv = 6, K = 72; torch.sdpa time 23.993 ms, flash_attn time 19.787 ms

The sglang kernel is slightly faster than the version we implemented in torch. It should still be possible to further optimize this kernel by reusing the packed key and value results across adjacent query blocks that belong to the same head; this is left as a TODO.
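A back-of-the-envelope sketch of that TODO (hypothetical scheduling code, not the actual C++ kernel): if K/V packing is hoisted out of the query-block loop, the number of pack operations per layer drops from `num_heads * num_q_blocks` to `num_heads`.

```python
def count_packs(num_heads, num_q_blocks, reuse_across_q_blocks):
    """Count K/V pack operations for a blocked attention schedule.

    reuse_across_q_blocks=False models the current kernel, which packs
    once per (head, query-block) pair; True models the proposed reuse,
    where adjacent query blocks of the same head share one packed K/V.
    """
    packs = 0
    for _head in range(num_heads):
        packed = False
        for _q_block in range(num_q_blocks):
            if reuse_across_q_blocks and packed:
                continue  # reuse the K/V already packed for this head
            packs += 1
            packed = True
    return packs
```

For the benchmarked shape (H = 6) with, say, 8 query blocks, this would cut packing work by a factor of the query-block count, at the cost of keeping one head's packed K/V resident across iterations.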

Checklist

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@mingfeima mingfeima marked this pull request as draft December 24, 2025 02:10
@mingfeima mingfeima changed the title from "Pr flash attn varlen func" to "[CPU] optimize flash_attn_varlen_func" Dec 24, 2025
@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch from c4f7b38 to 3ed12d0 on December 24, 2025 02:30
@mingfeima mingfeima marked this pull request as ready for review December 24, 2025 02:31
@mingfeima mingfeima added intel cpu cpu backend performance optimization labels Dec 24, 2025
@mingfeima mingfeima marked this pull request as draft January 6, 2026 07:21
@mingfeima mingfeima marked this pull request as ready for review January 20, 2026 06:56

@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch 2 times, most recently from fcce838 to 1d4bebf on January 26, 2026 02:50
@mingfeima mingfeima mentioned this pull request Jan 27, 2026
@mingfeima mingfeima force-pushed the pr_flash_attn_varlen_func branch from 1d4bebf to 786910c on January 27, 2026 05:49
@Kangyan-Zhou Kangyan-Zhou merged commit 88f7759 into sgl-project:main Jan 30, 2026
89 of 100 checks passed
charlesHsuGG pushed a commit to charlesHsuGG/sglang that referenced this pull request Jan 30, 2026
sfiisf pushed a commit to sfiisf/sglang that referenced this pull request Feb 5, 2026
blzheng pushed a commit to blzheng/sglang that referenced this pull request Feb 9, 2026
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026

Labels

cpu cpu backend performance optimization intel run-ci sgl-kernel
