
[Flex Attn][CPU] support flash decoding for cpu#159835

Open
Valentine233 wants to merge 32 commits into main from flash_decoding_cpu

Conversation

Collaborator

@Valentine233 Valentine233 commented Aug 5, 2025

Description:

  1. Support flash decoding in `CppFlexAttentionTemplate`. We prefer flash decoding over flash attention when the query length is 1.
  2. For flash decoding, we add a kernel option `PARTITION_SIZE` to define the partition size used when parallelizing over the KV length dimension. The default value is 128, which should be a multiple of the KV cache block size for flash decoding to be used.
  3. As mentioned in "Fix large_tensor_test skipping cpu" #158617, the flex_attn UTs for the CPU backend were disabled because of their long duration. Here we re-enable them on CPU-only machines. (Already merged in "Enable XPU path for FlexAttention" #143553)
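The flow described in point 2 (split the KV length into `PARTITION_SIZE` chunks, compute a partial output and logsumexp per chunk, then merge the partials) can be sketched in plain Python. This is an illustrative model of the flash-decoding algorithm, not the PR's C++ template; the scalar "vectors" and function names are simplifications for readability.

```python
import math

def flash_decoding_1d(q, keys, values, partition_size=4):
    """Illustrative flash decoding for a single query (qsize == 1).

    Splits the KV length into partitions, computes a partial
    attention output and a logsumexp per partition, then merges
    the partials into the exact full-softmax result.
    """
    partials = []  # (logsumexp, partial_output) per partition
    for start in range(0, len(keys), partition_size):
        ks = keys[start:start + partition_size]
        vs = values[start:start + partition_size]
        scores = [q * k for k in ks]              # "dot product" in 1-D
        m = max(scores)                           # local max for stability
        exps = [math.exp(s - m) for s in scores]
        denom = sum(exps)
        out = sum(e * v for e, v in zip(exps, vs)) / denom
        partials.append((m + math.log(denom), out))
    # Merge: weight each partial output by its share of the global
    # softmax mass, recovered from the per-partition logsumexp.
    g = max(lse for lse, _ in partials)
    weights = [math.exp(lse - g) for lse, _ in partials]
    total = sum(weights)
    return sum(w * o for w, (_, o) in zip(weights, partials)) / total

def attention_1d(q, keys, values):
    """Reference: single-pass softmax attention over the full KV length."""
    scores = [q * k for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)
```

The merge step is why the partitioned computation is exact rather than approximate: the per-partition logsumexp carries enough information to renormalize the partial outputs.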

Performance:

Here are the E2E results for Llama3.1-8B decoding, validated on a GNR machine with 6 NUMA nodes, where we see E2E speedups ranging from 113.82% to 121.39%.

| Data Type | Input/Output tokens | Batch Size | W/O Flash Decoding (tokens/s) | With Flash Decoding (tokens/s) | Speedup |
| --- | --- | --- | --- | --- | --- |
| BF16 | 2016/32 | 25 | 892.196 | 1083.073 | 121.39% |
| FP16 | 2016/32 | 25 | 879.541 | 1015.593 | 115.47% |
| BF16 | 1024/128 | 30 | 1291.349 | 1529.251 | 118.42% |
| FP16 | 1024/128 | 30 | 1294.228 | 1473.049 | 113.82% |

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @chenyang78

@pytorch-bot

pytorch-bot bot commented Aug 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159835

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5d66b9f with merge base d67b279:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Valentine233
Collaborator Author

Valentine233 commented Aug 6, 2025

@jianan-gu @CaoE Please help review, thanks~

@CaoE CaoE added the ciflow/trunk (Trigger trunk jobs on your pull request) label Aug 12, 2025
)
return self._template_from_string(FLEX_ATTENTION_TEMPLATE).render(**options)
if (
query.data.data.layout.size[2] == 1
Contributor

we may double check the condition for selecting the flash decoding path, ref: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/flex/flex_decoding.py#L34

Collaborator Author

Thanks! A function to choose flex template is added.

SPARSE_KV_BLOCK_SIZE = V.graph.sizevars.guard_int(SPARSE_KV_BLOCK_SIZE)
SPARSE_Q_BLOCK_SIZE = V.graph.sizevars.guard_int(SPARSE_Q_BLOCK_SIZE)
# In flash decoding, the partition size of doing the parallelism on KV length dim
PARTITION_SIZE = kernel_options.get("PARTITION_SIZE", 128)
Collaborator

Can we add more PARTITION_SIZE values for testing?

Collaborator

Why does flash decoding need PARTITION_SIZE to be set, instead of automatically selecting a suitable PARTITION_SIZE?

Collaborator Author

Thanks, the UT for PARTITION_SIZE is added.
Yes, it is possible to automatically select PARTITION_SIZE, which depends on PAGE_SIZE, the input shape, and the number of threads. The same applies to PAGE_SIZE, which is currently a fixed value. We can do a round of tuning for PAGE_SIZE and PARTITION_SIZE as future work.
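The auto-selection the author describes could look roughly like the following sketch. This is a hypothetical heuristic, not the PR's logic: it picks the largest candidate partition size that is a multiple of the KV-cache page size while still producing enough partitions that `batch * heads * num_partitions` covers the available threads.

```python
def pick_partition_size(k_seq_len, page_size, num_threads,
                        batch_size, num_heads, default=128):
    """Hypothetical heuristic for auto-selecting PARTITION_SIZE.

    A sketch of the trade-off only: the partition size must stay a
    multiple of the KV cache page size, and it should be small
    enough that splitting the KV length actually adds parallelism
    (batch * heads * num_partitions >= num_threads).
    """
    # Try the largest candidate first to keep per-partition work high.
    candidates = [c for c in (512, 256, 128, 64, 32) if c % page_size == 0]
    for size in candidates:
        num_partitions = -(-k_seq_len // size)  # ceiling division
        if batch_size * num_heads * num_partitions >= num_threads:
            return size
    # No candidate saturates the threads: take the smallest aligned
    # size, or fall back to the default when none is aligned.
    return candidates[-1] if candidates else default
```

With many free threads the heuristic shrinks the partition so the KV split yields more parallel work items; with few threads it keeps partitions large to reduce merge overhead.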

{{kernel.kernel_name}}_conditional_data_ptr(logits, logits_reduced) + token_num,
v_addr,
tmp_out,
false);
Contributor

we may also need to add back the need_pack check, depending on the qsize threshold mentioned below.

Collaborator Author

We only let qsize=1 enter flash decoding for now, and for this case we do not need packing.

@Valentine233 Valentine233 force-pushed the flash_decoding_cpu branch 2 times, most recently from cbfa6e4 to b08d898 Compare August 25, 2025 06:04
@Valentine233 Valentine233 requested review from CaoE, drisspg, jansel and jianan-gu and removed request for jianan-gu August 25, 2025 06:57
@Valentine233 Valentine233 changed the title [Flex Attn][CPU] support flash decoding for cpu [PyTorch2.9 Feature][Flex Attn][CPU] support flash decoding for cpu Aug 25, 2025
@Valentine233
Collaborator Author

@ZainRizvi Hi, could you please help check if the UT duration is acceptable with this PR? Previously reverted in #158617.

@Valentine233 Valentine233 marked this pull request as ready for review August 25, 2025 07:45
@soulitzer soulitzer added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Aug 26, 2025
@jianan-gu
Contributor

Hi @drisspg,
We are adding flash decoding for inductor CPU backend (and also UT changes mentioned in #158617 (comment)) , could you kindly help review? Thanks!

@Valentine233 Valentine233 force-pushed the flash_decoding_cpu branch 2 times, most recently from 87dcbad to 1746bb2 Compare August 27, 2025 05:54
self.partition_size % self.kv_block_size == 0
and q_seq_len == 1
and num_threads > q_batch_size * q_num_heads
and k_seq_len / q_batch_size >= max(self.partition_size * 2, 512)
Collaborator

Add comments to explain this formula?
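For readers, the condition in the snippet above can be restated as a standalone annotated predicate. This is one reading of the heuristic, not an authoritative explanation; the thresholds (the `* 2` factor and `512`) are taken directly from the snippet.

```python
def use_flash_decoding(partition_size, kv_block_size, q_seq_len,
                       num_threads, q_batch_size, q_num_heads, k_seq_len):
    """Annotated restatement of the flash-decoding selection condition."""
    return (
        # Partitions must align with KV cache blocks, so each
        # partition reads whole blocks of the paged KV cache.
        partition_size % kv_block_size == 0
        # Flash decoding only targets single-token (decode) queries.
        and q_seq_len == 1
        # Batch x heads alone cannot keep all threads busy, so
        # splitting the KV length adds useful parallelism.
        and num_threads > q_batch_size * q_num_heads
        # The KV length per batch element is long enough to yield at
        # least two partitions (and at least 512 tokens) per split,
        # so the extra reduction/merge pass pays off.
        and k_seq_len / q_batch_size >= max(partition_size * 2, 512)
    )
```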

def score_mod(score, b, h, m, n):
return score * 2

self.run_test_with_paged_attention(
Collaborator

Add mask_mod tests?

@Valentine233
Collaborator Author

Valentine233 commented Jan 20, 2026

@drisspg @malfet @jansel
Hi, this feature is planned to target PyTorch 2.11. The PR has been rebased; please help review again!


Labels

ciflow/inductor · ciflow/trunk (Trigger trunk jobs on your pull request) · module: inductor · open source · topic: not user facing (topic category) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
