[FlexAttention] Enforce Q,K,V memory layouts for fp8 flex attention to avoid perf degradation #153357
danielvegamyhre wants to merge 7 commits into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/153357
Note: Links to docs will display an error until the docs builds have been completed. ⏳ 1 Pending, 2 Unrelated Failures. As of commit af8fe6a with merge base 27e9d9b:
FLAKY - The following job failed but was likely due to flakiness present on trunk.
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
cc @drisspg for review. I debated between validation vs. enforcement for this fix; there are pros/cons to both, but I felt enforcement would be a more seamless UX.
drisspg
left a comment
I am wishy washy on whether we want to do this under the hood a lil magically...
That being said, this fix shouldn't be needed on sm100 and newer, which have TT support. Can you make sure to limit this change to sm90 devices?
@drisspg I am as well. We could just add validation and require the user to do the transpose themselves, so they are explicitly aware of the layout of their tensors and there are no surprises? It does add some friction to using the API, though.
Good to know - updated.
torch/nn/attention/flex_attention.py
Outdated
is this going to work for amd?
I updated it to also check torch.cuda.is_available() to avoid an error on AMD, but no this memory layout enforcement will not be applied on any AMD hardware. I can run follow up tests to check perf on AMD if we have access to any AMD hardware that supports fp8 gemms?
torch/nn/attention/flex_attention.py
Outdated
```python
torch.cuda.is_available()
and torch.cuda.get_device_capability("cuda") >= (10, 0)
```

Suggested change:

```python
torch.version.cuda
and torch.cuda.get_device_capability("cuda") >= (10, 0)
```
I ran some tests on the AMD machine that I'm on:

```python
>>> torch.cuda.is_available()
True
>>> torch.version.cuda
>>> torch.version.cuda is None
True
>>> torch.version.hip
'6.3.42131-fa1d09cbd'
>>> torch.cuda.get_device_capability("cuda")
(9, 4)
```

So I think you need to check torch.version.hip to be able to reliably tell if you're on an AMD machine.
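To make the vendor check above concrete: on ROCm builds of PyTorch, `torch.version.cuda` is `None` while `torch.version.hip` is set, and the reverse holds on CUDA builds. A minimal sketch of the detection logic (the `gpu_vendor` helper is hypothetical, not code from the PR; the CUDA version string is illustrative):

```python
def gpu_vendor(version_cuda, version_hip):
    """Classify the GPU backend from the torch.version fields.

    On NVIDIA builds torch.version.cuda is a version string and
    torch.version.hip is None; on AMD/ROCm builds it is the other
    way around; on CPU-only builds both are None.
    """
    if version_cuda is not None:
        return "nvidia"
    if version_hip is not None:
        return "amd"
    return "cpu"

# Values as observed on the AMD machine above:
print(gpu_vendor(None, "6.3.42131-fa1d09cbd"))  # amd
# A hypothetical NVIDIA build:
print(gpu_vendor("12.4", None))                 # nvidia
```

This is why `torch.cuda.is_available()` alone is not enough: it returns `True` on ROCm as well, since the HIP backend is exposed through the `torch.cuda` namespace.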
Thanks for testing this. Seems like we could just check that torch.version.cuda is not None to ensure we're on NVIDIA? Updated the code with this check.
davidberard98
left a comment
lgtm as long as @drisspg is fine with "forcing explicit transpose" instead of "erroring/warning"
test/inductor/test_flex_attention.py
Outdated
```python
l_block_mask_full_q_num_blocks = L_block_mask_full_q_num_blocks
l_block_mask_full_q_indices = L_block_mask_full_q_indices

get_device_capability = torch.cuda.get_device_capability('cuda'); get_device_capability = None
```
I am surprised this shows up in the graph module
Should probably be marked as a constant, yea. although it will get traced out in tracing so only affects pre grad.
Weird, it shows up in the graph module when I use:

```python
is_sm_100 = torch.cuda.get_device_capability("cuda") == (10, 0)
```

However, it doesn't show up when I do this:

```python
should_enforce_mem_layout = (
    gemm_precision in fp8_dtypes
    and torch.version.cuda is not None
    and torch.cuda.get_device_capability("cuda") >= (8, 9)
    and torch.cuda.get_device_capability("cuda") < (10, 0)
)
```

Using the latter approach now.
I'm fine with enforcing for now. Your perf benchmark above includes the transpose, right?
Yep, it includes the transpose. Pre-transposed it was slightly faster (~294us IIRC).
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513 feel free to re-merge if this was a mistake" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
@danielvegamyhre your PR has been successfully reverted.
…ention to avoid perf degradation (#153357)" This reverts commit 881a598. Reverted #153357 on behalf of https://github.com/jeanschmidt due to: "Might have introduced regressions in rocm testing for main: https://github.com/pytorch/pytorch/actions/runs/15035410497/job/42257000513. Feel free to re-merge if this was a mistake."
Updated test to handle the different graph module produced for AMD. |
@pytorchbot merge -f "test failures unrelated, see comments" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Fixes #147336
Context
NCU analysis of the fp8 flex attention perf issue in #147336 showed an unexpected increase in shared memory access bank conflicts when loading the V tensor from HBM to SRAM.
Bringing this to the attention of Triton developer @davidberard98, he identified the memory layout of the tensor in HBM as the cause of non-pipelined loads into SRAM, and hence the slowdown.
To summarize:
In flex attention, when performing the FP8 GEMM `softmax_scores @ V`, the right operand V must be in column-major memory layout. However, the `tl.load` of V blocks from HBM to SRAM cannot be pipelined if the V tensor isn't already column-major in HBM, leading to substantial performance degradation. This is because Triton does not perform async copies with the `cp.async` PTX instruction if the number of contiguous bytes is less than 4 (see here). i.e., when loading 4 bytes of contiguous data from a tensor stored row-major in HBM, we have to perform 4 separate non-contiguous writes to SRAM to place those bytes in their new locations in the col-major layout in SRAM. Thus the load is not a candidate for pipelining with `cp.async`, and instead just moves data to registers and then performs a series of single-byte stores.
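To make the layout condition concrete: "column-major" for the trailing (S, D) matrix of V means the sequence dimension is the one with stride 1. A minimal stride-based check (a hypothetical helper for illustration, not code from this PR):

```python
def is_col_major_2d(shape, strides):
    """True if the trailing 2-D matrix described by (shape, strides)
    is column-major: walking down a column (the row index) is
    contiguous, and moving across columns jumps by the row count."""
    rows = shape[-2]
    s_row, s_col = strides[-2], strides[-1]
    return s_row == 1 and s_col == rows

# A 4x8 matrix stored row-major vs. column-major:
assert not is_col_major_2d((4, 8), (8, 1))  # row-major: rows contiguous
assert is_col_major_2d((4, 8), (1, 4))      # column-major: columns contiguous
```

When V fails this check, each 4-byte chunk of a column is scattered across the row-major storage, which is exactly the sub-4-contiguous-bytes case where Triton falls back from `cp.async` to register moves and single-byte stores.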
Fix summary
For fp8 flex attention on NVIDIA sm89 through sm99 devices, enforce a column-major memory layout for V under the hood (transposing if necessary) before the GEMM, so the `tl.load` of V blocks can be pipelined with `cp.async`. The enforcement is skipped on sm100 and newer, which have TT support, and on AMD/ROCm (detected via `torch.version.cuda is None`).
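The enforcement idea can be sketched as follows. This is a paraphrase of the approach discussed in the review thread, not the PR's exact code; `maybe_make_col_major` is a hypothetical name, and the real change additionally gates on fp8 dtypes and device capability:

```python
import torch

def maybe_make_col_major(v: torch.Tensor) -> torch.Tensor:
    """Return v with its trailing (S, D) matrix in column-major layout,
    copying only if it is not already laid out that way."""
    if v.stride(-2) == 1:
        # Sequence dim already contiguous: nothing to do.
        return v
    # transpose -> contiguous -> transpose back: the logical shape and
    # values are unchanged, but the underlying storage becomes
    # column-major in the trailing (S, D) matrix.
    return v.transpose(-2, -1).contiguous().transpose(-2, -1)

v = torch.randn(2, 4, 128, 64)   # (B, H, S, D), row-major by default
v_cm = maybe_make_col_major(v)
assert v_cm.shape == v.shape     # logical shape unchanged
assert v_cm.stride(-2) == 1      # storage now column-major in (S, D)
assert torch.equal(v_cm, v)      # values unchanged
```

Because the check is stride-based, users who pre-transpose their V tensor pay nothing extra; everyone else pays one copy, which the benchmark below shows is still a large net win for fp8.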
Benchmarks
Rerunning the repro, we see the fp8 runtime reduced from 120% of the bf16 runtime to 76% of it.
Before fix:
After fix:
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov