[Submodule] Turning flash-attention integration into 3rd party submod (#144120) #146372
drisspg wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146372
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure
As of commit 16e81a4 with merge base 651e6aa:
NEW FAILURE - The following job has failed:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D68502879
Attention! native_functions.yaml was changed. If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs: one that adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info. Caused by:
Attention! One of PyTorch's C-stable API files was changed. You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version. Caused by:
Force-pushed from 9741ff1 to d98ddeb
Force-pushed from d98ddeb to 9d82be5
[Submodule] Turning flash-attention integration into 3rd party submod (pytorch#144120) (pytorch#146372)

Summary:
Pull Request resolved: pytorch#146372
Pull Request resolved: pytorch#144120

# Summary
### Sticky points
CUDA-graph RNG handling has changed / deviated from the original implementation. We are left with a dangling 'offset' value and confusing naming due to BC.
## Dependencies
- Flash PR: Dao-AILab/flash-attention#1419
### Other Points
- The BC linter complains about losing generate.py and its functions, which is not a real BC surface.

cc albanD

imported-using-ghimport

Test Plan: Imported from OSS

Building in dev:
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output`
Running nm on the resulting .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```
Reviewed By: vkuzo
Differential Revision: D68502879
Pulled By: drisspg
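Beyond the symbol check above, a quick end-to-end smoke test can confirm that the flash kernels built from the submodule still dispatch. This is a minimal sketch, not part of the PR's test plan; it assumes a CUDA build with the FlashAttention backend compiled in, and the shapes and dtype are arbitrary:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# Assumes a CUDA build with the FlashAttention backend compiled in; restricting
# dispatch to the flash backend makes the call fail loudly if those kernels are
# missing or the inputs are unsupported.
q = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 4, 128, 64])
```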
Force-pushed from 9d82be5 to 3e6fe14
Force-pushed from 3e6fe14 to f77f8a9
Force-pushed from f77f8a9 to 4753c31
Force-pushed from 4753c31 to 598dd57
Force-pushed from a94fd44 to 435cd14
Force-pushed from 435cd14 to 9a51d00
Force-pushed from 9a51d00 to e8dab90
Force-pushed from e8dab90 to 16e81a4
@pytorchbot merge -i (Initiating merge automatically since Phabricator Diff has merged, merging with -i because oss signals were bypassed internally)
Merge started. Your change will be merged while ignoring the following 2 checks: pull / unstable-linux-focal-cuda12.4-py3.10-gcc9-sm89-xfail / build, pull / linux-jammy-py3.9-gcc11 / test (backwards_compat, 1, 1, linux.2xlarge). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
### Summary
#146372 changed the op signature of `_scaled_dot_product_flash_attention`, and as a consequence DTensor needs to change its sharding defined at https://github.com/pytorch/pytorch/blob/40ad5e01dff05c7d64e070fb01683820e678f788/torch/distributed/tensor/_ops/_matrix_ops.py#L232
### Test
`pytest test/distributed/tensor/test_attention.py`
### Follow-up
It is still unclear why the CP unit tests were not run on the original PR, which is BC-breaking.
Pull Request resolved: #148125
Approved by: https://github.com/tianyu-l, https://github.com/fegin
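For reference, the changed ATen signature that the DTensor sharding rule now has to match can be inspected directly from Python. A minimal sketch, assuming a build that already includes #146372 (the exact output names depend on the installed PyTorch version):

```python
import torch

# Print the current schema of the flash-attention SDPA op. On builds that
# include #146372, the two trailing RNG outputs are a packed seed+offset
# tensor and a now-unused placeholder, instead of separate seed/offset tensors.
op = torch.ops.aten._scaled_dot_product_flash_attention.default
print(op._schema)
```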
[Submodule] Turning flash-attention integration into 3rd party submod (#144120) (#146372)
Reviewed By: vkuzo
Differential Revision: D68502879
Pulled By: drisspg
Pull Request resolved: #146372
Approved by: https://github.com/jbschlosser
pytorch/pytorch#146372 changes the flash attention API.
```
// [Note] BC breaking change to flash seed/offset
// Previously: Used separate tensors for philox_seed and philox_offset, sometimes on CPU, sometimes on CUDA
// FlashAttention change: Now uses a single uint64_t[2] tensor on device containing both seed and offset
// Implementation: Renamed "seed" → "rng_state" (contains both seed+offset) and "offset" → "_unused"
```
~~In the nvFuser API, `SdpaFwdOp` now returns a `rng_state` and `SdpaBwdOp` expects a `rng_state`; `_unused` is ignored.~~
In the nvFuser API, depending on the torch version, `philox_seed` is now a `uint64_t[2]` tensor and `philox_offset` is an empty `uint64_t` tensor on device. I chose to keep the same semantics as PyTorch, since that avoids packing/unpacking PyTorch outputs during testing, compared with splitting PyTorch's `rng_state` output into `philox_seed` and `philox_offset`. The latter approach would keep the dimensions of both tensors the same, but since the `device` and `dtype` change, they would still need to be created separately for the two cases. Keeping PyTorch's semantics also makes it easier to switch to the single-parameter version in the future if desired.

Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
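For illustration, a sketch of what the packed RNG state looks like from the PyTorch side. This assumes a CUDA build with FlashAttention available; the output positions (indices 6 and 7) and the expected shapes are taken from the note above, not guaranteed by any public API:

```python
import torch

# Assumes a CUDA build with the FlashAttention backend available.
def rand_qkv():
    return torch.randn(2, 4, 128, 64, device="cuda", dtype=torch.float16)

q, k, v = rand_qkv(), rand_qkv(), rand_qkv()

# dropout_p > 0 so the kernel actually consumes RNG state.
outs = torch.ops.aten._scaled_dot_product_flash_attention(
    q, k, v, dropout_p=0.1, is_causal=False, return_debug_mask=False
)

# Per the note above: the former philox_seed output (index 6 here, an
# assumption about output ordering) now packs seed and offset into a single
# device tensor, and the former philox_offset output (index 7) is unused.
rng_state, unused = outs[6], outs[7]
print(rng_state.shape, rng_state.device)  # expected: torch.Size([2]) on cuda
print(unused.numel())                     # expected: 0 (empty placeholder)
```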