
[varlen_attn for inference] add out variant#176015

Closed
liangel-02 wants to merge 22 commits into gh/liangel-02/17/base from gh/liangel-02/17/head

Conversation

@liangel-02
Contributor

@liangel-02 liangel-02 commented Feb 27, 2026

`aten/src/ATen/native/transformers/cuda/attention.cu`

  • Renamed `_flash_attention_forward` to `_flash_attention_forward_impl`. This now holds the core logic and takes an `optional<Tensor> out`.
  • `_flash_attention_forward` is the non-out variant: a thin wrapper that calls `_flash_attention_forward_impl` with `out=std::nullopt`.
  • `_flash_attention_forward_no_dropout_inplace` is the out variant and calls `_flash_attention_forward_impl` with a `Tensor& out`.
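The impl/wrapper split above can be sketched in plain Python (a structural illustration only — the function names and the doubling "kernel" are stand-ins, not the actual ATen code):

```python
from typing import Optional

def _forward_impl(x: list, out: Optional[list] = None) -> list:
    result = [v * 2.0 for v in x]   # stand-in for the attention kernel
    if out is None:
        return result               # allocate-and-return path
    out[:] = result                 # write into the caller's buffer
    return out

def forward(x):
    # Non-out variant: thin wrapper that forwards with out=None.
    return _forward_impl(x, out=None)

def forward_inplace(x, out):
    # Out variant: mutates the caller-provided `out` buffer.
    return _forward_impl(x, out=out)
```

The payoff of this shape is that the kernel logic lives in exactly one place, and the two public entry points differ only in who owns the output allocation.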

`aten/src/ATen/native/native_functions.yaml`

  • Registered a new op, `_flash_attention_forward_no_dropout_inplace`.

`torch/_meta_registrations.py`

  • Added a meta registration that calls `meta__flash_attention_forward` but does not return the `out` tensor.

`torch/nn/attention/varlen.py`

  • Added a public `varlen_attn_out` and a private custom op `_varlen_attn_out` with `mutates_args={"out"}`.

`test/test_varlen_attention.py`

  • Added the out variant to the existing tests.
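A common way to fold an out variant into existing tests is to preallocate a buffer, call the out variant, and compare against the regular op. A plain-Python sketch of that pattern (the op bodies are stand-ins, not the real tests):

```python
def attn(x):
    # Stand-in for the regular op: allocates and returns its result.
    return [v + 1.0 for v in x]

def attn_out(x, out):
    # Stand-in for the out variant: writes into the caller's buffer.
    out[:] = attn(x)
    return out

def check_out_variant(x):
    expected = attn(x)
    buf = [0.0] * len(x)
    returned = attn_out(x, buf)
    # The out variant must return the same buffer it mutated, and the
    # contents must match the regular op's result.
    return returned is buf and buf == expected
```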

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
@pytorch-bot

pytorch-bot bot commented Feb 27, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176015

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 98609ba with merge base 4bc9d7f:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions
Contributor

Attention! native_functions.yaml was changed

If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.



@github-actions
Contributor

github-actions bot commented Feb 27, 2026

Attention! One of PyTorch's C-stable API files was changed

You MUST NOT change existing function declarations in this file, as the header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version.



liangel-02 added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: 5104e01
Pull Request resolved: #176015
@liangel-02 liangel-02 requested a review from drisspg February 27, 2026 21:44
@liangel-02 liangel-02 changed the title from "add out variant" to "wip: add out variant" Feb 27, 2026
liangel-02 added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: 8a6bd6c
Pull Request resolved: #176015
liangel-02 added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: 5c837e5
Pull Request resolved: #176015
liangel-02 added a commit that referenced this pull request Mar 3, 2026
ghstack-source-id: b85f6db
Pull Request resolved: #176015
CUDA: _flash_attention_forward
tags: nondeterministic_seeded

- func: _flash_attention_forward_out_variant(Tensor(a!) out, Tensor query, Tensor key, Tensor value, Tensor? cum_seq_q, Tensor? cum_seq_k, SymInt max_q, SymInt max_k, float dropout_p, bool is_causal, bool return_debug_mask, *, float? scale=None, SymInt? window_size_left=None, SymInt? window_size_right=None, Tensor? seqused_k=None, Tensor? alibi_slopes=None, Tensor? page_table=None) -> (Tensor softmax_logsumexp, Tensor rng_state, Tensor unused, Tensor debug_attn_mask)
Contributor


Since this is not really an out op or a variant, let's actually change this: it is basically a new op that is a pseudo in-place op. Also, we probably don't need the debug_attn_mask, unused, or rng_state anymore, right? Since we don't plan to support dropout for this API and we aren't using debug_attn_mask?

Can we just drop them from the impl?

Contributor


Also, I know @albanD will hate this, but it does feel like the shortest path. As you can tell from the neighborhood, I have not been a very good steward of these ops and keeping them minimal. I want to make sure we don't get yelled at :)

Collaborator

@albanD albanD Mar 4, 2026


Why is this not called `_flash_attention_forward_(Tensor(a!) output, ....)`?

Contributor Author


Changing the name to `_flash_attention_forward_no_dropout_inplace` as discussed offline.

Contributor


Discussed with the boss Alban, right? If he is cool, I'm cool.

liangel-02 added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: 91a5618
Pull Request resolved: #176015
liangel-02 added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: 0101572
Pull Request resolved: #176015
pytorchmergebot pushed a commit that referenced this pull request Mar 7, 2026
pytorchmergebot added a commit that referenced this pull request Mar 7, 2026
This reverts commit f1e413e.

Reverted #176015 on behalf of https://github.com/zou3519 due to sorry I think this broke inductor rocm ([comment](#175897 (comment)))
@pytorchmergebot
Collaborator

@liangel-02 your PR has been reverted as part of the stack under #175897.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Mar 7, 2026
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #176723


pytorchmergebot pushed a commit that referenced this pull request Mar 8, 2026
pytorchmergebot added a commit that referenced this pull request Mar 10, 2026
This reverts commit 492c742.

Reverted #176015 on behalf of https://github.com/huydhn due to Sorry for reverting your change but a bunch of internal builds need to be updated to unblock this change D95758397 ([comment](#175924 (comment)))
@pytorchmergebot
Collaborator

@liangel-02 your PR has been reverted as part of the stack under #175924.

@liangel-02
Contributor Author

@liangel-02 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Differential Revision: [D95996399](https://our.internmc.facebook.com/intern/diff/D95996399)

@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #176723


pytorchmergebot pushed a commit that referenced this pull request Mar 11, 2026
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: dba65b3
Pull Request resolved: pytorch/pytorch#176015
