[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 #149282
eqy wants to merge 31 commits into pytorch:main

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149282
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit 1a19605 with merge base d7a855d:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchmergebot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased ac66884 to f7c76b8
aten/src/ATen/native/cudnn/MHA.cpp
Outdated
The update method for mhagraphcache should probably use perfect forwarding where the update method is defined, instead of taking an lvalue reference. Throughout the file, extra copies should be removed as well.
- mhagraphcache.update(key, mha_graph);
+ mhagraphcache.update(key, std::move(mha_graph));
sm90, sm100

Should we guard against consumer GPUs? I guess that's handled in the dispatch.
aten/src/ATen/native/cudnn/MHA.cpp
Outdated
Maybe add an NB comment: the SDPA API layout is transposed vs. cuDNN's.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: inductor / linux-jammy-cpu-py3.9-gcc11-inductor / build. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 4 checks: pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable); inductor / linux-jammy-cpu-py3.9-gcc11-inductor / build; trunk / linux-jammy-rocm-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4); Meta Internal-Only Changes Check. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…st priority bump for `sm90`, `sm100` (pytorch#149282)

cleanup tuple/tensor boilerplate in cuDNN SDPA, preparation for nested/ragged tensor backward

Pull Request resolved: pytorch#149282
Approved by: https://github.com/drisspg
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Opt-in for now, but basically uses the variable-sequence-length/ragged path for the common case of BSHD layout to avoid recompiling for different sequence lengths. Built on top of #149282. Tested using a primitive fuzzer; seems at least as stable as the default path (with recompilation) on B200 (50000+ cases tested without any failures).

Pull Request resolved: #155958
Approved by: https://github.com/drisspg
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben