
[ROCm] SDPA fix mem fault when dropout is enabled#154864

Open
alugorey wants to merge 4 commits into pytorch:main from alugorey:sdpa_mem_fault_dropout

Conversation

@alugorey (Contributor) commented Jun 2, 2025

Fixes an issue that exhibited a device-side memory access fault due to incorrect tensor lifetime management.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @dllehr-amd

@pytorch-bot commented Jun 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154864

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 6 Unrelated Failures

As of commit a5a6140 with merge base ed0b1fe:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jeffdaily (Collaborator) left a comment:

Why did you add template <int Version = 2>?

@soulitzer added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jun 2, 2025
@alugorey alugorey force-pushed the sdpa_mem_fault_dropout branch from 6ca22cc to dd7dfe9 Compare June 20, 2025 22:02
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Jun 20, 2025
@jeffdaily jeffdaily changed the title Sdpa mem fault dropout [ROCm] SDPA fix mem fault when dropout is enabled Jun 20, 2025
@pytorch-bot pytorch-bot bot added the module: rocm AMD GPU support for Pytorch label Jun 20, 2025
jeffdaily
jeffdaily previously approved these changes Jun 20, 2025
@jeffdaily (Collaborator) left a comment:

approved, pending CI

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Jun 20, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Jun 20, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 10, 2025
@alugorey alugorey force-pushed the sdpa_mem_fault_dropout branch from ca719d8 to 9c02e37 Compare August 10, 2025 03:16
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 11, 2025
@alugorey alugorey force-pushed the sdpa_mem_fault_dropout branch from 9c02e37 to 22ff165 Compare August 11, 2025 22:36
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 11, 2025
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Aug 12, 2025
@jeffdaily (Collaborator) left a comment:

hipify changes to torch/csrc/Module.cpp were committed. Please revert those.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 12, 2025
jeffdaily
jeffdaily previously approved these changes Aug 12, 2025
@jeffdaily jeffdaily added release notes: rocm mandatorylabel ciflow/rocm Trigger "default" config CI on ROCm labels Aug 12, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Aug 12, 2025
@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/trunk Trigger trunk jobs on your pull request labels Aug 12, 2025
@jeffdaily (Collaborator)

Manually added the trunk label. Two unrelated tests are timing out, which would prevent merging, so trying the trunk label for additional signal.

@pruthvistony (Collaborator)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@jeffdaily (Collaborator)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 7, 2026
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx942.1)

Details for Dev Infra team. Raised by workflow job.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jan 7, 2026
@jeffdaily (Collaborator)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 7, 2026
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@pytorchmergebot (Collaborator)

Merge failed

Reason: 1 jobs have failed, first few of them are: Meta Internal-Only Changes Check

Details for Dev Infra team. Raised by workflow job.

@jeffdaily (Collaborator)

@pytorchbot merge -f "unrelated failures, Meta Internal approved by atalman"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@yangw-dev (Contributor)

@pytorchbot revert -m "already unland internally since it's broken internal tests, ticket: T251298521" -c ghfirst

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

Reverting PR 154864 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit fd27eb7b00d25438af1109055a85ab203a83c949 returned non-zero exit code 1

Auto-merging torch/_C/__init__.pyi.in
Auto-merging torch/backends/cuda/__init__.py
CONFLICT (content): Merge conflict in torch/backends/cuda/__init__.py
Auto-merging torch/csrc/Module.cpp
Auto-merging torch/testing/_internal/common_cuda.py
error: could not revert fd27eb7b00d... [ROCm] SDPA fix mem fault when dropout is enabled (#154864)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team. Raised by workflow job.

@yangw-dev (Contributor)

@pytorchbot revert -m "already unland internally since it's broken internal tests, ticket: T251298521" -c ghfirst

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot (Collaborator)

@alugorey your PR has been successfully reverted.

@alugorey (Contributor, Author)

@yangw-dev Can we please get some information about what is breaking? This is the second time this was reverted with no explanation beyond "internal breakage". Is there something we need to change on our end to get this to land?

@jeffdaily (Collaborator)

Hi @yangw-dev @drisspg. @alugorey has pushed changes that we think address your internal issue. Can you please import and run the test that was failing for you?

@alugorey (Contributor, Author) commented Feb 4, 2026

> Hi @yangw-dev @drisspg. @alugorey has pushed changes that we think address your internal issue. Can you please import and run the test that was failing for you?

ping @yangw-dev @drisspg

Can we please get some attention on this matter? The memory access fault we are fixing here is beginning to surface elsewhere, so we need this in ASAP.

@alugorey
Copy link
Contributor Author

Dropping this PR in favor of #174708; will close both when it lands.


Labels

ci-no-td (Do not run TD on this PR)
ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
ciflow/rocm-mi355 (Trigger "default" config CI on ROCm MI355 runners)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: rocm (AMD GPU support for Pytorch)
open source
release notes: rocm mandatorylabel
Reverted
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants