[ROCm] SDPA fix mem fault when dropout is enabled #154864
alugorey wants to merge 4 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154864
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 6 Unrelated Failures as of commit a5a6140 with merge base ed0b1fe.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
jeffdaily
left a comment
Why did you add template <int Version = 2>?
aten/src/ATen/native/transformers/hip/flash_attn/ck/fmha_bwd.hpp
Force-pushed from 6ca22cc to dd7dfe9
Force-pushed from ca719d8 to 9c02e37
Force-pushed from 9c02e37 to 22ff165
jeffdaily
left a comment
hipify changes to torch/csrc/Module.cpp were committed. Please revert those.
Manually added trunk label. Two unrelated tests are timing out that would prevent merging, so trying trunk label for additional signal.
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed, first few of them are: trunk / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.gfx942.1). Details for Dev Infra team: raised by workflow job.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: 1 job has failed, first few of them are: Meta Internal-Only Changes Check. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge -f "unrelated failures, Meta Internal approved by atalman"

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "already unlanded internally since it broke internal tests, ticket: T251298521" -c ghfirst

@pytorchbot successfully started a revert job. Check the current status here.

Reverting PR 154864 failed. Reason: Command. Details for Dev Infra team: raised by workflow job.
@pytorchbot revert -m "already unlanded internally since it broke internal tests, ticket: T251298521" -c ghfirst

@pytorchbot successfully started a revert job. Check the current status here.

@alugorey your PR has been successfully reverted.
@yangw-dev Can we please get some information as to what is breaking? This is the second time this PR was reverted without any explanation beyond "internal breakage". Is there something we need to change on our end to get this to land?
Hi @yangw-dev @drisspg. @alugorey has pushed changes that we think address your internal issue. Can you please import and run the test that was failing for you?

ping @yangw-dev @drisspg Can we please get some attention on this matter? The memory access fault we are fixing here is beginning to surface elsewhere, so we need this in ASAP.
Dropping this PR in favor of #174708, which will close both when it lands.
Fixes an issue that exhibited a device-side memory access fault due to incorrect tensor lifetime management.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @dllehr-amd