Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174708
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit bdddea3 with merge base e45dfba. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
@alugorey can you provide a bit more description on the PR: what exactly it fixes, and what changes were made after the initial attempt got reverted?
Hi @malfet, I updated the PR description. Please take a look.
@malfet one more look? :)
@claude explain to me what this PR is doing and what was broken before. Does the regression test adequately test this?
Claude finished @drisspg's task in 1m 17s. Analysis Complete.
**What This PR Does**

This PR fixes a device-side memory access fault when using dropout with the CK (Composable Kernel) SDPA backend on ROCm/AMD GPUs.

**The Bug (Before)**

In `mha_fwd_ck.hip`:

```cpp
// OLD (buggy) code - mha_fwd_ck.hip:287-298
auto rng_state = at::empty({2}, opts.dtype(at::kLong)); // Created on CPU!
auto rng_state_ptr = reinterpret_cast<uint64_t*>(rng_state.data_ptr());
// Later passed to GPU kernel - MEMORY FAULT!
auto drop_seed_offset = std::make_pair(rng_state_ptr, rng_state_ptr + 1);
```

The issue: the `rng_state` tensor was not guaranteed to live in GPU memory, so the pointers later handed to the kernel could reference invalid device addresses.

**The Fix**

The fix explicitly creates the `rng_state` tensor on the GPU:

```cpp
// NEW (fixed) code - mha_fwd_ck.hip:291-293
auto rng_state_options = at::TensorOptions().dtype(at::kUInt64).device(at::kCUDA);
auto rng_state = at::zeros({2}, rng_state_options.dtype(at::kUInt64));
```

Additionally:
**Why Previous Reverts**

The PR description mentions this is a reland of #154864, which was reverted due to internal test failures. The original fix accidentally changed the output tensor structure of `mha_fwd`.

**Regression Test Assessment**

Strengths:

Concerns:

Is it adequate? For a functional regression test, yes: if the bug were present, the test would crash with a memory access fault when running with dropout enabled on the CK backend.

**Summary**
I don't really trust Claude here, since `opts` should already have the correct device. Can you explain why your code changes fix the issue?
@alugorey IIRC it was a Tensor lifetime issue. Previously the seed/offset were referencing pointers to Tensors that went out of scope; this PR guarantees proper lifetime management of the seed/offset now.
Hi @drisspg, Jeff's assessment was right: it was a tensor lifetime issue. This is ready to land 👍
@pytorchbot merge -f "unrelated lint infra, other lint is passing; others are known flaky"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR needs a `release notes:` label.
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
This PR needs a `release notes:` label.
@pytorchbot label "topic: not user facing"
Force-pushed from 1dab287 to bdddea3.
@pytorchbot merge -f "resolved the doc errors that caused the prior revert, all other CI is passing"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot revert -m "sorry, need to revert this due to #172246 being reverted" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
@alugorey your PR has been successfully reverted.
This reverts commit 3f60bc4. Reverted #174708 on behalf of https://github.com/liangel-02 due to "sorry, need to revert this due to #172246 being reverted".
This PR was reopened (likely due to being reverted), so your approval was removed. Please request another review.
Fixes a device-side memory access fault when dropout is used with the CK SDPA backend. There was a bug where the philox seed/offset tensors were not being placed on the GPU correctly, resulting in a memory access fault.
This PR is an attempted reland of #154864, which was reverted a couple of times due to internal tests breaking. That was because the output tensors for mha_fwd were mistakenly changed, causing dynamo to complain. In the newest commit, we changed the output tensors back to what they were originally while preserving the bug fix for the device-side memory fault. This allowed the dynamo checks to succeed and avoided the original error.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @jataylo @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang