
[ROCm] Implemented dropout usage for RNN with MIOpen backend #144572

Closed
iupaikov-amd wants to merge 11 commits into pytorch:main from ROCm:iupaikov_rnn_dropout_state_fix_upstream

Conversation

@iupaikov-amd
Collaborator

@iupaikov-amd iupaikov-amd commented Jan 10, 2025

This PR fixes #107183 for ROCm.

Implemented the use of a new RNN descriptor for the MIOpen backend that takes the dropout rate into account via a dropout descriptor. This fixes the associated test_RNN_dropout_state test.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ColinPeppler @desertfire

@pytorch-bot

pytorch-bot bot commented Jan 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144572

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0741dce with merge base 863ac20:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@iupaikov-amd
Collaborator Author

iupaikov-amd commented Jan 21, 2025

Initial questions:
I need opinions on some key issues:

  1. To avoid duplicating a lot of code, I currently prepare the DropoutDescriptor and its memory inside the RNNDescriptors constructor. Another solution would be to move the memory management into DropoutDescriptor itself, but that would complicate it beyond being a simple wrapper, and we may lose some control over its internals from outside. Alternatively, all of the memory management could be moved inside the miopen_rnn functions, but that would produce three identical code snippets, one per function.
  2. Struct for the device memory pointer. I currently use a simple approach via hipMalloc, but we do an almost identical thing in Conv_miopen with the Workspace struct. Should they be combined and moved to a utils file?
  3. hipMalloc vs c10::hip::HIPCachingAllocator::raw_alloc. I just don't know the difference; which should we use in the general situation?
  4. The RNG seed is currently unused, and there is no way to set it from the PyTorch API. Should we give users that control, or stick to setting the seed randomly?

Addressed issues after getting some feedback:

  1. Decided to leave the DropoutDescriptor usage as is; the MIOpen API doesn't really allow for a more elegant solution.
  2. Moved memory management into a separate header file shared by the Conv and RNN files.
  3. Direct usage of hipMalloc breaks hipGraph support, so the HIPCachingAllocator is now used everywhere.
  4. Will be addressed in a separate issue/PR.

@iupaikov-amd
Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 21, 2025
@iupaikov-amd iupaikov-amd marked this pull request as ready for review January 21, 2025 17:14
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased iupaikov_rnn_dropout_state_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_rnn_dropout_state_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_rnn_dropout_state_fix_upstream branch from ab5de54 to b221755 Compare January 21, 2025 17:18
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Jan 21, 2025
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased iupaikov_rnn_dropout_state_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_rnn_dropout_state_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_rnn_dropout_state_fix_upstream branch from 3f15d02 to 918549d Compare January 22, 2025 20:26
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jan 23, 2025
@iupaikov-amd
Collaborator Author

Thanks everyone for the extensive code review! Due to the issue with RNG seed usage, we decided to postpone this PR to a later release. For now I will keep it in draft.

The main problem is that when we come down to the C++ land for MIOpen API usage, we create all of the supporting structs there on each call of the RNN function. This causes the RNG to generate the same random numbers whenever the same seed is used. We will either need to keep track of the Python context in the C++ part of the code, or have GPU memory allocated in Python and always pass it down.

@iupaikov-amd iupaikov-amd marked this pull request as draft February 25, 2025 17:20
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Mar 18, 2025
…OCm (#1970)

RNN dropout PR for upstream still in review pytorch#144572

Fixes SWDEV-517343
dnikolaev-amd added a commit to ROCm/pytorch that referenced this pull request Mar 18, 2025
Currently ROCm doesn't support dropout value for RNN
PR to enable RNN dropout on ROCm still in review pytorch#144572
pytorchmergebot pushed a commit that referenced this pull request Mar 20, 2025
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked #144572

Fixes: #68849

Pull Request resolved: #149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm labels Mar 27, 2025
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/144572/head returned non-zero exit code 1

Rebasing (1/7)
Rebasing (2/7)
Auto-merging aten/src/ATen/CMakeLists.txt
Auto-merging aten/src/ATen/native/miopen/Conv_miopen.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/miopen/Conv_miopen.cpp
error: could not apply dc9591a80b5... Moved gpu memory management to a separate header utility
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply dc9591a80b5... Moved gpu memory management to a separate header utility

Raised by https://github.com/pytorch/pytorch/actions/runs/14111718518

@iupaikov-amd
Collaborator Author

I implemented a proper way of dropout state restoration that doesn't require setting the seed to a random number to produce correct results for the linked test. For now the seed is always set to 0, but this will be addressed in a different PR; there are some complications with the HIP API regarding the random number generator tied to a device.

We decided to store each dropout state in thread_local storage to avoid multithreading issues here. The dropout descriptor also needs to be stored this way for the API to function properly. From initial tests this shouldn't cause any issues or impact performance in a major way, but we will continue to monitor this and rework it if needed.

Also, since we need to store the dropout state size (and probably other fields later, if we need a descriptor for each device), I decided to keep the struct approach for managing the workspace memory. I tried the proposed variant, but code readability suffered in a major way, so I changed it back to a struct with more specific naming.

@iupaikov-amd iupaikov-amd marked this pull request as ready for review April 8, 2025 15:14
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Apr 15, 2025
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked pytorch#144572

Fixes: pytorch#68849

Pull Request resolved: pytorch#149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
@iupaikov-amd
Collaborator Author

@jeffdaily can we merge this? The issue is just credentials.

@jeffdaily
Collaborator

@pytorchbot merge -f "flaky test failure on rocm, safe to merge"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here


Labels

ci-no-td (Do not run TD on this PR), ciflow/rocm (Trigger "default" config CI on ROCm), Merged, module: inductor, module: rocm (AMD GPU support for PyTorch), open source, Reverted, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

DISABLED test_RNN_dropout_state (__main__.TestNN)

9 participants