
[ROCm] Implemented dropout usage for RNN with MIOpen backend #144572

Closed
iupaikov-amd wants to merge 11 commits into pytorch:main from ROCm:iupaikov_rnn_dropout_state_fix_upstream

Conversation

@iupaikov-amd
Collaborator

@iupaikov-amd iupaikov-amd commented Jan 10, 2025

This PR fixes #107183 for ROCm.

Implemented the use of a new RNN descriptor for the MIOpen backend that takes the dropout rate into account via a dropout descriptor. This fixes the associated test_RNN_dropout_state test.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ColinPeppler @desertfire

@pytorch-bot

pytorch-bot bot commented Jan 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144572

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 0741dce with merge base 863ac20:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@iupaikov-amd
Collaborator Author

iupaikov-amd commented Jan 21, 2025

Initial questions:
I need opinions on some key issues:

  1. To avoid duplicating a lot of code, I currently prepare the DropoutDescriptor and its memory inside the RNNDescriptors constructor. Another solution would be to move the memory management into DropoutDescriptor itself, but that would complicate it beyond being a simple wrapper, and we may lose some control over its internals from outside. Alternatively, all of the memory management could be moved inside the miopen_rnn functions, but that would produce three identical code snippets, one per function.
  2. Struct for the device memory pointer. I currently use a simple approach via hipMalloc, but we do an almost identical thing in Conv_miopen with the Workspace struct. Should they be combined and moved to a utils file?
  3. hipMalloc vs c10::hip::HIPCachingAllocator::raw_alloc. I just don't know the difference; which should we use in the general situation?
  4. The RNG seed is currently unused, and there is no way to set it from the PyTorch API. Should we give users that control, or stick to setting the seed randomly?

Addressed issues after getting some feedback:

  1. Decided to leave the DropoutDescriptor usage as is; the MIOpen API doesn't really allow for a more elegant solution.
  2. Moved memory management into a separate header file shared by the Conv and RNN files.
  3. Direct usage of hipMalloc breaks hipGraph support, so the HIPCachingAllocator is now used everywhere.
  4. Will be addressed in a separate issue/PR.

@iupaikov-amd
Collaborator Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jan 21, 2025
@iupaikov-amd iupaikov-amd marked this pull request as ready for review January 21, 2025 17:14
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased iupaikov_rnn_dropout_state_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_rnn_dropout_state_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_rnn_dropout_state_fix_upstream branch from ab5de54 to b221755 Compare January 21, 2025 17:18
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Jan 21, 2025
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased iupaikov_rnn_dropout_state_fix_upstream onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout iupaikov_rnn_dropout_state_fix_upstream && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the iupaikov_rnn_dropout_state_fix_upstream branch from 3f15d02 to 918549d Compare January 22, 2025 20:26
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jan 23, 2025
@iupaikov-amd
Collaborator Author

Thanks everyone for the extensive code review! Due to the issue with RNG seed usage, we decided to postpone this PR to a later release. For now I will keep it in draft.

The main problem is that when we come down to the C++ land for MIOpen API usage, we create all of the supporting structs there on each call of the RNN function. This causes the RNG to generate the same random numbers whenever the same seed is used. We will either need to keep track of the Python context in the C++ part of the code, or have GPU memory allocated in Python and always pass it down.

@iupaikov-amd iupaikov-amd marked this pull request as draft February 25, 2025 17:20
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Mar 18, 2025
…OCm (#1970)

RNN dropout PR for upstream still in review pytorch#144572

Fixes SWDEV-517343
dnikolaev-amd added a commit to ROCm/pytorch that referenced this pull request Mar 18, 2025
Currently ROCm doesn't support dropout value for RNN
PR to enable RNN dropout on ROCm still in review pytorch#144572
pytorchmergebot pushed a commit that referenced this pull request Mar 20, 2025
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked #144572

Fixes: #68849

Pull Request resolved: #149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm labels Mar 27, 2025
@iupaikov-amd
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/144572/head returned non-zero exit code 1

Rebasing (1/7)
Rebasing (2/7)
Auto-merging aten/src/ATen/CMakeLists.txt
Auto-merging aten/src/ATen/native/miopen/Conv_miopen.cpp
CONFLICT (content): Merge conflict in aten/src/ATen/native/miopen/Conv_miopen.cpp
error: could not apply dc9591a80b5... Moved gpu memory management to a separate header utility
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply dc9591a80b5... Moved gpu memory management to a separate header utility

Raised by https://github.com/pytorch/pytorch/actions/runs/14111718518

@iupaikov-amd
Collaborator Author

I implemented a proper way of dropout state restoration that doesn't require setting the seed to a random number to produce correct results for the linked test. For now the seed is always set to 0, but this will be addressed in a different PR; there are some complications with the HIP API regarding the random number generator tied to a device.

We decided to store each dropout state in thread_local storage to avoid multithreading issues here. The dropout descriptor also needs to be stored this way for the API to function properly. From initial tests this shouldn't cause any issues or impact performance in a major way, but we will continue to monitor this and rework it if needed.

Also, since we need to store the dropout state size (and probably other fields later, if we need a descriptor for each device), I decided to keep the struct approach for managing the workspace memory. I tried the proposed variant, but code readability suffered in a major way, so I changed it back to a struct with more specific naming.

@iupaikov-amd iupaikov-amd marked this pull request as ready for review April 8, 2025 15:14
@jeffdaily jeffdaily added the ciflow/rocm Trigger "default" config CI on ROCm label Apr 15, 2025
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
PR to skip test_nn.py::TestNN::test_RNN_dropout_state
Currently ROCm doesn't support dropout value for RNN

PR to enable RNN dropout on ROCm still in review and blocked pytorch#144572

Fixes: pytorch#68849

Pull Request resolved: pytorch#149446
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
@iupaikov-amd
Collaborator Author

@jeffdaily can we merge this? The issue is just credentials.

@jeffdaily
Collaborator

@pytorchbot merge -f "flaky test failure on rocm, safe to merge"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here


Labels

ci-no-td (Do not run TD on this PR), ciflow/rocm (Trigger "default" config CI on ROCm), Merged, module: inductor, module: rocm (AMD GPU support for PyTorch), open source, Reverted, topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

DISABLED test_RNN_dropout_state (__main__.TestNN)

9 participants