[ROCm] Implemented dropout usage for RNN with MIOpen backend#144572
[ROCm] Implemented dropout usage for RNN with MIOpen backend#144572iupaikov-amd wants to merge 11 commits intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144572
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 0741dce with merge base 863ac20 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Initial questions:
Addressed issues after getting some feedback:
|
|
@pytorchbot label "topic: not user facing" |
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
ab5de54 to
b221755
Compare
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Successfully rebased |
3f15d02 to
918549d
Compare
|
Thanks everyone extensive code review! Due to the issue with RNG seed usage we decided to postpone this PR to later release. For now I will keep this PR in draft. The main problem is that when we come down to cpp land for MIOpen API usage we create all of the supporting structs there for each call of the RNN function. This causes RNG machine to generate the same random numbers if we use the same seed. We either will need to keep track of the python context in cpp part of code or have a gpu mem allocated in python and always pass it down. |
…OCm (#1970) RNN dropout PR for upstream still in review pytorch#144572 Fixes SWDEV-517343
Currently ROCm doesn't support dropout value for RNN PR to enable RNN dropout on ROCm still in review pytorch#144572
PR to skip test_nn.py::TestNN::test_RNN_dropout_state Currently ROCm doesn't support dropout value for RNN PR to enable RNN dropout on ROCm still in review and blocked #144572 Fixes: #68849 Pull Request resolved: #149446 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
|
@pytorchbot rebase |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
|
Rebase failed due to Command Raised by https://github.com/pytorch/pytorch/actions/runs/14111718518 |
|
I implemented a proper way of dropout state restoration that doesn't require setting a seed to random number to produce correct results for a linked test. For now the seed is always set to 0, but this will later be addressed in a different PR. There are some complications with HIP API regarding random number generator tied to a device. We decided to store each dropout state in thread_local storage to avoid multithreading issues here. Dropout descriptor needs to also be stored in such way for API to function properly. From initial tests this shouldn't cause any issues or impact performance in major way, but we will continue to monitor this and rework this if needed. Also since we need to store the dropout state size and later probably some other stuff as well if we need a descriptor for each device I decided to leave a struct approach to manage the workspace memory. Tried the proposed variant but readability of the code suffered in a major way and I changed it back to struct with a more specific naming. |
PR to skip test_nn.py::TestNN::test_RNN_dropout_state Currently ROCm doesn't support dropout value for RNN PR to enable RNN dropout on ROCm still in review and blocked pytorch#144572 Fixes: pytorch#68849 Pull Request resolved: pytorch#149446 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
|
@jeffdaily can we merge this? The issue is just credentials. |
|
@pytorchbot merge -f "flaky test failure on rocm, safe to merge" |
Merge startedYour change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
This PR fixes #107183 for ROCm.
Implemented the usage of new RNN descriptor for MIOpen backend that takes into account dropout rate value using dropout descriptor. This fixes associated test_RNN_dropout_state test.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @ColinPeppler @desertfire