Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery #140320
tbennun wants to merge 5 commits into pytorch:main from
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140320
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit b5f6863 with merge base 8bdcdae. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot label "topic: not user facing" |
|
I left a more extended comment here: #140318 (comment). I'm not sure adding ROCR_VISIBLE_DEVICES to the |
|
Added a new PR here, #140398, which will ensure interoperability between the two visible-devices environment variables.
jithunnair-amd left a comment
Looks right. @tbennun can you please post some examples showing what the updated logic gives as output of _parse_visible_devices()?
@jithunnair-amd Of course, essentially now PyTorch takes @jataylo already gave some examples in #140318 (comment) so I can start from there. Would you like this in the form of a test with, e.g., |
|
@jithunnair-amd @jataylo please re-review. I added tests, fixed the linter issue, and improved the behavior when both environment variables are given. Thanks! |
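For readers following along, the behavior when both environment variables are given can be sketched roughly as below. This is an illustrative sketch only, not the code from this PR: `parse_indices` and `effective_devices` are hypothetical helpers, only integer indices are handled (no UUIDs), and it assumes that ROCR_VISIBLE_DEVICES filters the physical devices first and that HIP_VISIBLE_DEVICES then indexes into whatever ROCr left visible.

```python
from typing import List, Mapping


def parse_indices(value: str) -> List[int]:
    """Parse a comma-separated list of non-negative integer indices.

    Hypothetical helper: parsing stops at the first entry that is not a
    plain non-negative integer, so "0,x,2" yields [0].
    """
    indices: List[int] = []
    for item in value.split(","):
        item = item.strip()
        if not (item.isascii() and item.isdigit()):
            break
        indices.append(int(item))
    return indices


def effective_devices(env: Mapping[str, str], physical_count: int) -> List[int]:
    """Return the physical device ids left visible after both filters.

    Assumption: ROCR_VISIBLE_DEVICES is applied first, and
    HIP_VISIBLE_DEVICES indexes into the list that ROCr left visible.
    """
    devices = list(range(physical_count))
    for name in ("ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES"):
        if name not in env:
            continue
        selected: List[int] = []
        for idx in parse_indices(env[name]):
            if idx >= len(devices):
                # Out-of-range index: stop instead of raising IndexError.
                break
            selected.append(devices[idx])
        devices = selected
    return devices


if __name__ == "__main__":
    # ROCr exposes physical devices 1 and 2; HIP then picks the second of those.
    print(effective_devices(
        {"ROCR_VISIBLE_DEVICES": "1,2", "HIP_VISIBLE_DEVICES": "1"}, 4))  # [2]
    # HIP_VISIBLE_DEVICES=0 combined with ROCR_VISIBLE_DEVICES=0,1,2 leaves one device.
    print(effective_devices(
        {"HIP_VISIBLE_DEVICES": "0", "ROCR_VISIBLE_DEVICES": "0,1,2"}, 4))  # [0]
```

Under these assumptions the device count is simply the length of the returned list, and an out-of-range index shortens the list rather than raising an error.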
|
@jithunnair-amd @jataylo Any updates on this PR? Thanks! |
|
LGTM, approved workflow to see if the UT passes. |
|
@pytorchbot merge |
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failed
Reason: 1 jobs have failed, first few of them are: trunk / win-vs2019-cuda12.1-py3 / build
Details for Dev Infra team: Raised by workflow job |
|
The failures seem unrelated to this PR and related to the version of distutils. |
|
@pytorchbot rebase |
|
Rebasing to see if unrelated errors go away :) |
|
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot revert -m 'Sorry for reverting your change but test_hip_device_count is failing in trunk after this land' -c nosignal test_cuda.py::TestCuda::test_hip_device_count GH job link HUD commit link |
|
@pytorchbot successfully started a revert job. Check the current status here. |
|
@tbennun your PR has been successfully reverted. |
…0320)" This reverts commit add4a42. Reverted #140320 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_hip_device_count is failing in trunk after this land ([comment](#140320 (comment)))
|
@huydhn Thanks. The test passed locally, I will check. Strange that it didn't show up in the tests. Are rocm tests not running on PRs? If not, is there a way I can trigger them? |
|
It could be an issue with our target determination when some tests were wrongly skipped. Let me double check that, but you could add |
|
It doesn't look like that's the case. I think you just need to add |
|
Turns out the rocm tests did run (and terminated successfully) before: https://github.com/pytorch/pytorch/actions/runs/12199014152/job/34033407881#step:15:961 |
Fixes #140318 Pull Request resolved: #140320 Approved by: https://github.com/eqy, https://github.com/jithunnair-amd, https://github.com/jataylo, https://github.com/jeffdaily Co-authored-by: Jack Taylor <jack.taylor@amd.com>
|
@tbennun I wonder if it has anything to do with the CI jobs setting HIP_VISIBLE_DEVICES=0 before execution then us trying to set ROCR_VISIBLE_DEVICES=0,1,2 at runtime during the UT. The application may only see a single GPU and then try to set device_count to 3. Might need some local testing to try and reproduce. |
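One way to rule out this kind of interference when reproducing locally is to run the check with a known set of visibility variables and restore the original environment afterwards. The snippet below is a minimal sketch under that assumption, not the fix that landed: `run_with_visibility` is a hypothetical helper built on the standard-library `unittest.mock.patch.dict`. Because the GPU runtime most likely reads these variables only once at initialization, a real test would probably need to perform the device query in a fresh subprocess.

```python
import os
from typing import Callable, Mapping, TypeVar
from unittest import mock

T = TypeVar("T")

# Variables that influence device visibility and may be preset by CI.
VISIBILITY_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "HIP_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
)


def run_with_visibility(overrides: Mapping[str, str], fn: Callable[[], T]) -> T:
    """Run fn() with only the given visibility variables set.

    Hypothetical helper: patch.dict snapshots os.environ on entry and
    restores it on exit, so values preset by CI are neither seen by fn()
    nor clobbered for later tests.
    """
    with mock.patch.dict(os.environ, clear=False):
        for name in VISIBILITY_VARS:
            os.environ.pop(name, None)
        os.environ.update(overrides)
        return fn()


if __name__ == "__main__":
    os.environ["HIP_VISIBLE_DEVICES"] = "0"  # simulate the CI setting
    seen = run_with_visibility(
        {"ROCR_VISIBLE_DEVICES": "0,1,2"},
        lambda: {name: os.environ.get(name) for name in VISIBILITY_VARS},
    )
    print(seen)                               # values visible inside the call
    print(os.environ["HIP_VISIBLE_DEVICES"])  # original value restored
```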
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: #142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292) Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: #142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com> (cherry picked from commit 23e390c)
…covery (pytorch#144026) (#1895) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com> (cherry picked from commit 23e390c) Fixes #ISSUE_NUMBER Co-authored-by: pytorchbot <soumith+bot@pytorch.org>
Fixes #140318