Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#142292
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#142292tbennun wants to merge 12 commits intopytorch:mainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142292
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit f1ec857 with merge base b0c3f48 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot label "topic: not user facing" |
|
FYI (previous approvers) @jataylo @jithunnair-amd @eqy @jeffdaily |
torch/cuda/__init__.py
Outdated
| else: | ||
| rocr_count = len(rocr_devics.split(",")) | ||
| if hip_devices is not None: | ||
| if hip_devices.startswith("GPU-"): |
There was a problem hiding this comment.
could HIP_VISIBLE_DEVICES be set to 0,GPU-uuid,2? If so, this check should be on any(dev for dev in hip_devices.split(","))
There was a problem hiding this comment.
@jeffdaily Although not a common use case it could be possible for the user to use ROCR_VISIBLE_DEVICES in combination with HIP_VISIBLE_DEVICES with UUIDs
e.g. may use ROCR to expose only GPUs "0,1" in some environment but the user in that environment may want to use UUIDs in HIP_VISIBLE_DEVICES to ensure they run on the requested gpu.
There was a problem hiding this comment.
I was wrong about ROCR env var only taking ints. It can take UUIDs also. So ROCR and HIP env vars are similar today, but ROCR env var activates first lower-level than HIP env var so it will reduce the number of devices that HIP can select from. I think the best we can do now is go back to the simple check that HIP env var isn't asking for more devices than are visible.
|
@jeffdaily @jataylo I think this is ready for testing before merging, apart from @jataylo 's comment on mixing UUIDs and indices in |
jataylo
left a comment
There was a problem hiding this comment.
Looks good to me, we're good to go. Lets run CI.
(Slight unknown to me is how ROCR_VISIBLE_DEVICES will operate inside the os.env inside the UT, but if we see some issues there lets just remove the test)
|
@jeffdaily can we get your approval then lets reland this and open a cherry pick for release 2.6 |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
Successfully rebased |
7893dbb to
f1ec857
Compare
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
@pytorchbot cherry-pick --onto release/2.6 |
|
❌ 🤖 pytorchbot command failed: Try |
|
@pytorchbot cherry-pick --onto release/2.6 -c critical |
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: #142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106)
Cherry picking #142292The cherry pick PR is at #144026 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated: Details for Dev Infra teamRaised by workflow job |
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292) Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes #140318 Pull Request resolved: #142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com> (cherry picked from commit 23e390c)
…covery (pytorch#144026) (#1895) Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292) Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors. Fixes pytorch#140318 Pull Request resolved: pytorch#142292 Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com> Co-authored-by: Jeff Daily <jeff.daily@amd.com> (cherry picked from commit c0d7106) Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com> (cherry picked from commit 23e390c) Fixes #ISSUE_NUMBER Co-authored-by: pytorchbot <soumith+bot@pytorch.org>
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.
Fixes #140318
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd