Skip to content

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#142292

Closed
tbennun wants to merge 12 commits intopytorch:mainfrom
tbennun:rocr-visible-devices
Closed

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#142292
tbennun wants to merge 12 commits intopytorch:mainfrom
tbennun:rocr-visible-devices

Conversation

@tbennun
Copy link
Contributor

@tbennun tbennun commented Dec 7, 2024

Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/142292

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f1ec857 with merge base b0c3f48 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tbennun
Copy link
Contributor Author

tbennun commented Dec 7, 2024

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Dec 7, 2024
@tbennun
Copy link
Contributor Author

tbennun commented Dec 7, 2024

FYI (previous approvers) @jataylo @jithunnair-amd @eqy @jeffdaily

@janeyx99 janeyx99 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Dec 9, 2024
else:
rocr_count = len(rocr_devics.split(","))
if hip_devices is not None:
if hip_devices.startswith("GPU-"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could HIP_VISIBLE_DEVICES be set to 0,GPU-uuid,2? If so, this check should be on any(dev for dev in hip_devices.split(","))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffdaily Although not a common use case it could be possible for the user to use ROCR_VISIBLE_DEVICES in combination with HIP_VISIBLE_DEVICES with UUIDs

e.g. may use ROCR to expose only GPUs "0,1" in some environment but the user in that environment may want to use UUIDs in HIP_VISIBLE_DEVICES to ensure they run on the requested gpu.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wrong about ROCR env var only taking ints. It can take UUIDs also. So ROCR and HIP env vars are similar today, but ROCR env var activates first lower-level than HIP env var so it will reduce the number of devices that HIP can select from. I think the best we can do now is go back to the simple check that HIP env var isn't asking for more devices than are visible.

@tbennun
Copy link
Contributor Author

tbennun commented Dec 20, 2024

@jeffdaily @jataylo I think this is ready for testing before merging, apart from @jataylo 's comment on mixing UUIDs and indices in HIP_VISIBLE_DEVICES.

@jeffdaily jeffdaily requested a review from jataylo December 20, 2024 22:28
Copy link
Collaborator

@jataylo jataylo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, we're good to go. Lets run CI.

(Slight unknown to me is how ROCR_VISIBLE_DEVICES will operate inside the os.env inside the UT, but if we see some issues there lets just remove the test)

@jataylo
Copy link
Collaborator

jataylo commented Dec 23, 2024

@jeffdaily can we get your approval then lets reland this and open a cherry pick for release 2.6

@pruthvistony pruthvistony added module: rocm AMD GPU support for Pytorch rocm This tag is for PRs from ROCm team ciflow/rocm Trigger "default" config CI on ROCm labels Dec 23, 2024
@jataylo jataylo requested review from atalman and huydhn December 23, 2024 17:38
@jataylo
Copy link
Collaborator

jataylo commented Dec 23, 2024

@atalman @huydhn Hi both, could you help us re-land this PR, we would like to get this bug fix into 2.6 cherry picks also.

@huydhn
Copy link
Contributor

huydhn commented Dec 23, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Dec 23, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased rocr-visible-devices onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rocr-visible-devices && git pull --rebase)

@jataylo
Copy link
Collaborator

jataylo commented Dec 24, 2024

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@jataylo
Copy link
Collaborator

jataylo commented Dec 31, 2024

@pytorchbot cherry-pick --onto release/2.6

@pytorch-bot
Copy link

pytorch-bot bot commented Dec 31, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot cherry-pick: error: the following arguments are required: -c/--classification

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Try @pytorchbot --help for more info.

@jataylo
Copy link
Collaborator

jataylo commented Dec 31, 2024

@pytorchbot cherry-pick --onto release/2.6 -c critical

pytorchbot pushed a commit that referenced this pull request Dec 31, 2024
Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: #142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)
@pytorchbot
Copy link
Collaborator

Cherry picking #142292

The cherry pick PR is at #144026 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team Raised by workflow job

kit1980 pushed a commit that referenced this pull request Jan 6, 2025
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (#142292)

Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

Pull Request resolved: #142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)

Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
jataylo pushed a commit to jataylo/pytorch that referenced this pull request Feb 12, 2025
)

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292)

Reland of pytorch#140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes pytorch#140318

Pull Request resolved: pytorch#142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)

Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
(cherry picked from commit 23e390c)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Feb 19, 2025
…covery (pytorch#144026) (#1895)

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#142292)

Reland of pytorch#140320 after failing test on trunk. Fixes potential
environment clobbering in test, makes ROCr+HIP devices (if specified
together) more robust to index errors.

Fixes pytorch#140318

Pull Request resolved: pytorch#142292
Approved by: https://github.com/jataylo, https://github.com/huydhn,
https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)

Co-authored-by: Tal Ben-Nun <tbennun@users.noreply.github.com>
(cherry picked from commit 23e390c)

Fixes #ISSUE_NUMBER

Co-authored-by: pytorchbot <soumith+bot@pytorch.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/rocm Trigger "default" config CI on ROCm ciflow/trunk Trigger trunk jobs on your pull request Merged module: rocm AMD GPU support for Pytorch open source rocm This tag is for PRs from ROCm team topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Visible devices are not respected on AMD systems

9 participants