
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#144026

Merged
kit1980 merged 1 commit intorelease/2.6from
cherry-pick-142292-by-pytorch_bot_bot_
Jan 6, 2025
Conversation

@pytorchbot
Copy link
Collaborator

Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
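
The behavior this PR adds can be sketched as a small model of device discovery (a hypothetical illustration, not the actual PyTorch implementation; the function name `visible_device_indices` and the exact parsing rules are assumptions):

```python
import os

def visible_device_indices(total_devices, env=os.environ):
    """Return the device indices visible to this process.

    Hypothetical sketch: ROCR_VISIBLE_DEVICES, if set, restricts which
    devices the runtime enumerates, analogous to CUDA_VISIBLE_DEVICES.
    Out-of-range indices are skipped rather than raising an IndexError,
    mirroring the robustness fix described in the PR.
    """
    raw = env.get("ROCR_VISIBLE_DEVICES")
    if raw is None:
        return list(range(total_devices))
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if token.isdigit() and int(token) < total_devices:
            indices.append(int(token))
    return indices

# With 4 physical GPUs and ROCR_VISIBLE_DEVICES="2,0":
print(visible_device_indices(4, {"ROCR_VISIBLE_DEVICES": "2,0"}))  # [2, 0]
```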

Pull Request resolved: #142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144026

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1587ee3 with merge base f7e621c (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kit1980 kit1980 merged commit 23e390c into release/2.6 Jan 6, 2025
@github-actions github-actions bot deleted the cherry-pick-142292-by-pytorch_bot_bot_ branch February 9, 2025 02:10
jataylo pushed a commit to jataylo/pytorch that referenced this pull request Feb 12, 2025

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#144026)

(cherry picked from commit 23e390c)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Feb 19, 2025

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#144026) (#1895)

(cherry picked from commit 23e390c)

Co-authored-by: pytorchbot <soumith+bot@pytorch.org>
hijkzzz pushed a commit to OpenRLHF/OpenRLHF that referenced this pull request Mar 20, 2025
From v2.6.0, torch respects ROCR_VISIBLE_DEVICES during AMD GPU device
discovery (pytorch/pytorch#144026). Once this PR is merged together with
vllm-project/vllm#15246 on the vLLM side, we no longer need to set
`RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` when running OpenRLHF with
Ray on AMD nodes.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, also Fix AMD
support

### High-Level Design

The current approach for supporting AMD in verl is fundamentally
incorrect and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized
has no effect (see pytorch/pytorch#141678), which means that all the
current code wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.
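
The "no effect after initialization" behavior can be modeled with a simple read-once cache (a hypothetical sketch of the latching pattern, not torch internals):

```python
import os

_device_count = None  # cached on first query, like a lazily initialized runtime

def device_count():
    """Read the visible-device env var once; later changes are ignored,
    mimicking how CUDA/ROCm latch *_VISIBLE_DEVICES at initialization."""
    global _device_count
    if _device_count is None:
        raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        _device_count = len([t for t in raw.split(",") if t.strip()])
    return _device_count

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
first = device_count()                     # initializes: sees 2 devices
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # too late, runtime already latched
second = device_count()                    # still 2
print(first, second)  # 2 2
```

This is why the env vars must be set before anything triggers `torch.cuda.*` initialization.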

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES variables at
runtime, so those `torch.cuda.*` calls poison the current process if the
CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution here is to use a single environment variable
(`CUDA_VISIBLE_DEVICES`) for consistency and hardware-agnosticism, and
to fold all the other `*_VISIBLE_DEVICES` into it. Note that we must pay
attention when both the HIP/CUDA and ROCR env vars are set, as they have
different meanings. Both accept either a list of ints or a list of
UUIDs. The ROCR env var is processed first, which reduces the pool of
GPUs that HIP can then select from (referring to
pytorch/pytorch#144026). To avoid this complexity, we simply raise an
error if both are set (which also keeps consistency with Ray's practice
since 2.45.0).
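
The ROCr-then-HIP layering described above can be sketched as follows (a hypothetical model of the integer-index case; `select_devices` and `_parse` are illustrative names, and UUID specs are not handled):

```python
def _parse(spec, limit):
    """Parse a comma-separated index list, dropping out-of-range entries."""
    return [int(t) for t in spec.split(",") if t.strip().isdigit() and int(t) < limit]

def select_devices(physical, rocr_spec=None, hip_spec=None):
    """ROCR_VISIBLE_DEVICES is applied first against the physical devices;
    HIP_VISIBLE_DEVICES then indexes into that already-reduced pool."""
    pool = list(physical)
    if rocr_spec is not None:
        pool = [pool[i] for i in _parse(rocr_spec, len(pool))]
    if hip_spec is not None:
        pool = [pool[i] for i in _parse(hip_spec, len(pool))]
    return pool

gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
# ROCr exposes devices 1,2,3; HIP then picks index 1 of that pool -> gpu2
print(select_devices(gpus, rocr_spec="1,2,3", hip_spec="1"))  # ['gpu2']
```

The same HIP index means different physical devices depending on the ROCr setting, which is exactly why allowing both at once is error-prone.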

For the poisoning issue, before those 2 PRs are merged, we will need to
ask the users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer
manipulates these variables, and make verl workable when there is no
`*_VISIBLE_DEVICES`.

Note that for latest ray (after their switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both the Megatron and FSDP backends with vLLM.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
…Fix AMD support) (verl-project#1465)
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
…Fix AMD support) (verl-project#1465)
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
…Fix AMD support) (verl-project#1465)