
Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery#144026

Merged
kit1980 merged 1 commit intorelease/2.6from
cherry-pick-142292-by-pytorch_bot_bot_
Jan 6, 2025
Conversation

@pytorchbot
Copy link
Collaborator

Reland of #140320 after failing test on trunk. Fixes potential environment clobbering in test, makes ROCr+HIP devices (if specified together) more robust to index errors.

Fixes #140318

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
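
The behavior this PR adds can be sketched as a small model of device discovery (a hypothetical illustration, not the actual PyTorch implementation; the function name `visible_device_indices` and the exact parsing rules are assumptions):

```python
import os

def visible_device_indices(total_devices, env=os.environ):
    """Return the device indices visible to this process.

    Hypothetical sketch: ROCR_VISIBLE_DEVICES, if set, restricts which
    devices the runtime enumerates, analogous to CUDA_VISIBLE_DEVICES.
    Out-of-range indices are skipped rather than raising an IndexError,
    mirroring the robustness fix described in the PR.
    """
    raw = env.get("ROCR_VISIBLE_DEVICES")
    if raw is None:
        return list(range(total_devices))
    indices = []
    for token in raw.split(","):
        token = token.strip()
        if token.isdigit() and int(token) < total_devices:
            indices.append(int(token))
    return indices

# With 4 physical GPUs and ROCR_VISIBLE_DEVICES="2,0":
print(visible_device_indices(4, {"ROCR_VISIBLE_DEVICES": "2,0"}))  # [2, 0]
```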

Pull Request resolved: #142292
Approved by: https://github.com/jataylo, https://github.com/huydhn, https://github.com/jeffdaily

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
(cherry picked from commit c0d7106)
@pytorch-bot
Copy link

pytorch-bot bot commented Dec 31, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/144026

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1587ee3 with merge base f7e621c (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kit1980 kit1980 merged commit 23e390c into release/2.6 Jan 6, 2025
@github-actions github-actions bot deleted the cherry-pick-142292-by-pytorch_bot_bot_ branch February 9, 2025 02:10
jataylo pushed a commit to jataylo/pytorch that referenced this pull request Feb 12, 2025

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#144026)

(cherry picked from commit 23e390c)
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request Feb 19, 2025

Respect ROCR_VISIBLE_DEVICES on AMD GPU device discovery (pytorch#144026) (#1895)

(cherry picked from commit 23e390c)

Co-authored-by: pytorchbot <soumith+bot@pytorch.org>
hijkzzz pushed a commit to OpenRLHF/OpenRLHF that referenced this pull request Mar 20, 2025
From v2.6.0, torch respects ROCR_VISIBLE_DEVICES during AMD GPU device
discovery (pytorch/pytorch#144026). Once this PR is merged together with
vllm-project/vllm#15246 on the vLLM side, we no longer need to set
`RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` when running OpenRLHF with
Ray on AMD nodes.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, also Fix AMD
support

### High-Level Design

The current approach for supporting AMD in verl is fundamentally
incorrect and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized
has no effect (see pytorch/pytorch#141678), which means that all the
current code wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.
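
The "no effect after initialization" behavior can be modeled with a simple read-once cache (a hypothetical sketch of the latching pattern, not torch internals):

```python
import os

_device_count = None  # cached on first query, like a lazily initialized runtime

def device_count():
    """Read the visible-device env var once; later changes are ignored,
    mimicking how CUDA/ROCm latch *_VISIBLE_DEVICES at initialization."""
    global _device_count
    if _device_count is None:
        raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        _device_count = len([t for t in raw.split(",") if t.strip()])
    return _device_count

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
first = device_count()                     # initializes: sees 2 devices
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # too late, runtime already latched
second = device_count()                    # still 2
print(first, second)  # 2 2
```

This is why the env vars must be set before anything triggers `torch.cuda.*` initialization.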

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES variables at
runtime, so those `torch.cuda.*` calls poison the current process if the
CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution here is to use a single environment variable
(`CUDA_VISIBLE_DEVICES`) for consistency and hardware-agnosticism, and
to fold all the other `*_VISIBLE_DEVICES` into it. Note that we must pay
attention when both the HIP/CUDA and ROCR env vars are set, as they have
different meanings. Both accept either a list of ints or a list of
UUIDs. The ROCR env var is processed first, which reduces the pool of
GPUs that HIP can then select from (referring to
pytorch/pytorch#144026). To avoid this complexity, we simply raise an
error if both are set (which also keeps consistency with Ray's practice
since 2.45.0).
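
The ROCr-then-HIP layering described above can be sketched as follows (a hypothetical model of the integer-index case; `select_devices` and `_parse` are illustrative names, and UUID specs are not handled):

```python
def _parse(spec, limit):
    """Parse a comma-separated index list, dropping out-of-range entries."""
    return [int(t) for t in spec.split(",") if t.strip().isdigit() and int(t) < limit]

def select_devices(physical, rocr_spec=None, hip_spec=None):
    """ROCR_VISIBLE_DEVICES is applied first against the physical devices;
    HIP_VISIBLE_DEVICES then indexes into that already-reduced pool."""
    pool = list(physical)
    if rocr_spec is not None:
        pool = [pool[i] for i in _parse(rocr_spec, len(pool))]
    if hip_spec is not None:
        pool = [pool[i] for i in _parse(hip_spec, len(pool))]
    return pool

gpus = ["gpu0", "gpu1", "gpu2", "gpu3"]
# ROCr exposes devices 1,2,3; HIP then picks index 1 of that pool -> gpu2
print(select_devices(gpus, rocr_spec="1,2,3", hip_spec="1"))  # ['gpu2']
```

The same HIP index means different physical devices depending on the ROCr setting, which is exactly why allowing both at once is error-prone.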

For the poisoning issue, before those 2 PRs are merged, we will need to
ask the users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer
manipulates these variables, and make verl workable when there is no
`*_VISIBLE_DEVICES`.

Note that for latest ray (after their switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both the Megatron and FSDP backends with vLLM.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
…Fix AMD support) (verl-project#1465)
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
…Fix AMD support) (verl-project#1465)
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
…Fix AMD support) (verl-project#1465)