
Avoid poisoning process with CUDA calls as soon as importing#183

Merged
ipanfilo merged 1 commit into ROCm:dev from HollowMan6:poison
May 14, 2025

Conversation

Contributor

@HollowMan6 HollowMan6 commented May 9, 2025

Description

Let's not call `is_fp8_fnuz` outside any function, as that will call `torch.cuda.get_device_capability` and lazily initialize the CUDA environment:
https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

The current approach then makes any `CUDA/HIP/ROCR_VISIBLE_DEVICES` changes ineffective after import:
pytorch/pytorch#141678
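
The fix can be sketched as follows. This is a hedged illustration, not the actual TransformerEngine code: `query_device` is a stand-in for `torch.cuda.get_device_capability`, and the call counter only exists to make the evaluation order observable.

```python
# Hedged sketch of the fix, not the actual TransformerEngine code.
# `query_device` stands in for torch.cuda.get_device_capability.
calls = {"n": 0}

def query_device():
    calls["n"] += 1
    return (9, 4)  # pretend device capability

# Before: the check runs as a side effect of merely importing the module,
# which lazily initializes the CUDA/ROCm environment.
IS_FP8_FNUZ_EAGER = query_device() == (9, 4)

# After: wrapped in a lambda, the check runs only when first invoked,
# so importing the module no longer touches the device.
is_fp8_fnuz_lazy = lambda: query_device() == (9, 4)

assert calls["n"] == 1       # only the eager version has queried the "device"
assert is_fp8_fnuz_lazy()    # the deferred query happens here
assert calls["n"] == 2
```

With the lazy form, `*_VISIBLE_DEVICES` can still be changed between import and the first real call.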

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Wrap all module-level `is_fp8_fnuz` checks (previously evaluated at import time) inside lambda functions so the device query is deferred.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@ipanfilo ipanfilo merged commit 864405c into ROCm:dev May 14, 2025
5 checks passed
eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, and also fix AMD
support.

### High-Level Design

The current approach for supporting AMD in verl is fundamentally incorrect
and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized has
no effect (see pytorch/pytorch#141678), which means that the code
currently wrapped inside `[SUPPORT AMD: torch]` is mostly a no-op.
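
The failure mode can be simulated in pure Python (no GPU needed). This hypothetical `FakeCudaRuntime` mimics the one-shot snapshot behavior described above: the environment variable is read exactly once, at first lazy initialization, so later changes are silently ignored.

```python
import os

# Hypothetical simulation of why post-init env changes are no-ops:
# the runtime snapshots *_VISIBLE_DEVICES once, at first initialization.
class FakeCudaRuntime:
    def __init__(self):
        self._visible = None  # not yet initialized

    def _lazy_init(self):
        if self._visible is None:
            # Snapshot taken exactly once, like cuInit() reading the env.
            self._visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")

    def device_count(self):
        self._lazy_init()
        return len([d for d in self._visible.split(",") if d])

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
rt = FakeCudaRuntime()
assert rt.device_count() == 2             # initialization happens here

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # changed after init...
assert rt.device_count() == 2             # ...and silently ignored
```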

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` at import time, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES at runtime, so
those `torch.cuda.*` calls poison the current process whenever the
CUDA/ROCm environment is initialized before the manipulation happens.

So a good solution is to use a single, hardware-agnostic environment
variable (`CUDA_VISIBLE_DEVICES`) for everything, mapping all the other
`*_VISIBLE_DEVICES` onto it. Note that we must be careful when both the
HIP/CUDA and ROCR env vars are set, as they have different meanings. Both
accept either a list of ints or a list of UUIDs, but the ROCR env var is
processed first, which reduces the set of GPUs that HIP can then select
from (referring to pytorch/pytorch#144026). To avoid this complexity, we
simply raise an error if both are set (also keeping consistency with
ray's behavior as of 2.45.0).
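
A minimal sketch of that policy, with a hypothetical helper name (not the actual verl function):

```python
import os

# Hypothetical helper mirroring the policy above: refuse to guess when
# both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set, since ROCR
# is applied first and renumbers the devices HIP can then select from.
def check_rocm_visibility_env() -> None:
    if "HIP_VISIBLE_DEVICES" in os.environ and "ROCR_VISIBLE_DEVICES" in os.environ:
        raise ValueError(
            "Both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set; "
            "their semantics differ, so please set only one of them."
        )

# Usage: with both variables set, the helper rejects the configuration.
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["ROCR_VISIBLE_DEVICES"] = "0"
try:
    check_rocm_visibility_env()
    raised = False
except ValueError:
    raised = True
assert raised
```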

For the poisoning issue, until those 2 PRs are merged, we will need to
ask users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES` so that ray no longer
manipulates these variables, and to make verl work when there is no
`*_VISIBLE_DEVICES`.
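
The interim workaround can be sketched as a launch wrapper (the commented launch command is a hypothetical placeholder, not a documented verl entry point):

```shell
# Workaround sketch: keep Ray from rewriting the ROCm visibility
# variables before launching training.
export RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1
export RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES=1
# python -m verl.trainer.main_ppo ...   # hypothetical launch command
```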

Note that for the latest ray (after its switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both the Megatron and FSDP backends with vLLM.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
@HollowMan6 HollowMan6 mentioned this pull request Jun 28, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026