
[core][rocm] Allow CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES #52794

Closed
vickytsang wants to merge 0 commits into ray-project:master from
ROCm:amd/fix-issue-52701

Conversation

@vickytsang
Contributor

Why are these changes needed?

Both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES should be allowed, provided that:

  • Only one of the two environment variables is set, or
  • If both are set, they have the same value.
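This rule can be sketched as a small standalone helper (a hypothetical illustration for this discussion; the function name is an assumption, not Ray's actual implementation):

```python
import os
from typing import Optional


def resolve_visible_devices() -> Optional[str]:
    """Return the visible-device list, honoring either env var.

    Mirrors the rule above: one of the two variables may be set on
    its own; if both are set, they must agree.
    """
    cuda = os.environ.get("CUDA_VISIBLE_DEVICES")
    hip = os.environ.get("HIP_VISIBLE_DEVICES")
    if cuda is not None and hip is not None and cuda != hip:
        raise ValueError(
            "CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are both set "
            f"but differ: {cuda!r} vs {hip!r}"
        )
    # Either may be unset; prefer whichever is present.
    return hip if hip is not None else cuda
```

With only one variable set, or both set to the same value, the helper returns that value; mismatched values raise an error.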

Related issue number

Closes #52701

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@vickytsang
Contributor Author

AMD GPU test results


Member

@kevin85421 kevin85421 left a comment


[outdated] Have you manually tested it?

[updated] I saw your screenshot.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label May 5, 2025
@kevin85421
Member

Please ping me when all CI tests pass.

Contributor


I think we would also need to fetch the device IDs from CUDA_VISIBLE_DEVICES_ENV_VAR instead of HIP_VISIBLE_DEVICES_ENV_VAR when only CUDA_VISIBLE_DEVICES_ENV_VAR is set, since the code here fetches only from HIP_VISIBLE_DEVICES_ENV_VAR:

def get_visible_accelerator_ids_env_var() -> str:
    return HIP_VISIBLE_DEVICES_ENV_VAR

@jjyao
Contributor

jjyao commented May 6, 2025

@vickytsang could you rebase with master? The failed test is fixed.

@vickytsang
Contributor Author

results from added pytests


eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request May 7, 2025
## Checklist Before Starting

- [x] Search for similar PR(s). 

## What does this PR do?

1. Base Docker Image: Upgraded the base sglang docker to
`lmsysorg/sglang:v0.4.6.post1-rocm630` along with `torch_memory_saver
(hip version)`, which resolves the ROCm/aiter compatibility
[issue](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/amd-verl-dev/dev.md).

2. vLLM-0.6.3 Rollout Fix: Adjusted the rollout logic so that the
latest VeRL upstream codebase remains compatible with both `vLLM
versions <= 0.6.3` (with the sync mechanism) and `vLLM versions >=
0.6.3` (with the async mechanism).

3. Update the ray version to
[2.45.0](https://github.com/ray-project/ray/releases/tag/ray-2.45.0):
[PR#52794](ray-project/ray#52794) and also
support `ray>=2.45.0` within verl - resolve
[verl-issues#1399](#1399).

- [To-do-1] 3rd party lib - `torch_memory_saver` - rocm virtual memory
allocator issue should be resolved within the [HIP
version](fzyzcjy/torch_memory_saver#9).
- [To-do-2]  New PR for hardware-agnostic vllm/sglang rollout.


## Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting)
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
@jjyao
Contributor

jjyao commented May 7, 2025

@vickytsang I think the PR is corrupted: +3,506 −1,400

@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 5499386 to c6d00d3 Compare May 7, 2025 17:32
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025
@aslonnie aslonnie removed request for a team May 8, 2025 05:28
@hongpeng-guo
Contributor

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

@vickytsang
Contributor Author

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

I will do some tests on recent versions of pytorch today.

eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, also Fix AMD
support

### High-Level Design

The current approach for supporting AMD in verl is fundamentally
incorrect and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized
will not take effect (please check
pytorch/pytorch#141678), which means that all
the current code wrapped inside `[SUPPORT AMD: torch]` is mostly a
no-op.

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` during import, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES at runtime,
which causes those `torch.cuda.*` calls to poison the current process
if the CUDA/ROCm environment is initialized before the manipulation
happens.

So it would be a good solution to use only one environment variable
(`CUDA_VISIBLE_DEVICES`) for consistency and hardware agnosticism, and
map all the other `*_VISIBLE_DEVICES` onto the CUDA one. Note that we
must pay attention when both the HIP/CUDA and ROCR env vars are set,
as they have different meanings. Both env vars accept either a list of
ints or a list of UUIDs. The ROCR env var is processed first, which
then reduces the number of GPUs that HIP can select from (referring to
pytorch/pytorch#144026). To avoid this complexity, we simply raise an
error if both are set (also to keep consistency with ray's practice as
of 2.45.0).
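The "error out if both are set" guard described above could look like this (a hypothetical sketch; verl's actual helper names and error type may differ):

```python
import os


def check_hip_rocr_conflict() -> None:
    """Raise if both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set.

    ROCR_VISIBLE_DEVICES is processed first and shrinks the pool of GPUs
    that HIP_VISIBLE_DEVICES can then select from, so allowing both is
    ambiguous; erroring out sidesteps that complexity.
    """
    if ("HIP_VISIBLE_DEVICES" in os.environ
            and "ROCR_VISIBLE_DEVICES" in os.environ):
        raise RuntimeError(
            "Both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set; "
            "please unset one of them."
        )
```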

For the poisoning issue, before those 2 PRs are merged, we will need to
ask the users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer
manipulates these variables, and make verl workable when there is no
`*_VISIBLE_DEVICES`.

Note that for latest ray (after their switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both megatron and fsdp backends with vllm.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@vickytsang
Contributor Author

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

I will do some tests on recent versions of pytorch today.

@hongpeng-guo No blockers. Please merge.

@hongpeng-guo
Contributor

@kevin85421 cc @HollowMan6 pls free to merge if there are no more blockers, thx!

Contributor


I don't think we should change the value of HIP_VISIBLE_DEVICES_ENV_VAR. It should be treated as a constant. Instead, can we just:

@staticmethod
def get_visible_accelerator_ids_env_var() -> str:
    hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None)
    cuda_val = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None)
    if hip_val is None and cuda_val is not None:
        return CUDA_VISIBLE_DEVICES_ENV_VAR
    else:
        return HIP_VISIBLE_DEVICES_ENV_VAR

Contributor


Can we raise ValueError instead, since it's not a program bug but a user error?

@jjyao jjyao assigned jjyao and unassigned kevin85421 Jun 3, 2025
Member


This seems to be ValueError.

Member


Is this file running in the CI? Maybe you need to update related Bazel files.

Member


This function is weird. It calls os.environ.get to determine which environment variable name to return; then that name is used in another os.environ.get call:

os.environ.get(
    AMDGPUAcceleratorManager.get_visible_accelerator_ids_env_var(), None)

@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 5d193c2 to 1c4edac Compare June 3, 2025 20:51
@vickytsang
Contributor Author

I changed the implementation slightly: HIP_VISIBLE_DEVICES must now be set to the same value as CUDA_VISIBLE_DEVICES whenever CUDA_VISIBLE_DEVICES is not None, rather than using CUDA_VISIBLE_DEVICES as an alternate environment variable.
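In other words, the final behavior amounts to an equality check along these lines (a hypothetical sketch; the real code lives in Ray's AMD accelerator manager and its names may differ):

```python
import os

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
HIP_VISIBLE_DEVICES_ENV_VAR = "HIP_VISIBLE_DEVICES"


def validate_hip_matches_cuda() -> None:
    """If CUDA_VISIBLE_DEVICES is set, HIP_VISIBLE_DEVICES must be
    unset or hold exactly the same value."""
    cuda_val = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR)
    hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR)
    if cuda_val is not None and hip_val is not None and cuda_val != hip_val:
        raise ValueError(
            f"Inconsistent device lists: {CUDA_VISIBLE_DEVICES_ENV_VAR}="
            f"{cuda_val!r} vs {HIP_VISIBLE_DEVICES_ENV_VAR}={hip_val!r}"
        )
```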


@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch 2 times, most recently from 28e40ec to 98e7dad Compare June 3, 2025 21:25
@vickytsang vickytsang closed this Jun 3, 2025
@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 98e7dad to f53a0c9 Compare June 3, 2025 21:49
@vickytsang
Contributor Author

@kevin85421 @jjyao @hongpeng-guo @HollowMan6 Sorry I messed up the rebase. Please see #53531

yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests), stability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core][ROCm] Setting CUDA_VISIBLE_DEVICES leads to an assertion

7 participants