
[core][rocm] Allow CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES #52794

Closed
vickytsang wants to merge 0 commits into ray-project:master from
ROCm:amd/fix-issue-52701

Conversation

@vickytsang
Contributor

Why are these changes needed?

Both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES should be allowed, provided that:

  • Only one of the two environment variables is set, or
  • If both are set, they have the same value.
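This rule can be sketched as a small standalone helper (a hypothetical illustration for this discussion; the function name is an assumption, not Ray's actual implementation):

```python
import os
from typing import Optional


def resolve_visible_devices() -> Optional[str]:
    """Return the visible-device list, honoring either env var.

    Mirrors the rule above: one of the two variables may be set on
    its own; if both are set, they must agree.
    """
    cuda = os.environ.get("CUDA_VISIBLE_DEVICES")
    hip = os.environ.get("HIP_VISIBLE_DEVICES")
    if cuda is not None and hip is not None and cuda != hip:
        raise ValueError(
            "CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are both set "
            f"but differ: {cuda!r} vs {hip!r}"
        )
    # Either may be unset; prefer whichever is present.
    return hip if hip is not None else cuda
```

With only one variable set, or both set to the same value, the helper returns that value; mismatched values raise an error.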

Related issue number

Closes #52701

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@vickytsang
Contributor Author

AMD GPU test results


Member

@kevin85421 kevin85421 left a comment


[outdated] Have you manually tested it?

[updated] I saw your screenshot.

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label May 5, 2025
@kevin85421
Member

Please ping me when all CI tests pass.

Contributor


I think we would also need to fetch the device IDs from CUDA_VISIBLE_DEVICES_ENV_VAR instead of HIP_VISIBLE_DEVICES_ENV_VAR when only CUDA_VISIBLE_DEVICES_ENV_VAR is set, since the code here fetches only from HIP_VISIBLE_DEVICES_ENV_VAR:

def get_visible_accelerator_ids_env_var() -> str:
    return HIP_VISIBLE_DEVICES_ENV_VAR

@jjyao
Contributor

jjyao commented May 6, 2025

@vickytsang could you rebase with master? The failed test is fixed.

@vickytsang
Contributor Author

results from added pytests


eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request May 7, 2025
## Checklist Before Starting

- [x] Search for similar PR(s). 

## What does this PR do?

1. Base Docker Image: Upgraded the base sglang docker to
`lmsysorg/sglang:v0.4.6.post1-rocm630` along with `torch_memory_saver
(hip version)`, which resolves the ROCm/aiter compatibility
[issue](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/verl/amd-verl-dev/dev.md).

2. vLLM-0.6.3 Rollout Fix: Adjusted the rollout logic so that the
latest VeRL upstream codebase remains compatible with both `vLLM
versions <= 0.6.3` (with the sync mechanism) and `vLLM versions >=
0.6.3` (with the async mechanism).

3. Update the ray version to
[2.45.0](https://github.com/ray-project/ray/releases/tag/ray-2.45.0):
[PR#52794](ray-project/ray#52794) and also
support `ray>=2.45.0` within verl - resolve
[verl-issues#1399](#1399).

- [To-do-1] 3rd party lib - `torch_memory_saver` - rocm virtual memory
allocator issue should be resolved within the [HIP
version](fzyzcjy/torch_memory_saver#9).
- [To-do-2]  New PR for hardware-agnostic vllm/sglang rollout.


## Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide)
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting)
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Co-authored-by: Yusheng Su <yushensu@pduks-slu000010.amd.com>
@jjyao
Contributor

jjyao commented May 7, 2025

@vickytsang I think the PR is corrupted: +3,506 −1,400

@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 5499386 to c6d00d3 Compare May 7, 2025 17:32
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label May 7, 2025
@aslonnie aslonnie removed request for a team May 8, 2025 05:28
@hongpeng-guo
Contributor

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

@vickytsang
Contributor Author

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

I will do some tests on recent versions of pytorch today.

eric-haibin-lin pushed a commit to verl-project/verl that referenced this pull request Jun 2, 2025
…Fix AMD support) (#1465)

### Checklist Before Starting

- [X] Search for similar PR(s).

### What does this PR do?

Add support for RAY_EXPERIMENTAL_NOSET_*_VISIBLE_DEVICES, also Fix AMD
support

### High-Level Design

The current approach for supporting AMD in verl is fundamentally
incorrect and only works by luck:

Calls such as `torch.cuda.is_available()` or
`torch.cuda.get_device_name()` will initialize the CUDA/ROCm
environment:

https://github.com/pytorch/pytorch/blob/c65ee728f069ea9544bdcac815eb0825f45d1633/torch/cuda/__init__.py#L342-L392

Setting CUDA/HIP/ROCR_VISIBLE_DEVICES after CUDA/ROCm is initialized
will not take effect (please check
pytorch/pytorch#141678), which means that all
the current code wrapped inside `[SUPPORT AMD: torch]` is mostly a
no-op.

CUDA_VISIBLE_DEVICES also works for AMD, but only because a lot of
AMD-migrated software calls `torch.cuda.*` during import, e.g.:

- ROCm/TransformerEngine#183
- vllm-project/vllm#15246

Meanwhile, ray/vllm manipulate those *_VISIBLE_DEVICES at runtime,
which causes those `torch.cuda.*` calls to poison the current process
if the CUDA/ROCm environment is initialized before the manipulation
happens.

So it would be a good solution to use only one environment variable
(`CUDA_VISIBLE_DEVICES`) for consistency and hardware agnosticism, and
map all the other `*_VISIBLE_DEVICES` onto the CUDA one. Note that we
must pay attention when both the HIP/CUDA and ROCR env vars are set,
as they have different meanings. Both env vars accept either a list of
ints or a list of UUIDs. The ROCR env var is processed first, which
then reduces the number of GPUs that HIP can select from (referring to
pytorch/pytorch#144026). To avoid this complexity, we simply raise an
error if both are set (also to keep consistency with ray's practice as
of 2.45.0).
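The "error out if both are set" guard described above could look like this (a hypothetical sketch; verl's actual helper names and error type may differ):

```python
import os


def check_hip_rocr_conflict() -> None:
    """Raise if both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set.

    ROCR_VISIBLE_DEVICES is processed first and shrinks the pool of GPUs
    that HIP_VISIBLE_DEVICES can then select from, so allowing both is
    ambiguous; erroring out sidesteps that complexity.
    """
    if ("HIP_VISIBLE_DEVICES" in os.environ
            and "ROCR_VISIBLE_DEVICES" in os.environ):
        raise RuntimeError(
            "Both HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES are set; "
            "please unset one of them."
        )
```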

For the poisoning issue, before those 2 PRs are merged, we will need to
ask the users to set `RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES` or
`RAY_EXPERIMENTAL_NOSET_HIP_VISIBLE_DEVICES`, so that ray no longer
manipulates these variables, and make verl workable when there is no
`*_VISIBLE_DEVICES`.

Note that for latest ray (after their switch to `HIP_VISIBLE_DEVICES`),
we also need this patch: ray-project/ray#52794

### Test

Tested manually on both megatron and fsdp backends with vllm.

### Additional Info.

- **Issue Number**: none
- **Training**: both FSDP and Megatron
- **Inference**: both vLLM and SGLang

### Checklist Before Submitting

- [X] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title if it breaks any API.
- [X] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add CI test(s) if necessary.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
@vickytsang
Contributor Author

@vickytsang May I know if there are any blockers on merging this PR? It seems a few oss projects for AMD are depending on this PR, thx.

I will do some tests on recent versions of pytorch today.

@hongpeng-guo No blockers. Please merge.

@hongpeng-guo
Contributor

@kevin85421 cc @HollowMan6 pls free to merge if there are no more blockers, thx!

Contributor


I don't think we should change the value of HIP_VISIBLE_DEVICES_ENV_VAR. It should be treated as a constant. Instead, can we just:

@staticmethod
def get_visible_accelerator_ids_env_var() -> str:
    hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None)
    cuda_val = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None)
    if hip_val is None and cuda_val is not None:
        return CUDA_VISIBLE_DEVICES_ENV_VAR
    else:
        return HIP_VISIBLE_DEVICES_ENV_VAR

Contributor


Can we raise ValueError instead, since it's not a program bug but a user error?

@jjyao jjyao assigned jjyao and unassigned kevin85421 Jun 3, 2025
Member


This seems to be ValueError.

Member


Is this file running in the CI? Maybe you need to update related Bazel files.

Member


This function is weird. It calls os.environ.get to determine which environment variable name to return; then that name is used in another os.environ.get call:

os.environ.get(
    AMDGPUAcceleratorManager.get_visible_accelerator_ids_env_var(), None)

@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 5d193c2 to 1c4edac Compare June 3, 2025 20:51
@vickytsang
Contributor Author

I changed the implementation slightly: HIP_VISIBLE_DEVICES must now be set to the same value as CUDA_VISIBLE_DEVICES whenever CUDA_VISIBLE_DEVICES is not None, rather than using CUDA_VISIBLE_DEVICES as an alternate environment variable.
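In other words, the final behavior amounts to an equality check along these lines (a hypothetical sketch; the real code lives in Ray's AMD accelerator manager and its names may differ):

```python
import os

CUDA_VISIBLE_DEVICES_ENV_VAR = "CUDA_VISIBLE_DEVICES"
HIP_VISIBLE_DEVICES_ENV_VAR = "HIP_VISIBLE_DEVICES"


def validate_hip_matches_cuda() -> None:
    """If CUDA_VISIBLE_DEVICES is set, HIP_VISIBLE_DEVICES must be
    unset or hold exactly the same value."""
    cuda_val = os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR)
    hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR)
    if cuda_val is not None and hip_val is not None and cuda_val != hip_val:
        raise ValueError(
            f"Inconsistent device lists: {CUDA_VISIBLE_DEVICES_ENV_VAR}="
            f"{cuda_val!r} vs {HIP_VISIBLE_DEVICES_ENV_VAR}={hip_val!r}"
        )
```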


@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch 2 times, most recently from 28e40ec to 98e7dad Compare June 3, 2025 21:25
@vickytsang vickytsang closed this Jun 3, 2025
@vickytsang vickytsang force-pushed the amd/fix-issue-52701 branch from 98e7dad to f53a0c9 Compare June 3, 2025 21:49
@vickytsang
Contributor Author

@kevin85421 @jjyao @hongpeng-guo @HollowMan6 Sorry I messed up the rebase. Please see #53531

yzlnew pushed a commit to yzlnew/verl that referenced this pull request Jun 4, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 6, 2025
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests), stability

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core][ROCm] Setting CUDA_VISIBLE_DEVICES leads to an assertion

7 participants