[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES by vickytsang · Pull Request #53531 · ray-project/ray

vickytsang · 2025-06-03T21:56:06Z

Why are these changes needed?

Why are these changes needed?
Both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES should be allowed if:

Only one of the two environment variables is set.
If both are set, they must have the same value.

Redoing PR #52794

Related issue number

Closes #52701

Checks

[x ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
[x ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- [x ] Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Vicky Tsang <vtsang@amd.com>

jjyao · 2025-06-03T22:03:17Z

python/ray/_private/accelerators/amd_gpu.py

+        hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None)
+        if cuda_val := os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None):
+            if hip_val is not None:
+                if hip_val != cuda_val:
+                    raise ValueError(
+                        f"Inconsistant values found. Please use either {HIP_VISIBLE_DEVICES_ENV_VAR} or {CUDA_VISIBLE_DEVICES_ENV_VAR}."
+                    )
+            else:
+                # if CUDA_VISIBLE_DEVCIES is not None, it must have the same value as HIP_VISIBLE_DEVICES
+                os.environ[HIP_VISIBLE_DEVICES_ENV_VAR] = cuda_val
+


Actually do we need to change this function at all? I feel we only need to change get_current_process_visible_accelerator_ids to consider CUDA_VISIBLE_DEVICES_ENV_VAR

What are the typical use cases for HIP_VISIBLE_DEVICES_ENV_VAR and CUDA_VISIBLE_DEVIES_ENV_VAR? It seems like it's possible to explicitly HIP_VISIBLE_DEVICES_ENV_VAR without calling this function. If we return CUDA_VISIBLE_DEVICES_ENV_VAR and allow explicit usage of HIP_VISBILE_DEVICES_ENV_VAR, it can introduce inconsistencies.

Could you elaborate. What will be the issue if this function always returns HIP_VISIBLE_DEVICES_ENV_VAR as before?

It looks like Ray is able to correctly identify the correct number of device in either case with your suggestion as show in the screenshot below. If other reviewers (@HollowMan6 @kevin85421) have no concerns, this function can always return HIP_VISIBLE_DEVICES_ENV_VAR

@jjyao is there anything else I can do to move this PR forward?

I think this function is supposed to be a pure getter function that doesn't modify os.environ.

At high level should it be:

def get_current_process_visible_accelerator_ids(): // decide which env var to use env_var = HIP_VISIBLE_DEVICE if CUDA_VISIBLE_DEVICE is not None: if HIP_VISIBLE_DEVICE is not None: // make sure the value is the same else: env_var = CUDA_VISIBLE_DEVICE return os.envion[env_var].split(",")

@vickytsang does the above comment make sense to you?

@jjyao yes thank you. Please standby. I will push an other commit in the next hour.

vickytsang · 2025-06-03T22:03:49Z

pytests

vickytsang · 2025-06-03T22:04:49Z

ValueError when CUDA_VISIBLE_DEVICES != HIP_VISIBLE_DEVICES

Signed-off-by: Vicky Tsang <vtsang@amd.com>

jjyao · 2025-06-17T18:10:35Z

python/ray/tests/accelerators/test_amd_gpu1.py

@@ -0,0 +1,89 @@
+import os


we can merge this with the existing test_amd_gpu.py

would you like the test_amd_gpu1.py to be removed from this PR?

Yes. Add the tests in test_amd_gpu1.py to test_amd_gpu.py. I think some tests can be combined with pytest.param()

Signed-off-by: root <vtsang@amd.com>

jjyao · 2025-06-18T17:31:56Z

python/ray/tests/accelerators/test_amd_gpu1.py

@@ -0,0 +1,89 @@
+import os


This file can be removed

@vickytsang can you delete this file?

jjyao · 2025-06-18T17:36:58Z

python/ray/tests/accelerators/test_amd_gpu.py

+    "ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators",  # noqa: E501
+    return_value=4,
+)
+def test_cuda_visible_amd_gpu_ids(


This can be combined with test_visible_amd_gpu_ids using pytest parametrization. You can do something like

@pytest.mark.parametrize("visible_devices_env_var", ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")) @patch( "ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators", # noqa: E501 return_value=4, ) def test_visible_amd_gpu_ids(mock_get_num_accelerators, visible_devices_env_var, monkeypatch, shutdown_only): monkeypatch.setenv(visible_devices_env_var, "0,1,2") # Delete the cache so it can be re-populated the next time # we call get_accelerator_manager_for_resource del get_accelerator_manager_for_resource._resource_name_to_accelerator_manager ray.init() _ = mock_get_num_accelerators.called assert ray.available_resources()["GPU"] == 3

so you don't need to repeat most of the tests.

@jjyao Done. All CI tests are passing. Please review.

Signed-off-by: root <vtsang@amd.com>

…roject#53531) Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com>

Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES

00ff369

Signed-off-by: Vicky Tsang <vtsang@amd.com>

vickytsang mentioned this pull request Jun 3, 2025

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES #52794

Closed

8 tasks

jjyao reviewed Jun 3, 2025

View reviewed changes

vickytsang added 7 commits June 3, 2025 22:24

lint

ee3b3ea

Signed-off-by: Vicky Tsang <vtsang@amd.com>

return HIP_VISIBLE_DEVICES_ENV_VAR for AMD

35894ef

Signed-off-by: Vicky Tsang <vtsang@amd.com>

Merge branch 'master' into amd/fix-issue-52701-redo

2ac0199

Use CUDA as env var if HIP is not set

171141a

Signed-off-by: Vicky Tsang <vtsang@amd.com>

fix lint

b78f323

Signed-off-by: Vicky Tsang <vtsang@amd.com>

fix lint

ff1ef75

Signed-off-by: Vicky Tsang <vtsang@amd.com>

fix error message

a4279e7

Signed-off-by: Vicky Tsang <vtsang@amd.com>

jjyao added the go add ONLY when ready to merge, run all tests label Jun 17, 2025

jjyao reviewed Jun 17, 2025

View reviewed changes

vickytsang added 2 commits June 17, 2025 15:57

Merge branch 'master' into amd/fix-issue-52701-redo

082c66b

remove test_amd_gpu1.py

0431875

Signed-off-by: root <vtsang@amd.com>

jjyao reviewed Jun 18, 2025

View reviewed changes

vickytsang added 2 commits June 18, 2025 23:40

use pytest parametrization

4e282f7

Signed-off-by: root <vtsang@amd.com>

delete unused test file

40e4b97

Signed-off-by: root <vtsang@amd.com>

jjyao approved these changes Jun 19, 2025

View reviewed changes

jjyao merged commit 5b1f823 into ray-project:master Jun 19, 2025
5 checks passed

minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES (ray-p…

1867146

…roject#53531) Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com>

elliot-barn pushed a commit that referenced this pull request Jul 2, 2025

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES (#53531)

96de356

Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

Conversation

vickytsang commented Jun 3, 2025

Why are these changes needed?

Related issue number

Checks

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vickytsang commented Jun 3, 2025

Uh oh!

vickytsang commented Jun 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jjyao Jun 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jjyao Jun 18, 2025 •

edited

Loading