Skip to content

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES#53531

Merged
jjyao merged 12 commits intoray-project:masterfrom
ROCm:amd/fix-issue-52701-redo
Jun 19, 2025
Merged

[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES#53531
jjyao merged 12 commits intoray-project:masterfrom
ROCm:amd/fix-issue-52701-redo

Conversation

@vickytsang
Copy link
Contributor

Why are these changes needed?

Why are these changes needed?
Both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES should be allowed if:

Only one of the two environment variables is set.
If both are set, they must have the same value.

Redoing PR #52794

Related issue number

Closes #52701

Checks

  • [x ] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • [x ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x ] Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Vicky Tsang <vtsang@amd.com>
Comment on lines +35 to +45
hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None)
if cuda_val := os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None):
if hip_val is not None:
if hip_val != cuda_val:
raise ValueError(
f"Inconsistant values found. Please use either {HIP_VISIBLE_DEVICES_ENV_VAR} or {CUDA_VISIBLE_DEVICES_ENV_VAR}."
)
else:
# if CUDA_VISIBLE_DEVCIES is not None, it must have the same value as HIP_VISIBLE_DEVICES
os.environ[HIP_VISIBLE_DEVICES_ENV_VAR] = cuda_val

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually do we need to change this function at all? I feel we only need to change get_current_process_visible_accelerator_ids to consider CUDA_VISIBLE_DEVICES_ENV_VAR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the typical use cases for HIP_VISIBLE_DEVICES_ENV_VAR and CUDA_VISIBLE_DEVIES_ENV_VAR? It seems like it's possible to explicitly HIP_VISIBLE_DEVICES_ENV_VAR without calling this function. If we return CUDA_VISIBLE_DEVICES_ENV_VAR and allow explicit usage of HIP_VISBILE_DEVICES_ENV_VAR, it can introduce inconsistencies.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate. What will be the issue if this function always returns HIP_VISIBLE_DEVICES_ENV_VAR as before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like Ray is able to correctly identify the correct number of device in either case with your suggestion as show in the screenshot below. If other reviewers (@HollowMan6 @kevin85421) have no concerns, this function can always return HIP_VISIBLE_DEVICES_ENV_VAR
image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjyao is there anything else I can do to move this PR forward?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is supposed to be a pure getter function that doesn't modify os.environ.

At high level should it be:

def get_current_process_visible_accelerator_ids():

// decide which env var to use
env_var = HIP_VISIBLE_DEVICE
if CUDA_VISIBLE_DEVICE is not None:
   if HIP_VISIBLE_DEVICE is not None:
      // make sure the value is the same
   else:
      env_var = CUDA_VISIBLE_DEVICE
      
  return os.envion[env_var].split(",")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vickytsang does the above comment make sense to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjyao yes thank you. Please standby. I will push an other commit in the next hour.

@vickytsang
Copy link
Contributor Author

pytests
image

@vickytsang
Copy link
Contributor Author

ValueError when CUDA_VISIBLE_DEVICES != HIP_VISIBLE_DEVICES
image

Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jun 17, 2025
@@ -0,0 +1,89 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can merge this with the existing test_amd_gpu.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you like the test_amd_gpu1.py to be removed from this PR?

Copy link
Contributor

@jjyao jjyao Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Add the tests in test_amd_gpu1.py to test_amd_gpu.py. I think some tests can be combined with pytest.param()

@@ -0,0 +1,89 @@
import os
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file can be removed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vickytsang can you delete this file?

"ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators", # noqa: E501
return_value=4,
)
def test_cuda_visible_amd_gpu_ids(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be combined with test_visible_amd_gpu_ids using pytest parametrization. You can do something like

@pytest.mark.parametrize("visible_devices_env_var", ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"))
@patch(
    "ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators",  # noqa: E501
    return_value=4,
)
def test_visible_amd_gpu_ids(mock_get_num_accelerators, visible_devices_env_var, monkeypatch, shutdown_only):
    monkeypatch.setenv(visible_devices_env_var, "0,1,2")
    # Delete the cache so it can be re-populated the next time
    # we call get_accelerator_manager_for_resource
    del get_accelerator_manager_for_resource._resource_name_to_accelerator_manager
    ray.init()
    _ = mock_get_num_accelerators.called
    assert ray.available_resources()["GPU"] == 3

so you don't need to repeat most of the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jjyao Done. All CI tests are passing. Please review.

Signed-off-by: root <vtsang@amd.com>
Signed-off-by: root <vtsang@amd.com>
@jjyao jjyao merged commit 5b1f823 into ray-project:master Jun 19, 2025
5 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
…roject#53531)

Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: root <vtsang@amd.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: root <vtsang@amd.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Core][ROCm] Setting CUDA_VISIBLE_DEVICES leads to an assertion

2 participants