[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES#53531
[core][rocm] Allow CUDA_VISIBLE_DEVICS and HIP_VISIBLE_DEVICES#53531jjyao merged 12 commits intoray-project:masterfrom
Conversation
Signed-off-by: Vicky Tsang <vtsang@amd.com>
| hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None) | ||
| if cuda_val := os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None): | ||
| if hip_val is not None: | ||
| if hip_val != cuda_val: | ||
| raise ValueError( | ||
| f"Inconsistant values found. Please use either {HIP_VISIBLE_DEVICES_ENV_VAR} or {CUDA_VISIBLE_DEVICES_ENV_VAR}." | ||
| ) | ||
| else: | ||
| # if CUDA_VISIBLE_DEVCIES is not None, it must have the same value as HIP_VISIBLE_DEVICES | ||
| os.environ[HIP_VISIBLE_DEVICES_ENV_VAR] = cuda_val | ||
|
|
There was a problem hiding this comment.
Actually do we need to change this function at all? I feel we only need to change get_current_process_visible_accelerator_ids to consider CUDA_VISIBLE_DEVICES_ENV_VAR
There was a problem hiding this comment.
What are the typical use cases for HIP_VISIBLE_DEVICES_ENV_VAR and CUDA_VISIBLE_DEVIES_ENV_VAR? It seems like it's possible to explicitly HIP_VISIBLE_DEVICES_ENV_VAR without calling this function. If we return CUDA_VISIBLE_DEVICES_ENV_VAR and allow explicit usage of HIP_VISBILE_DEVICES_ENV_VAR, it can introduce inconsistencies.
There was a problem hiding this comment.
Could you elaborate. What will be the issue if this function always returns HIP_VISIBLE_DEVICES_ENV_VAR as before?
There was a problem hiding this comment.
It looks like Ray is able to correctly identify the correct number of device in either case with your suggestion as show in the screenshot below. If other reviewers (@HollowMan6 @kevin85421) have no concerns, this function can always return HIP_VISIBLE_DEVICES_ENV_VAR

There was a problem hiding this comment.
@jjyao is there anything else I can do to move this PR forward?
There was a problem hiding this comment.
I think this function is supposed to be a pure getter function that doesn't modify os.environ.
At high level should it be:
def get_current_process_visible_accelerator_ids():
// decide which env var to use
env_var = HIP_VISIBLE_DEVICE
if CUDA_VISIBLE_DEVICE is not None:
if HIP_VISIBLE_DEVICE is not None:
// make sure the value is the same
else:
env_var = CUDA_VISIBLE_DEVICE
return os.envion[env_var].split(",")
There was a problem hiding this comment.
@vickytsang does the above comment make sense to you?
There was a problem hiding this comment.
@jjyao yes thank you. Please standby. I will push an other commit in the next hour.
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com>
| @@ -0,0 +1,89 @@ | |||
| import os | |||
There was a problem hiding this comment.
we can merge this with the existing test_amd_gpu.py
There was a problem hiding this comment.
would you like the test_amd_gpu1.py to be removed from this PR?
There was a problem hiding this comment.
Yes. Add the tests in test_amd_gpu1.py to test_amd_gpu.py. I think some tests can be combined with pytest.param()
Signed-off-by: root <vtsang@amd.com>
| @@ -0,0 +1,89 @@ | |||
| import os | |||
| "ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators", # noqa: E501 | ||
| return_value=4, | ||
| ) | ||
| def test_cuda_visible_amd_gpu_ids( |
There was a problem hiding this comment.
This can be combined with test_visible_amd_gpu_ids using pytest parametrization. You can do something like
@pytest.mark.parametrize("visible_devices_env_var", ("HIP_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"))
@patch(
"ray._private.accelerators.AMDGPUAcceleratorManager.get_current_node_num_accelerators", # noqa: E501
return_value=4,
)
def test_visible_amd_gpu_ids(mock_get_num_accelerators, visible_devices_env_var, monkeypatch, shutdown_only):
monkeypatch.setenv(visible_devices_env_var, "0,1,2")
# Delete the cache so it can be re-populated the next time
# we call get_accelerator_manager_for_resource
del get_accelerator_manager_for_resource._resource_name_to_accelerator_manager
ray.init()
_ = mock_get_num_accelerators.called
assert ray.available_resources()["GPU"] == 3
so you don't need to repeat most of the tests.
There was a problem hiding this comment.
@jjyao Done. All CI tests are passing. Please review.
Signed-off-by: root <vtsang@amd.com>
Signed-off-by: root <vtsang@amd.com>
…roject#53531) Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com>
Signed-off-by: Vicky Tsang <vtsang@amd.com> Signed-off-by: root <vtsang@amd.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>


Why are these changes needed?
Why are these changes needed?
Both CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES should be allowed if:
Only one of the two environment variables is set.
If both are set, they must have the same value.
Redoing PR #52794
Related issue number
Closes #52701
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.