Skip to content

Replace AMD device env var with HIP_VISIBLE_DEVICES#51104

Merged
jjyao merged 9 commits intoray-project:masterfrom
ROCm:amd/use-hip-visible-devices
Mar 20, 2025
Merged

Replace AMD device env var with HIP_VISIBLE_DEVICES#51104
jjyao merged 9 commits intoray-project:masterfrom
ROCm:amd/use-hip-visible-devices

Conversation

@vickytsang
Copy link
Copy Markdown
Contributor

Why are these changes needed?

Replace AMD device env var with HIP_VISIBLE_DEVICES to align with PYTORCH semantics

HIP_VISIBLE_DEVICES has the same semantics as CUDA_VISIBLE_DEVICES in PYTORCH. Using ROCR_VISIBLE_DEVICES with can cause unexpected behavior

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: root <vtsang@amd.com>
Copy link
Copy Markdown
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes for core looks good to me. cc @hongpeng-guo would you mind reviewing the changes of Ray Train?

Copy link
Copy Markdown
Contributor

@hongpeng-guo hongpeng-guo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good in general! Could you please show a screen shot or log of the tests on AMD machines? Our CI env doesn't include AMD test for the moment.

Signed-off-by: root <vtsang@amd.com>
@vickytsang
Copy link
Copy Markdown
Contributor Author

envvar_chk.log
pytest.log

@hongpeng-guo hongpeng-guo added the go add ONLY when ready to merge, run all tests label Mar 9, 2025
@hongpeng-guo
Copy link
Copy Markdown
Contributor

hongpeng-guo commented Mar 9, 2025

copy pasting the previous unit test results here for better visibility:

(py_3.9) root@hpe-hq-08:/app/ray# HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_VISIBLE_DEVICES=0,1 python -m pytest -v -s python/ray/tests/accelerators/test_amd_gpu.py
Test session starts (platform: linux, Python 3.9.19, pytest 7.4.4, pytest-sugar 0.9.5)
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/app/ray/.hypothesis/examples')
rootdir: /app/ray
configfile: pytest.ini
plugins: hypothesis-5.35.1, flakefinder-1.1.0, xdist-3.3.1, cpp-2.3.0, xdoctest-1.1.0, lazy-fixture-0.6.3, httpserver-1.0.6, shutil-1.7.0, aiohttp-0.3.0, timeout-2.1.0, anyio-3.7.1, sugar-0.9.5, rerunfailures-11.1.2, sphinx-0.5.1.dev0, virtualenv-1.7.0, docker-tools-3.1.3, asyncio-0.17.0, forked-1.4.0
timeout: 180.0s
timeout method: signal
timeout func_only: False
collecting ... Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:03:57,465 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_ids ✓                                                            20% ██        Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:04:02,442 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_type ✓                                                           40% ████      Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:04:07,082 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_type_bad_device_id ✓                                             60% ██████
 python/ray/tests/accelerators/test_amd_gpu.py::test_get_current_process_visible_accelerator_ids ✓                                    80% ████████
 python/ray/tests/accelerators/test_amd_gpu.py::test_set_current_process_visible_accelerator_ids ✓                                   100% ██████████

Results (14.39s):
       5 passed

cc @vickytsang Could you help me understand why both env vars HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are needed, and how are these two variables affect the visible amd devices for each individual GPU?

@vickytsang
Copy link
Copy Markdown
Contributor Author

copy pasting the previous unit test results here for better visibility:

(py_3.9) root@hpe-hq-08:/app/ray# HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 CUDA_VISIBLE_DEVICES=0,1 python -m pytest -v -s python/ray/tests/accelerators/test_amd_gpu.py
Test session starts (platform: linux, Python 3.9.19, pytest 7.4.4, pytest-sugar 0.9.5)
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/app/ray/.hypothesis/examples')
rootdir: /app/ray
configfile: pytest.ini
plugins: hypothesis-5.35.1, flakefinder-1.1.0, xdist-3.3.1, cpp-2.3.0, xdoctest-1.1.0, lazy-fixture-0.6.3, httpserver-1.0.6, shutil-1.7.0, aiohttp-0.3.0, timeout-2.1.0, anyio-3.7.1, sugar-0.9.5, rerunfailures-11.1.2, sphinx-0.5.1.dev0, virtualenv-1.7.0, docker-tools-3.1.3, asyncio-0.17.0, forked-1.4.0
timeout: 180.0s
timeout method: signal
timeout func_only: False
collecting ... Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:03:57,465 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_ids ✓                                                            20% ██        Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:04:02,442 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_type ✓                                                           40% ████      Usage stats collection is enabled by default for nightly wheels. To disable this, run the following command: `ray disable-usage-stats` before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-03-07 17:04:07,082 INFO worker.py:1842 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265

 python/ray/tests/accelerators/test_amd_gpu.py::test_visible_amd_gpu_type_bad_device_id ✓                                             60% ██████
 python/ray/tests/accelerators/test_amd_gpu.py::test_get_current_process_visible_accelerator_ids ✓                                    80% ████████
 python/ray/tests/accelerators/test_amd_gpu.py::test_set_current_process_visible_accelerator_ids ✓                                   100% ██████████

Results (14.39s):
       5 passed

cc @vickytsang Could you help me understand why both env vars HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES are needed, and how are these two variables affect the visible amd devices for each individual GPU?

master_2b56e60_pytest.log
previous pytest results from master branch attached.

CUDA_VISIBLE_DEVICES should not be needed. I added it to my command line as a way to show that the conflict between HIP_VISIBLE_DEVICES and CUDA_VISIBLE_DEVICES values did not interfere with importing ray. It should not be relevant to the outcome of the pytest results.

if "HIP_VISIBLE_DEVICES" in os.environ:
hip_val = os.environ["HIP_VISIBLE_DEVICES"]
if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None):
logger.warning(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they unequal?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line "assert hip_val == cuda_val" from the previous commit always firing even if they are equal. I used the logger warning as a workaround. Can you suggest a better way to handle this?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding checks outside of functions makes it difficult to determine when they will be executed, especially since users might import packages in different orders.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about verifying it in get_current_process_visible_accelerator_ids?

)
)
else:
os.environ["CUDA_VISIBLE_DEVICES"] = hip_val
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason to set CUDA_VISIBLE_DEVICES? In addition, would you mind using the existing variable instead of hard-coding CUDA_VISIBLE_DEVICES?

@kevin85421 kevin85421 removed the go add ONLY when ready to merge, run all tests label Mar 11, 2025
Copy link
Copy Markdown
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind rebasing with the master branch?

Signed-off-by: vickytsang <vtsang@amd.com>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the changes in this file. We recently decide to move all accelerator-specific constants from ray_constants.py to xxx_xpu.py.

if "HIP_VISIBLE_DEVICES" in os.environ:
hip_val = os.environ["HIP_VISIBLE_DEVICES"]
if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None):
logger.warning(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding checks outside of functions makes it difficult to determine when they will be executed, especially since users might import packages in different orders.

if "HIP_VISIBLE_DEVICES" in os.environ:
hip_val = os.environ["HIP_VISIBLE_DEVICES"]
if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None):
logger.warning(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about verifying it in get_current_process_visible_accelerator_ids?

@vickytsang
Copy link
Copy Markdown
Contributor Author

Test logs from latest commits
pytest_test_0311.log
envvar_test_0311.log

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if they are not equal?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are not equal the assertion on line 46 will fail.

(py3.9) root@quanta-ccs-aus-f11-01:/app/ray/python/ray# HIP_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0,1 python
Python 3.9.21 | packaged by conda-forge | (main, Dec 5 2024, 13:51:40)
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

import ray
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/init.py", line 114, in
from ray._private.worker import ( # noqa: E402,F401
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/worker.py", line 1272, in
global_worker = Worker()
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/worker.py", line 447, in init
ray._private.utils.get_visible_accelerator_ids()
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/utils.py", line 304, in get_visible_accelerator_ids
return {
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/utils.py", line 305, in
accelerator_resource_name: get_accelerator_manager_for_resource(
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/accelerators/amd_gpu.py", line 46, in get_current_process_visible_accelerator_ids
assert hip_val == cuda_val
AssertionError

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry—I'm not referring to this PR. I'm referring to what will happen to the GPUs or GPU drivers if CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are set. For example, the wrong GPUs might be used.

That is, I don't know why they should be the same if CUDA_VISIBLE_DEVICES is set.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICE can have different effects when used together with various applications and frameworks. We want to be certain of the users intent in setting these variables. Perhaps a warning would be more appropriate?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Assertion is better for me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ray_constants doesn't have HIP_VISIBLE_DEVICES_ENV_VAR.

@vickytsang vickytsang force-pushed the amd/use-hip-visible-devices branch from c24e63c to 0d0c305 Compare March 17, 2025 17:04
vickytsang and others added 4 commits March 17, 2025 10:05
Signed-off-by: root <root@quanta-ccs-aus-f11-01.cs-aus.dcgpu>
Signed-off-by: vickytang <vtsang@amd.com>
Copy link
Copy Markdown
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me


hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None)
if cuda_val := os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None):
assert hip_val == cuda_val
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would you mind providing an actionable error message?

Signed-off-by: root <vtsang@amd.com>
@kevin85421
Copy link
Copy Markdown
Member

cc @justinvyu @matthewdeng, would you mind reviewing the changes in Ray Train? It's just a small change. Thanks!

Copy link
Copy Markdown
Contributor

@hongpeng-guo hongpeng-guo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Mar 19, 2025
@kevin85421
Copy link
Copy Markdown
Member

This PR doesn't update doc, so I think the doc CI failure is unrelated. @edoakes @jjyao would you mind reviewing or merging this PR? Thanks!

@jjyao jjyao merged commit 3b9e729 into ray-project:master Mar 20, 2025
5 of 6 checks passed
dhakshin32 pushed a commit to dhakshin32/ray that referenced this pull request Mar 27, 2025
…ct#51104)

Signed-off-by: root <vtsang@amd.com>
Signed-off-by: vickytsang <vtsang@amd.com>
Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
hijkzzz pushed a commit to OpenRLHF/OpenRLHF that referenced this pull request Apr 2, 2025
Now ray uses HIP_VISIBLE_DEVICES instead of ROCR_VISIBLE_DEVICES
in recent changes for AMD devices.
ray-project/ray#51104

This PR integrates such changes here.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Levi080513 pushed a commit to neutree-ai/ray that referenced this pull request Jul 10, 2025
…ct#51104)

Signed-off-by: root <vtsang@amd.com>
Signed-off-by: vickytsang <vtsang@amd.com>
LXXXXR pushed a commit to LXXXXR/AESL that referenced this pull request Feb 15, 2026
Now ray uses HIP_VISIBLE_DEVICES instead of ROCR_VISIBLE_DEVICES
in recent changes for AMD devices.
ray-project/ray#51104

This PR integrates such changes here.

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-backlog go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants