Replace AMD device env var with HIP_VISIBLE_DEVICES#51104
Replace AMD device env var with HIP_VISIBLE_DEVICES#51104jjyao merged 9 commits intoray-project:masterfrom
Conversation
Signed-off-by: root <vtsang@amd.com>
Signed-off-by: root <vtsang@amd.com>
kevin85421
left a comment
There was a problem hiding this comment.
The changes for core looks good to me. cc @hongpeng-guo would you mind reviewing the changes of Ray Train?
hongpeng-guo
left a comment
There was a problem hiding this comment.
The code looks good in general! Could you please show a screen shot or log of the tests on AMD machines? Our CI env doesn't include AMD test for the moment.
Signed-off-by: root <vtsang@amd.com>
|
copy pasting the previous unit test results here for better visibility: cc @vickytsang Could you help me understand why both env vars |
master_2b56e60_pytest.log
|
| if "HIP_VISIBLE_DEVICES" in os.environ: | ||
| hip_val = os.environ["HIP_VISIBLE_DEVICES"] | ||
| if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None): | ||
| logger.warning( |
There was a problem hiding this comment.
The line "assert hip_val == cuda_val" from the previous commit always firing even if they are equal. I used the logger warning as a workaround. Can you suggest a better way to handle this?
There was a problem hiding this comment.
Adding checks outside of functions makes it difficult to determine when they will be executed, especially since users might import packages in different orders.
There was a problem hiding this comment.
How about verifying it in get_current_process_visible_accelerator_ids?
| ) | ||
| ) | ||
| else: | ||
| os.environ["CUDA_VISIBLE_DEVICES"] = hip_val |
There was a problem hiding this comment.
what's the reason to set CUDA_VISIBLE_DEVICES? In addition, would you mind using the existing variable instead of hard-coding CUDA_VISIBLE_DEVICES?
kevin85421
left a comment
There was a problem hiding this comment.
would you mind rebasing with the master branch?
Signed-off-by: vickytsang <vtsang@amd.com>
python/ray/_private/ray_constants.py
Outdated
There was a problem hiding this comment.
Remove the changes in this file. We recently decide to move all accelerator-specific constants from ray_constants.py to xxx_xpu.py.
| if "HIP_VISIBLE_DEVICES" in os.environ: | ||
| hip_val = os.environ["HIP_VISIBLE_DEVICES"] | ||
| if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None): | ||
| logger.warning( |
There was a problem hiding this comment.
Adding checks outside of functions makes it difficult to determine when they will be executed, especially since users might import packages in different orders.
| if "HIP_VISIBLE_DEVICES" in os.environ: | ||
| hip_val = os.environ["HIP_VISIBLE_DEVICES"] | ||
| if cuda_val := os.environ.get("CUDA_VISIBLE_DEVICES", None): | ||
| logger.warning( |
There was a problem hiding this comment.
How about verifying it in get_current_process_visible_accelerator_ids?
|
Test logs from latest commits |
There was a problem hiding this comment.
what will happen if they are not equal?
There was a problem hiding this comment.
If they are not equal the assertion on line 46 will fail.
(py3.9) root@quanta-ccs-aus-f11-01:/app/ray/python/ray# HIP_VISIBLE_DEVICES=0 CUDA_VISIBLE_DEVICES=0,1 python
Python 3.9.21 | packaged by conda-forge | (main, Dec 5 2024, 13:51:40)
[GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
import ray
Traceback (most recent call last):
File "", line 1, in
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/init.py", line 114, in
from ray._private.worker import ( # noqa: E402,F401
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/worker.py", line 1272, in
global_worker = Worker()
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/worker.py", line 447, in init
ray._private.utils.get_visible_accelerator_ids()
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/utils.py", line 304, in get_visible_accelerator_ids
return {
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/utils.py", line 305, in
accelerator_resource_name: get_accelerator_manager_for_resource(
File "/opt/conda/envs/py3.9/lib/python3.9/site-packages/ray-3.0.0.dev0-py3.9-linux-x86_64.egg/ray/_private/accelerators/amd_gpu.py", line 46, in get_current_process_visible_accelerator_ids
assert hip_val == cuda_val
AssertionError
There was a problem hiding this comment.
Oh, sorry—I'm not referring to this PR. I'm referring to what will happen to the GPUs or GPU drivers if CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES are set. For example, the wrong GPUs might be used.
That is, I don't know why they should be the same if CUDA_VISIBLE_DEVICES is set.
There was a problem hiding this comment.
CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICE can have different effects when used together with various applications and frameworks. We want to be certain of the users intent in setting these variables. Perhaps a warning would be more appropriate?
There was a problem hiding this comment.
Got it. Assertion is better for me.
python/ray/_private/ray_constants.py
Outdated
There was a problem hiding this comment.
ray_constants doesn't have HIP_VISIBLE_DEVICES_ENV_VAR.
c24e63c to
0d0c305
Compare
Signed-off-by: root <root@quanta-ccs-aus-f11-01.cs-aus.dcgpu>
Signed-off-by: vickytang <vtsang@amd.com>
…/ray into amd/use-hip-visible-devices
|
|
||
| hip_val = os.environ.get(HIP_VISIBLE_DEVICES_ENV_VAR, None) | ||
| if cuda_val := os.environ.get(CUDA_VISIBLE_DEVICES_ENV_VAR, None): | ||
| assert hip_val == cuda_val |
There was a problem hiding this comment.
would you mind providing an actionable error message?
Signed-off-by: root <vtsang@amd.com>
|
cc @justinvyu @matthewdeng, would you mind reviewing the changes in Ray Train? It's just a small change. Thanks! |
…ct#51104) Signed-off-by: root <vtsang@amd.com> Signed-off-by: vickytsang <vtsang@amd.com> Signed-off-by: Dhakshin Suriakannu <d_suriakannu@apple.com>
Now ray uses HIP_VISIBLE_DEVICES instead of ROCR_VISIBLE_DEVICES in recent changes for AMD devices. ray-project/ray#51104 This PR integrates such changes here. Signed-off-by: Hollow Man <hollowman@opensuse.org>
…ct#51104) Signed-off-by: root <vtsang@amd.com> Signed-off-by: vickytsang <vtsang@amd.com>
Now ray uses HIP_VISIBLE_DEVICES instead of ROCR_VISIBLE_DEVICES in recent changes for AMD devices. ray-project/ray#51104 This PR integrates such changes here. Signed-off-by: Hollow Man <hollowman@opensuse.org>
Why are these changes needed?
Replace AMD device env var with HIP_VISIBLE_DEVICES to align with PYTORCH semantics
HIP_VISIBLE_DEVICES has the same semantics as CUDA_VISIBLE_DEVICES in PYTORCH. Using ROCR_VISIBLE_DEVICES with can cause unexpected behavior
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.