[core] Remove override accelerator warning and change default behavior#62492
Conversation
Code Review
This pull request modifies Ray's behavior to prevent the overriding of accelerator environment variables, such as CUDA_VISIBLE_DEVICES, when zero accelerators are allocated. Key changes include setting the default value of RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO to False, removing the corresponding FutureWarning, and updating test cases to reflect this new default behavior. A review comment suggests improving the robustness of the tests by explicitly setting and asserting the preservation of environment variables to ensure they are not being cleared or modified during initialization.
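To illustrate the new default (not part of the diff; a minimal sketch assuming a fresh single-node ray.init() where workers inherit the driver's environment):

import os
import ray

# Value set by the user or cluster launcher before Ray starts.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

ray.init()

@ray.remote(num_gpus=0)
def visible_devices():
    import os
    # With RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO now defaulting to False,
    # Ray no longer scrubs the variable for zero-GPU workers, so the
    # inherited value is preserved.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

assert ray.get(visible_devices.remote()) == "0,1"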
| **{"RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO": "0"}, | ||
| ), | ||
| ) | ||
| run_string_as_driver(not_override_check_script) |
The test for the new default behavior in not_override_check_script could be more robust. It currently asserts that CUDA_VISIBLE_DEVICES is not set, which relies on the assumption that it's not set in the test execution environment.
A stronger test would be to explicitly set CUDA_VISIBLE_DEVICES to a specific value before ray.init() and then assert that this value is preserved within the remote task/actor. This would more accurately verify that the environment variable is not being overridden when num_gpus=0.
Here's a suggested improvement for not_override_check_script:
not_override_check_script = """
import os
import ray

# Set a specific value to check for preservation.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"

ray.init()

@ray.remote(num_gpus=0)
def check():
    import os
    assert os.environ.get("CUDA_VISIBLE_DEVICES") == "0,1,2"

@ray.remote(num_gpus=0)
class Actor:
    def check(self):
        import os
        assert os.environ.get("CUDA_VISIBLE_DEVICES") == "0,1,2"

print("task check", ray.get(check.remote()))
print("actor check", ray.get(Actor.options(num_gpus=0).remote().check.remote()))
"""

This change would make the test more explicit and less dependent on the environment configuration.
@Sparks0219 some relevant test failures
many failing tests

the remaining ones are due to some java_plugin thing and not related, I think premerge is broken right now 😪
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 4b7a979.
ray-project#62492) Following up on ray-project#54928, where we originally introduced a feature flag to give users the option to not set CUDA_VISIBLE_DEVICES when num_gpus=0 or None. We also output a warning informing users that the default behavior would change in a future Ray version. Since it's been around 8 months since we introduced this feature flag and the warning is a bit distracting, we're now making this the default behavior, meaning we will no longer override CUDA_VISIBLE_DEVICES when num_gpus=0 or None. --------- Signed-off-by: Joshua Lee <joshlee@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: doanxem99 <nguyendinhphuongnam99@gmail.com>
… scrubbing on num_gpus=0 actors #62492 flipped the default of RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO from True to False, so Ray no longer overrides CUDA_VISIBLE_DEVICES for actors with num_gpus=0. 11 test cases in test_torch_tensor_transport.py relied on the old behavior, where bare Actor.remote() workers would have CUDA_VISIBLE_DEVICES="" set, causing torch to raise "No CUDA GPUs are available" on .to("cuda"). Adds a per-test fixture that sets RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=1 via monkeypatch before ray_start_regular boots Ray, restoring the old behavior for just the affected tests. No production code is changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
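A minimal sketch of such a fixture (the fixture name is hypothetical; monkeypatch is pytest's built-in fixture):

import pytest

@pytest.fixture
def enable_accel_env_override(monkeypatch):
    # Restore the pre-#62492 scrubbing behavior for tests that rely on
    # CUDA_VISIBLE_DEVICES="" in zero-GPU workers. Must take effect
    # before ray_start_regular boots Ray so workers inherit the flag.
    monkeypatch.setenv("RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO", "1")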
After #62492 we no longer set CUDA_VISIBLE_DEVICES="" when num_gpus=0 or not set. If torch detects CUDA_VISIBLE_DEVICES="", it throws a runtime error; now that CUDA_VISIBLE_DEVICES is not set at all, torch falls back to the NVIDIA driver to get the device IDs. Following up on #62653 and instead checking for the default cuda:0 GPU ID in these tests. --------- Signed-off-by: Joshua Lee <joshlee@anyscale.com>
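To illustrate the distinction the commit relies on, a sketch (assumes torch is installed and the machine has one GPU):

import os
import subprocess
import sys

probe = "import torch; print(torch.cuda.device_count())"

# CUDA_VISIBLE_DEVICES="" hides every device: device_count() is 0 and
# .to("cuda") raises RuntimeError("No CUDA GPUs are available").
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
subprocess.run([sys.executable, "-c", probe], env=env, check=True)

# With the variable unset entirely, torch queries the NVIDIA driver, so
# on a single-GPU machine the default device shows up as cuda:0.
env = {k: v for k, v in os.environ.items() if k != "CUDA_VISIBLE_DEVICES"}
subprocess.run([sys.executable, "-c", probe], env=env, check=True)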

Following up on #54928, where we originally introduced a feature flag to give users the option to not set CUDA_VISIBLE_DEVICES when num_gpus=0 or None. We also output a warning informing users that the default behavior would change in a future Ray version. Since it's been around 8 months since we introduced this feature flag and the warning is a bit distracting, we're now making this the default behavior, meaning we will no longer override CUDA_VISIBLE_DEVICES when num_gpus=0 or None.
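For users who depend on the old scrubbing, a sketch of opting back in; this assumes the flag is set in the driver's environment before Ray starts so that workers inherit it:

# Equivalent to launching with:
#   RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=1 python my_driver.py
import os
os.environ.setdefault("RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO", "1")

import ray
ray.init()  # zero-GPU workers again see CUDA_VISIBLE_DEVICES=""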