[core] Fix test_torch_tensor_transport expecting CUDA_VISIBLE_DEVICES scrubbing on num_gpus=0 actors #62653
elliot-barn wants to merge 1 commit into master
Conversation
#62492 flipped the default of RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO from True to False, so Ray no longer overrides CUDA_VISIBLE_DEVICES for actors with num_gpus=0. 11 test cases in test_torch_tensor_transport.py relied on the old behavior, where bare Actor.remote() workers had CUDA_VISIBLE_DEVICES="" set, causing torch to raise "No CUDA GPUs are available" on .to("cuda").

This PR adds a per-test fixture that sets RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=1 via monkeypatch before ray_start_regular boots Ray, restoring the old behavior for just the affected tests. No production code is changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
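For reference, a minimal sketch of what such a fixture looks like (the fixture name override_accelerator_env_on_zero comes from the review below; the test shown is hypothetical and the actual fixture body in the PR may differ):

```python
import pytest

@pytest.fixture
def override_accelerator_env_on_zero(monkeypatch):
    # Restore the pre-#62492 behavior: Ray scrubs CUDA_VISIBLE_DEVICES
    # for actors with num_gpus=0. monkeypatch undoes this after each test.
    monkeypatch.setenv("RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO", "1")

# List the override fixture before ray_start_regular so the env var is
# already set when Ray boots (with no dependency between them, pytest
# instantiates these fixtures in argument order here).
def test_cpu_actor_raises_on_cuda(override_accelerator_env_on_zero, ray_start_regular):
    ...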
bcbcd3e to f1f2c05
Code Review
This pull request introduces a new pytest fixture, override_accelerator_env_on_zero, to the test_torch_tensor_transport.py file. This fixture sets the RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO environment variable to restore legacy behavior where Ray clears CUDA_VISIBLE_DEVICES for actors without assigned GPUs, ensuring that tests expecting CUDA unavailability errors on CPU-only actors function correctly. Multiple test cases have been updated to include this fixture. I have no feedback to provide.
Sparks0219 left a comment
TBH I think it would be better to test the default RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO since that's what users would actually use; however, this is also a cgraph test, so I don't have a strong opinion
yeah let's just do this
After #62492 we no longer set CUDA_VISIBLE_DEVICES="" when num_gpus=0 or unset. If torch detects CUDA_VISIBLE_DEVICES="", it throws a runtime error; now that CUDA_VISIBLE_DEVICES is not set at all, torch falls back to the NVIDIA driver to get the device IDs. Following up on #62653 and instead checking for the default cuda:0 GPU ID in these tests.

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
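A sketch of the revised assertion, assuming the test host has at least one GPU (the helper name here is hypothetical):

```python
import torch

def check_default_cuda_device():
    # With CUDA_VISIBLE_DEVICES left unset, torch falls back to the
    # NVIDIA driver to enumerate GPUs, so .to("cuda") succeeds and the
    # tensor lands on the default device, cuda:0, instead of raising
    # "No CUDA GPUs are available".
    t = torch.ones(4).to("cuda")
    assert t.device == torch.device("cuda", 0)
```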
failing postmerge tests: https://buildkite.com/ray-project/postmerge/builds/17053
successful postmerge run: https://buildkite.com/ray-project/postmerge/builds/17060