
[core] Fix test_torch_tensor_transport expecting CUDA_VISIBLE_DEVICES…#62653

Closed
elliot-barn wants to merge 1 commit into master from override_accelerator_env_in_test_torch_tensor_transport

Conversation

@elliot-barn
Collaborator

@elliot-barn elliot-barn commented Apr 16, 2026

… scrubbing on num_gpus=0 actors

#62492 flipped the default of RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO from True to False, so Ray no longer overrides CUDA_VISIBLE_DEVICES for actors with num_gpus=0. 11 test cases in test_torch_tensor_transport.py relied on the old behavior where bare Actor.remote() workers would have CUDA_VISIBLE_DEVICES="" set, causing torch to raise "No CUDA GPUs are available" on .to("cuda").

Adds a per-test fixture that sets RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=1 via monkeypatch before ray_start_regular boots Ray, restoring the old behavior for just the affected tests. No production code is changed.

failing postmerge tests: https://buildkite.com/ray-project/postmerge/builds/17053

successful postmerge run: https://buildkite.com/ray-project/postmerge/builds/17060

@elliot-barn elliot-barn force-pushed the override_accelerator_env_in_test_torch_tensor_transport branch from bcbcd3e to f1f2c05 on April 16, 2026 03:54
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new pytest fixture, override_accelerator_env_on_zero, to the test_torch_tensor_transport.py file. This fixture sets the RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO environment variable to restore legacy behavior where Ray clears CUDA_VISIBLE_DEVICES for actors without assigned GPUs, ensuring that tests expecting CUDA unavailability errors on CPU-only actors function correctly. Multiple test cases have been updated to include this fixture. I have no feedback to provide.

@ray-gardener ray-gardener Bot added the core Issues that should be addressed in Ray Core label Apr 16, 2026
Contributor

@Sparks0219 Sparks0219 left a comment


TBH I think it would be better to test the default RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO since that's what users would actually use, however this is also a cgraph test so I don't have a strong opinion

@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Apr 16, 2026
@edoakes
Collaborator

edoakes commented Apr 16, 2026

> TBH I think it would be better to test the default RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO since that's what users would actually use, however this is also a cgraph test so I don't have a strong opinion

yeah let's just do this

richardliaw pushed a commit that referenced this pull request Apr 18, 2026
After #62492 we no longer set CUDA_VISIBLE_DEVICES="" when num_gpus=0 or the
resource is not set. Torch throws a runtime error when it detects
CUDA_VISIBLE_DEVICES="", but now that CUDA_VISIBLE_DEVICES is not set at all,
torch falls back to the NVIDIA driver to enumerate device IDs. Following up
on #62653 by instead checking for the default cuda:0 GPU ID in these tests.

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
HLDKNotFound pushed a commit to chichic21039/ray that referenced this pull request Apr 22, 2026
