
[bug fix] CUDA error: invalid device ordinal when CUDA_VISIBLE_DEVICES start id is not equal to 0#116

Merged
zhuzilin merged 4 commits into THUDM:main from guapisolo:gpu-non-0
Aug 11, 2025

Conversation

@guapisolo
Contributor

@guapisolo guapisolo commented Jul 29, 2025

NOTE: This PR may admit a better implementation after #54928 in ray is merged.

The bug should be divided into two problems:

  1. The starting GPU id has to be 0.
  2. The visible GPU id list has to be contiguous.

The expected behavior is: we can assign the physical GPU ids simply by setting CUDA_VISIBLE_DEVICES before `ray start` and `ray job submit`.
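The expected launch flow can be sketched roughly as follows. This is a hypothetical sequence: the GPU ids, script name, and single-node layout are assumptions for illustration, not taken from the PR.

```shell
# Hypothetical: expose only physical GPUs 4-7 before starting Ray, so the
# start id is 4 (not 0) -- exactly the case this PR makes work.
export CUDA_VISIBLE_DEVICES=4,5,6,7
ray start --head
ray job submit -- python train.py
```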

On train side:

  • LOCAL_RANK in TrainRayActor should be the logical GPU id rather than the physical GPU id, so that torch.cuda.set_device works correctly.
  • [deprecated] Use the Ray-allocated GPU id.
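The logical-vs-physical distinction on the train side can be illustrated with a minimal sketch. The helper name is hypothetical; the PR's actual change lives in TrainRayActor.

```python
import os

def logical_local_rank(physical_gpu_id: int) -> int:
    """Map a physical GPU id to the logical index CUDA libraries see.

    With CUDA_VISIBLE_DEVICES="4,5,6,7", torch enumerates the visible
    GPUs as 0..3, so torch.cuda.set_device must be given the position of
    the physical id within the visible list, not the physical id itself.
    (Hypothetical helper for illustration, not the PR's code.)
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        # All GPUs visible: logical and physical ids coincide.
        return physical_gpu_id
    return [int(x) for x in visible.split(",")].index(physical_gpu_id)

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
print(logical_local_rank(6))  # -> 2; set_device(6) here would raise
                              #    "invalid device ordinal"
```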

On rollout side:

  • The GPU allocated to RolloutRayActor (set to 0.2) is just a placeholder. In the original behavior, the SglangEngine class clears the CUDA_VISIBLE_DEVICES setting and allocates GPUs for sglang starting from physical GPU 0 contiguously. This behavior leads to the two problems above.

  • In this PR, we set RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES in RolloutRayActor to prevent Ray from modifying CUDA_VISIBLE_DEVICES. The SGLang server can then be launched with the externally set CUDA_VISIBLE_DEVICES, which resolves both problems.
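The rollout-side fix amounts to opting the actor out of Ray's GPU remapping. A minimal sketch of passing the flag through an actor's `runtime_env` follows; the environment variable is real Ray behavior, but the actor wiring shown in the comments is illustrative, not the PR's code.

```python
# RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES tells the Ray worker not to
# overwrite CUDA_VISIBLE_DEVICES, so the externally set value survives and
# SGLang sees the intended physical GPUs.
rollout_runtime_env = {
    "env_vars": {
        "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
    }
}

# Illustrative usage (actor name from the PR, wiring assumed):
# engine = RolloutRayActor.options(
#     num_gpus=0.2,  # placeholder allocation, as described above
#     runtime_env=rollout_runtime_env,
# ).remote(...)
```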

Test result:
Logging nvidia-smi every 3 seconds showed no GPU utilization outside CUDA_VISIBLE_DEVICES, which suggests the modification is correct.

@zhuzilin zhuzilin merged commit c22f55b into THUDM:main Aug 11, 2025
1 check failed
fy1214 pushed a commit to fy1214/slime that referenced this pull request Aug 13, 2025
…S start id is not equal to 0 (THUDM#116)

* fix on train side

* fix on rollout side

* revert noset cvd on rollout side

* add notice in --rollout-num-gpus-per-node
rysaya pushed a commit to rysaya/slime that referenced this pull request Aug 15, 2025
@guapisolo guapisolo deleted the gpu-non-0 branch September 15, 2025 22:17
llltttwww pushed a commit to llltttwww/slime that referenced this pull request Nov 30, 2025
yueming-yuan pushed a commit to yueming-yuan/slime that referenced this pull request Dec 29, 2025
Yangruipis pushed a commit to redai-infra/slime that referenced this pull request Feb 28, 2026