
[bug fix] CUDA error: invalid device ordinal when CUDA_VISIBLE_DEVICES start id is not equal to 0#116

Merged
zhuzilin merged 4 commits into THUDM:main from guapisolo:gpu-non-0
Aug 11, 2025

Conversation

@guapisolo
Contributor

@guapisolo guapisolo commented Jul 29, 2025

NOTE: This PR may admit a better implementation after #54928 in ray is merged.

The bug should be divided into two problems:

  1. The starting GPU id has to be 0.
  2. The visible GPU id list has to be contiguous.

The expected behavior is: we can assign the physical GPU ids simply by setting CUDA_VISIBLE_DEVICES before `ray start` and `ray job submit`.
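The expected launch flow can be sketched roughly as follows. This is a hypothetical sequence: the GPU ids, script name, and single-node layout are assumptions for illustration, not taken from the PR.

```shell
# Hypothetical: expose only physical GPUs 4-7 before starting Ray, so the
# start id is 4 (not 0) -- exactly the case this PR makes work.
export CUDA_VISIBLE_DEVICES=4,5,6,7
ray start --head
ray job submit -- python train.py
```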

On train side:

  • LOCAL_RANK in TrainRayActor should be the logical GPU id rather than the physical GPU id, so that torch.cuda.set_device works correctly.
  • [deprecated] Use the Ray-allocated GPU id.
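The logical-vs-physical distinction on the train side can be illustrated with a minimal sketch. The helper name is hypothetical; the PR's actual change lives in TrainRayActor.

```python
import os

def logical_local_rank(physical_gpu_id: int) -> int:
    """Map a physical GPU id to the logical index CUDA libraries see.

    With CUDA_VISIBLE_DEVICES="4,5,6,7", torch enumerates the visible
    GPUs as 0..3, so torch.cuda.set_device must be given the position of
    the physical id within the visible list, not the physical id itself.
    (Hypothetical helper for illustration, not the PR's code.)
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if not visible:
        # All GPUs visible: logical and physical ids coincide.
        return physical_gpu_id
    return [int(x) for x in visible.split(",")].index(physical_gpu_id)

os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
print(logical_local_rank(6))  # -> 2; set_device(6) here would raise
                              #    "invalid device ordinal"
```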

On rollout side:

  • The GPU allocated to RolloutRayActor (set to 0.2) is just a placeholder. In the original behavior, the SglangEngine class clears the CUDA_VISIBLE_DEVICES setting and allocates GPUs for sglang starting from physical GPU 0 contiguously. This behavior leads to the two problems above.

  • In this PR, we set RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES in RolloutRayActor to prevent Ray from modifying CUDA_VISIBLE_DEVICES. The SGLang server can then be launched with the externally set CUDA_VISIBLE_DEVICES, which resolves both problems.
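The rollout-side fix amounts to opting the actor out of Ray's GPU remapping. A minimal sketch of passing the flag through an actor's `runtime_env` follows; the environment variable is real Ray behavior, but the actor wiring shown in the comments is illustrative, not the PR's code.

```python
# RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES tells the Ray worker not to
# overwrite CUDA_VISIBLE_DEVICES, so the externally set value survives and
# SGLang sees the intended physical GPUs.
rollout_runtime_env = {
    "env_vars": {
        "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1",
    }
}

# Illustrative usage (actor name from the PR, wiring assumed):
# engine = RolloutRayActor.options(
#     num_gpus=0.2,  # placeholder allocation, as described above
#     runtime_env=rollout_runtime_env,
# ).remote(...)
```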

Test result:
Logging nvidia-smi every 3 seconds showed no GPU utilization outside CUDA_VISIBLE_DEVICES, which suggests the modification is correct.

@zhuzilin zhuzilin merged commit c22f55b into THUDM:main Aug 11, 2025
1 check failed
fy1214 pushed a commit to fy1214/slime that referenced this pull request Aug 13, 2025
…S start id is not equal to 0 (THUDM#116)

* fix on train side

* fix on rollout side

* revert noset cvd on rollout side

* add notice in --rollout-num-gpus-per-node
rysaya pushed a commit to rysaya/slime that referenced this pull request Aug 15, 2025
@guapisolo guapisolo deleted the gpu-non-0 branch September 15, 2025 22:17
llltttwww pushed a commit to llltttwww/slime that referenced this pull request Nov 30, 2025
yueming-yuan pushed a commit to yueming-yuan/slime that referenced this pull request Dec 29, 2025
Yangruipis pushed a commit to redai-infra/slime that referenced this pull request Feb 28, 2026