
Fix multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device#149248

Closed
fzyzcjy wants to merge 1 commit into pytorch:main from fzyzcjy:feat/ac3289

Conversation

@fzyzcjy
Contributor

@fzyzcjy fzyzcjy commented Mar 15, 2025

Fixes #149196

This is merely a proof-of-concept PR. I would like some feedback on whether the direction is acceptable before working on it further.

Things that will be added if the direction of the PR looks acceptable: unit tests, caching, a C++ implementation (for speed), etc.
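
For context, a minimal sketch of the failure mode the linked issue describes (illustrative only, not code from this PR; it assumes a machine with at least two GPUs, and the exact symptom may vary across PyTorch versions):

```python
# Illustrative reproduction sketch, not from the PR. Assumes >= 2 GPUs.
import os

import torch
import torch.multiprocessing as mp


def child(queue):
    # Restrict the child to physical GPU 1 *before* CUDA initializes,
    # so this process's logical cuda:0 is physical GPU 1.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    t = queue.get()
    # The shared tensor still carries the parent's logical index (cuda:1),
    # which refers to a different (or nonexistent) device in this process.
    print("child sees:", t.device)


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=child, args=(q,))
    p.start()
    q.put(torch.ones(4, device="cuda:1"))  # the parent sees all GPUs
    p.join()
```

The logical device index is only meaningful relative to each process's own CUDA_VISIBLE_DEVICES mapping, which is what the PR aims to account for when tensors cross process boundaries.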

@pytorch-bot

pytorch-bot Bot commented Mar 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149248

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 98aef5c with merge base 1e37e5b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@fzyzcjy
Contributor Author

fzyzcjy commented Mar 15, 2025

@pytorchbot label "release notes: distributed (miscellaneous)"

@albanD albanD added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 17, 2025
@albanD
Collaborator

albanD commented Mar 17, 2025

Let's discuss on the issue

@github-actions
Contributor

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale label May 16, 2025
@fzyzcjy
Contributor Author

fzyzcjy commented May 16, 2025

Hi, is it possible for this to be merged?

@albanD
Collaborator

albanD commented May 20, 2025

@fzyzcjy I'm afraid it is not, as this very much breaks the current behavior, in particular for the many distributed users who rely on always using device=0 by setting an appropriate CUDA_VISIBLE_DEVICES={rank}. This patch would make it impossible for those users to send Tensors across processes.
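
For reference, a hedged sketch of the launch pattern described above, where each rank is pinned to one physical GPU via CUDA_VISIBLE_DEVICES and always addresses it as logical device 0 (worker.py is a hypothetical per-rank script, not part of PyTorch):

```python
# Hedged sketch of the "CUDA_VISIBLE_DEVICES={rank}" pattern. worker.py is
# a hypothetical script in which each rank simply uses torch.device("cuda:0").
import os
import subprocess
import sys


def launch(world_size: int) -> None:
    procs = []
    for rank in range(world_size):
        env = dict(os.environ)
        env["CUDA_VISIBLE_DEVICES"] = str(rank)  # rank N only sees physical GPU N
        env["RANK"] = str(rank)
        procs.append(subprocess.Popen([sys.executable, "worker.py"], env=env))
    for p in procs:
        p.wait()


if __name__ == "__main__":
    launch(world_size=2)
```

Under this convention every process calls its own GPU "cuda:0", so remapping logical indices to physical IDs during cross-process sharing would change what receivers see, which is the breakage described above.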

@fzyzcjy
Contributor Author

fzyzcjy commented May 21, 2025

@albanD I see, thank you! However, I do find it odd: when we see such a tensor with "device=0", it does not actually mean the tensor is on the 0th physical device; it may in fact be on another device :/
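
To make the logical-vs-physical distinction concrete, here is a hypothetical helper that recovers the physical GPU ID behind a logical index by consulting this process's CUDA_VISIBLE_DEVICES (it handles only the plain integer-list form of the variable, not GPU UUIDs):

```python
# Hypothetical helper, illustrating the logical-vs-physical distinction.
# Handles only the integer-list form of CUDA_VISIBLE_DEVICES, not UUIDs.
import os


def logical_to_physical(logical_index: int) -> int:
    """Map this process's logical CUDA index to a physical GPU ID."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return logical_index  # no remapping in effect
    return int(visible.split(",")[logical_index])


# With CUDA_VISIBLE_DEVICES="3,1", logical cuda:0 is physical GPU 3:
os.environ["CUDA_VISIBLE_DEVICES"] = "3,1"
assert logical_to_physical(0) == 3
assert logical_to_physical(1) == 1
```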

@fzyzcjy
Contributor Author

fzyzcjy commented May 21, 2025

(I also replied in sgl-project/sglang#4565)

@github-actions github-actions Bot closed this Jun 20, 2025
lostkevin added a commit to alibaba/ChatLearn that referenced this pull request Sep 1, 2025
REF: pytorch/pytorch#149248

Fix issues:

1. performance degradation when TP > 1 if sync w/o bucketing
2. CudaError when TP > 1 if sync w/ bucketing

* add patch for sglang sync when TP > 1

* fix pylint

Labels

open source, release notes: distributed (miscellaneous), Stale, triaged


Development

Successfully merging this pull request may close these issues.

(Will PR) Multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device

3 participants