Fix multiprocessing with CUDA_VISIBLE_DEVICES seems to give the wrong device #149248
fzyzcjy wants to merge 1 commit into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149248
Note: Links to docs will display an error until the doc builds have been completed. ✅ No Failures as of commit 98aef5c with merge base 1e37e5b. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "release notes: distributed (miscellaneous)"
Let's discuss this on the issue.
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Hi, is it possible for this to be merged?
@fzyzcjy I'm afraid it is not, as this very much breaks the current behavior, in particular for the many distributed users that rely on always using …
@albanD I see, thank you! However, I do find it weird: when we see a tensor with "device=0", it does not actually mean the tensor is on the 0th device; it may in fact be on another device :/
(I also replied in sgl-project/sglang#4565) |
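For readers arriving from the linked issue, here is a minimal repro-style sketch of the mismatch being discussed (hypothetical code, assuming a machine with at least two GPUs): a child process restricted to physical GPU 1 still reports its tensors as being on `cuda:0`.

```python
import os

import torch
import torch.multiprocessing as mp


def child():
    # Inside this process CUDA_VISIBLE_DEVICES="1", so the only visible
    # GPU is numbered 0 even though it is physically GPU 1.
    x = torch.zeros(1, device="cuda:0")
    print(x.device)  # prints cuda:0, but the data lives on physical GPU 1


if __name__ == "__main__":
    # Restrict visibility before the child process initializes CUDA;
    # a spawned child inherits the parent's environment.
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"
    p = mp.get_context("spawn").Process(target=child)
    p.start()
    p.join()
```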
REF: pytorch/pytorch#149248

Fix issues:
1. performance degradation when TP > 1 if sync w/o bucketing
2. CudaError when TP > 1 if sync w/ bucketing

* add patch for sglang sync when TP > 1
* fix pylint
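For context, the bucketing mentioned above generally means coalescing many small tensors into one flat buffer so a tensor-parallel group pays a single collective instead of one per tensor. Below is a rough sketch of that general technique, not the actual sglang patch; `bucketed_broadcast` is a hypothetical helper, and it assumes all tensors share one dtype and device.

```python
import torch
import torch.distributed as dist


def bucketed_broadcast(tensors, src=0):
    # Hypothetical helper: assumes dist.init_process_group() has already
    # run and that all tensors share one dtype and device.
    flat = torch.cat([t.reshape(-1) for t in tensors])
    dist.broadcast(flat, src=src)  # one collective for the whole bucket
    # Copy the synchronized values back into the original tensors.
    offset = 0
    for t in tensors:
        n = t.numel()
        t.copy_(flat[offset:offset + n].view_as(t))
        offset += n
```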
Fixes #149196
This is merely a proof-of-concept PR. I would like to hear some feedback (is the direction acceptable?) before working on it further.
Things that will be added if the direction of the PR looks acceptable: unit tests, caching, a C++ implementation (for speed), etc.
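As a rough illustration of the kind of translation such a fix needs (a sketch only, not necessarily what this PR implements; `logical_to_physical` is a hypothetical name, and it assumes `CUDA_VISIBLE_DEVICES` contains integer indices rather than GPU UUIDs), the device index a process sees can be mapped back to a physical index:

```python
import os


def logical_to_physical(logical_index: int) -> int:
    # Hypothetical helper: translate the device index this process sees
    # into the physical GPU index, honoring CUDA_VISIBLE_DEVICES.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return logical_index  # no remapping in effect
    physical = [int(d) for d in visible.split(",") if d.strip()]
    return physical[logical_index]


# With CUDA_VISIBLE_DEVICES="2,3", logical device 0 is physical GPU 2.
```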