Patch PyTorch's bug where cross-process tensor transfer leads to the wrong device #4565
zhyncs merged 57 commits into sgl-project:main from feat/patch_torch
Conversation
```python
TestFile("test_openai_server.py", 124),
TestFile("test_penalty.py", 41),
TestFile("test_page_size.py", 60),
TestFile("test_patch_torch.py", 60),
```
Suggestion: change test_patch_torch to test_patch_torch_mem_saver.
I currently name it like that because it is a general patch that fixes a real PyTorch bug. Although it is currently only used by the memory saver + sgl.VerlEngine, it could theoretically be useful in other scenarios as well.
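To make the failure mode concrete, here is a minimal sketch (assuming a machine with at least two visible GPUs; the exact layout of the serialized tuple is an internal detail of torch.multiprocessing.reductions):

```python
import torch
from torch.multiprocessing import reductions

t = torch.zeros(4, device="cuda:1")
rebuild_fn, rebuild_args = reductions.reduce_tensor(t)

# rebuild_args carries the *local* device index (1 here), not an identifier
# of the physical GPU. A consumer process whose CUDA_VISIBLE_DEVICES maps
# that physical GPU to a different local index will rebuild the tensor on
# the wrong device.
print(rebuild_args)
```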
We will continue on this.
FYI from the PyTorch side, whether this is a bug or a feature is not clear-cut :D In the context of this project, I'm curious whether finer-grained control of CUDA_VISIBLE_DEVICES would let you avoid the issue. Also, it is still very unclear to me how you get into a situation where two GPUs are on the same machine (so they get confused via CUDA_VISIBLE_DEVICES) but do not have peer access?
(replied in pytorch/pytorch#149248)
That seems nontrivial for the current code (but the scenario may change in the future).
Firstly, IIRC I have seen some users report this, e.g. on some 4090s. Secondly, it is because I work on https://github.com/fzyzcjy/torch_memory_saver, which makes it more sensitive to having the correct GPU devices. (I could also enable peer access across all GPUs, but that would be slightly slower.)
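As an aside, the peer-access situation being discussed can be probed directly (a small sketch using torch.cuda.can_device_access_peer; on consumer cards such as 4090s this can report False even for two GPUs in the same box):

```python
import torch

# Print the P2P capability matrix for all visible GPUs. Lack of peer
# access is one of the scenarios where a wrong-device rebuild matters.
for src in range(torch.cuda.device_count()):
    for dst in range(torch.cuda.device_count()):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"cuda:{src} -> cuda:{dst}: peer access = {ok}")
```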
That is unfortunate :/
Very cool project :)
Thank you :)
Yes, it can back up/restore data as well; that will land on the master branch later. It originally got dropped because in RL people do not need it.
Yes, I know it :) I chose the current approach so that I can release even all of PyTorch's low-level memory, while still benefiting from the power of PyTorch's caching allocator.
Motivation
Monkey-patch PyTorch as a stopgap until pytorch/pytorch#149248 is fixed upstream.
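For readers unfamiliar with the approach, here is a minimal sketch of the monkey-patch idea (a sketch, not necessarily the exact code in this PR): wrap torch.multiprocessing.reductions.reduce_tensor so that the serialized local device index is replaced by the physical GPU's UUID, and wrap rebuild_cuda_tensor to translate that UUID back into the consumer's local index. The position of the device index inside the args tuple (_DEVICE_INDEX_POS) is an assumption here, and both producer and consumer processes must apply the patch.

```python
import torch
from torch.multiprocessing import reductions

_DEVICE_INDEX_POS = 6  # assumed position of the device index in the args tuple


def _device_to_uuid(device: int) -> str:
    # The UUID identifies the physical GPU regardless of CUDA_VISIBLE_DEVICES.
    return str(torch.cuda.get_device_properties(device).uuid)


def _device_from_uuid(device_uuid: str) -> int:
    for device in range(torch.cuda.device_count()):
        if str(torch.cuda.get_device_properties(device).uuid) == device_uuid:
            return device
    raise ValueError(f"Cannot find a visible GPU with uuid={device_uuid}")


def _replace_at(t, index, fn):
    return (*t[:index], fn(t[index]), *t[index + 1 :])


_raw_reduce_tensor = reductions.reduce_tensor
_raw_rebuild_cuda_tensor = reductions.rebuild_cuda_tensor


def _patched_reduce_tensor(tensor):
    rebuild_fn, args = _raw_reduce_tensor(tensor)
    if tensor.is_cuda:
        # Ship a UUID across the process boundary instead of a local index.
        args = _replace_at(args, _DEVICE_INDEX_POS, _device_to_uuid)
    return rebuild_fn, args


def _patched_rebuild_cuda_tensor(*args):
    # Map the UUID back to whatever local index this process sees.
    args = _replace_at(args, _DEVICE_INDEX_POS, _device_from_uuid)
    return _raw_rebuild_cuda_tensor(*args)


def monkey_patch_torch_reductions():
    reductions.reduce_tensor = _patched_reduce_tensor
    reductions.rebuild_cuda_tensor = _patched_rebuild_cuda_tensor
    reductions.init_reductions()  # re-register the patched reducers
```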
Modifications
Checklist