
Patch PyTorch's bug that cross-process tensor transfer leads to the wrong device #4565

Merged
zhyncs merged 57 commits into sgl-project:main from fzyzcjy:feat/patch_torch on Mar 27, 2025
Conversation

@fzyzcjy (Collaborator) commented Mar 19, 2025

Motivation

Monkey-patch PyTorch until pytorch/pytorch#149248 is fixed upstream.
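For context, torch.multiprocessing serializes a CUDA tensor together with its per-process logical device index, and that index can point at a different physical GPU in the receiving process when the two processes have different CUDA_VISIBLE_DEVICES settings. Below is a minimal sketch of one way to work around this (illustrative only, not the exact code in this PR; the helper names are hypothetical): translate the logical index into the GPU UUID before sending and back into a local index when rebuilding.

```python
import torch

# Illustrative sketch only; these helpers are hypothetical and are not the
# exact patch applied in this PR.

def device_index_to_uuid(index: int) -> str:
    # GPU UUIDs identify the physical device and are stable across processes,
    # unlike logical indices, which depend on each process's
    # CUDA_VISIBLE_DEVICES. (The `uuid` field on device properties is
    # available in recent PyTorch releases.)
    return str(torch.cuda.get_device_properties(index).uuid)

def device_uuid_to_index(uuid: str) -> int:
    # Map a physical GPU UUID back to this process's logical device index.
    for i in range(torch.cuda.device_count()):
        if str(torch.cuda.get_device_properties(i).uuid) == uuid:
            return i
    raise ValueError(f"GPU with UUID {uuid} is not visible in this process")
```

A monkey patch can then wrap the sender/receiver hooks in torch.multiprocessing.reductions so that the device travels across processes as a UUID rather than a raw index; see the PR diff for the actual implementation.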

Modifications

Checklist

Comment thread on test/srt/run_suite.py:
TestFile("test_openai_server.py", 124),
TestFile("test_penalty.py", 41),
TestFile("test_page_size.py", 60),
TestFile("test_patch_torch.py", 60),
Collaborator

test_patch_torch: consider renaming it to test_patch_torch_mem_saver.

@fzyzcjy (Collaborator, Author) replied Mar 22, 2025

I named it that way because it is a general patch that fixes a real PyTorch bug. Currently it is only used by memory saver + sgl.VerlEngine, but in theory it can be useful in other scenarios.

@zhaochenyang20 (Collaborator)

We will continue this work in verl-project/verl#756.

@zhyncs zhyncs merged commit 92bb49a into sgl-project:main Mar 27, 2025
34 of 39 checks passed
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
@albanD commented May 21, 2025

FYI from the PyTorch side, whether this is a bug or a feature is not clear-cut :D
We have users that depend on the current behavior and so this patch might have surprising behavior for them.

In the context of this project, I'm curious if finer grained control of CUDA_VISIBLE_DEVICES would allow you to avoid the issue?

Also, it is still very unclear to me how you end up in a situation where two GPUs are on the same machine (so they get confused via CUDA_VISIBLE_DEVICES) but don't have peer access?

@fzyzcjy (Collaborator, Author) replied May 21, 2025

@albanD

We have users that depend on the current behavior and so this patch might have surprising behavior for them.

(replied in pytorch/pytorch#149248)

In the context of this project, I'm curious if finer grained control of CUDA_VISIBLE_DEVICES would allow you to avoid the issue?

It does not seem trivial with the current code (but the scenario may change in the future).

Also, it is still very unclear to me how you end up in a situation where two GPUs are on the same machine (so they get confused via CUDA_VISIBLE_DEVICES) but don't have peer access?

Firstly, IIRC I have seen some users report this, e.g. on some 4090s.

Secondly, it is because I maintain https://github.com/fzyzcjy/torch_memory_saver, which is more sensitive to having the correct GPU devices. (I could also enable peer access across all GPUs, but that would be slightly slower.)
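To make the CUDA_VISIBLE_DEVICES confusion above concrete, here is a small sketch (not from the PR) that prints the logical-index to physical-UUID mapping for the current process. Running it under different CUDA_VISIBLE_DEVICES values shows that the same physical GPU can receive a different logical index in each process, which is why a raw device index is ambiguous when a tensor is transferred across processes.

```python
import os
import torch

# Sketch for illustration: print this process's logical index -> physical GPU
# mapping. Under e.g. CUDA_VISIBLE_DEVICES=0,1 in one process and
# CUDA_VISIBLE_DEVICES=1 in another, the same physical GPU appears at
# different logical indices, so an index serialized by one process may point
# at the wrong GPU in the other. (The `uuid` field requires a recent PyTorch.)
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name} ({props.uuid})")
```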

@albanD replied May 21, 2025

It does not seem trivial with the current code (but the scenario may change in the future).

That is unfortunate :/

Secondly, it is because I maintain https://github.com/fzyzcjy/torch_memory_saver, which is more sensitive to having the correct GPU devices.

Very cool project :)
All the data is lost after a pause/resume though? CPU offloading the data could be a cool option!
Also we have https://docs.pytorch.org/docs/stable/notes/cuda.html#mixing-different-cuda-system-allocators-in-the-same-program that might be of interest for you for this kind of advanced usage.
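For reference, the pluggable-allocator API linked above is used roughly as follows (a sketch based on the PyTorch docs; alloc.so, my_malloc, and my_free are placeholder names for a user-compiled allocator library exporting C functions with the signatures described in the linked note):

```python
import torch

# Sketch based on the linked PyTorch docs. `alloc.so` is assumed to be a
# user-compiled shared library exporting allocation/deallocation functions
# with the C signatures described in the note above.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator("alloc.so", "my_malloc", "my_free")

# Swap the active allocator; this must happen before any CUDA memory has been
# allocated through the current caching allocator.
torch.cuda.memory.change_current_allocator(new_alloc)

# Subsequent CUDA allocations are routed through the pluggable allocator.
x = torch.zeros(16, device="cuda")
```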

@fzyzcjy (Collaborator, Author) replied May 21, 2025

Very cool project :)

Thank you :)

All the data is lost after a pause/resume though? CPU offloading the data could be a cool option!

Yes, it can back up and restore data as well; that will land on the master branch later. The data was originally lost because in RL people do not need it.

Also we have docs.pytorch.org/docs/stable/notes/cuda.html#mixing-different-cuda-system-allocators-in-the-same-program that might be of interest for you for this kind of advanced usage.

Yes, I know about it :) I chose the current approach so that I can release even PyTorch's low-level memory, while still benefiting from the power of PyTorch's caching allocator.

