
Patch PyTorch's bug that cross-process tensor transfer leads to the wrong device #4565

Merged
zhyncs merged 57 commits into sgl-project:main from fzyzcjy:feat/patch_torch on Mar 27, 2025
Conversation

@fzyzcjy (Collaborator) commented Mar 19, 2025

Motivation

Monkey-patch PyTorch until pytorch/pytorch#149248 is fixed upstream.
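For context, torch.multiprocessing serializes a CUDA tensor together with its per-process logical device index, and that index can point at a different physical GPU in the receiving process when the two processes have different CUDA_VISIBLE_DEVICES settings. Below is a minimal sketch of one way to work around this (illustrative only, not the exact code in this PR; the helper names are hypothetical): translate the logical index into the GPU UUID before sending and back into a local index when rebuilding.

```python
import torch

# Illustrative sketch only; these helpers are hypothetical and are not the
# exact patch applied in this PR.

def device_index_to_uuid(index: int) -> str:
    # GPU UUIDs identify the physical device and are stable across processes,
    # unlike logical indices, which depend on each process's
    # CUDA_VISIBLE_DEVICES. (The `uuid` field on device properties is
    # available in recent PyTorch releases.)
    return str(torch.cuda.get_device_properties(index).uuid)

def device_uuid_to_index(uuid: str) -> int:
    # Map a physical GPU UUID back to this process's logical device index.
    for i in range(torch.cuda.device_count()):
        if str(torch.cuda.get_device_properties(i).uuid) == uuid:
            return i
    raise ValueError(f"GPU with UUID {uuid} is not visible in this process")
```

A monkey patch can then wrap the sender/receiver hooks in torch.multiprocessing.reductions so that the device travels across processes as a UUID rather than a raw index; see the PR diff for the actual implementation.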

Modifications

Checklist

Comment thread on test/srt/run_suite.py:
TestFile("test_openai_server.py", 124),
TestFile("test_penalty.py", 41),
TestFile("test_page_size.py", 60),
TestFile("test_patch_torch.py", 60),
Collaborator

test_patch_torch: consider renaming it to test_patch_torch_mem_saver.

@fzyzcjy (Collaborator, Author) replied Mar 22, 2025

I named it that way because it is a general patch that fixes a real PyTorch bug. Currently it is only used by memory saver + sgl.VerlEngine, but in theory it can be useful in other scenarios.

@zhaochenyang20 (Collaborator)

We will continue this work in verl-project/verl#756.

@zhyncs zhyncs merged commit 92bb49a into sgl-project:main Mar 27, 2025
34 of 39 checks passed
jimoosciuc pushed a commit to Furion-cn/sglang that referenced this pull request Apr 17, 2025
@albanD commented May 21, 2025

FYI from the PyTorch side, whether this is a bug or a feature is not clear-cut :D
We have users that depend on the current behavior and so this patch might have surprising behavior for them.

In the context of this project, I'm curious if finer grained control of CUDA_VISIBLE_DEVICES would allow you to avoid the issue?

Also, it is still very unclear to me how you end up in a situation where two GPUs are on the same machine (so they get confused via CUDA_VISIBLE_DEVICES) but don't have peer access?

@fzyzcjy (Collaborator, Author) replied May 21, 2025

@albanD

We have users that depend on the current behavior and so this patch might have surprising behavior for them.

(replied in pytorch/pytorch#149248)

In the context of this project, I'm curious if finer grained control of CUDA_VISIBLE_DEVICES would allow you to avoid the issue?

It does not seem trivial with the current code (but the scenario may change in the future).

Also, it is still very unclear to me how you end up in a situation where two GPUs are on the same machine (so they get confused via CUDA_VISIBLE_DEVICES) but don't have peer access?

Firstly, IIRC I have seen some users report this, e.g. on some 4090s.

Secondly, it is because I maintain https://github.com/fzyzcjy/torch_memory_saver, which is more sensitive to having the correct GPU devices. (I could also enable peer access across all GPUs, but that would be slightly slower.)
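To make the CUDA_VISIBLE_DEVICES confusion above concrete, here is a small sketch (not from the PR) that prints the logical-index to physical-UUID mapping for the current process. Running it under different CUDA_VISIBLE_DEVICES values shows that the same physical GPU can receive a different logical index in each process, which is why a raw device index is ambiguous when a tensor is transferred across processes.

```python
import os
import torch

# Sketch for illustration: print this process's logical index -> physical GPU
# mapping. Under e.g. CUDA_VISIBLE_DEVICES=0,1 in one process and
# CUDA_VISIBLE_DEVICES=1 in another, the same physical GPU appears at
# different logical indices, so an index serialized by one process may point
# at the wrong GPU in the other. (The `uuid` field requires a recent PyTorch.)
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i} -> {props.name} ({props.uuid})")
```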

@albanD replied May 21, 2025

It does not seem trivial with the current code (but the scenario may change in the future).

That is unfortunate :/

Secondly, it is because I maintain https://github.com/fzyzcjy/torch_memory_saver, which is more sensitive to having the correct GPU devices.

Very cool project :)
All the data is lost after a pause/resume though? CPU offloading the data could be a cool option!
Also we have https://docs.pytorch.org/docs/stable/notes/cuda.html#mixing-different-cuda-system-allocators-in-the-same-program that might be of interest for you for this kind of advanced usage.
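For reference, the pluggable-allocator API linked above is used roughly as follows (a sketch based on the PyTorch docs; alloc.so, my_malloc, and my_free are placeholder names for a user-compiled allocator library exporting C functions with the signatures described in the linked note):

```python
import torch

# Sketch based on the linked PyTorch docs. `alloc.so` is assumed to be a
# user-compiled shared library exporting allocation/deallocation functions
# with the C signatures described in the note above.
new_alloc = torch.cuda.memory.CUDAPluggableAllocator("alloc.so", "my_malloc", "my_free")

# Swap the active allocator; this must happen before any CUDA memory has been
# allocated through the current caching allocator.
torch.cuda.memory.change_current_allocator(new_alloc)

# Subsequent CUDA allocations are routed through the pluggable allocator.
x = torch.zeros(16, device="cuda")
```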

@fzyzcjy (Collaborator, Author) replied May 21, 2025

Very cool project :)

Thank you :)

All the data is lost after a pause/resume though? CPU offloading the data could be a cool option!

Yes, it can back up and restore data as well; that will land on the master branch later. The data was originally lost because in RL people do not need it.

Also we have docs.pytorch.org/docs/stable/notes/cuda.html#mixing-different-cuda-system-allocators-in-the-same-program that might be of interest for you for this kind of advanced usage.

Yes, I know about it :) I chose the current approach so that I can release even PyTorch's low-level memory, while still benefiting from the power of PyTorch's caching allocator.

