Fix CUDA RPC Stream Synchronization #50949
Closed
mrshenli wants to merge 4 commits into gh/mrshenli/289/base from
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
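The difference between the two chaining patterns can be illustrated with a toy model. The sketch below is not the real `c10::ivalue::Future` or `CUDAFuture`: `ToyFuture`, `ToyStreamAwareFuture`, `createInstance`, `chainManually`, and `chainWithThen` are hypothetical names, no CUDA streams are involved, and the virtual-factory mechanism is an assumption made for illustration. It only shows why a hand-made future filled from `addCallback` loses the parent's type, while a child produced by `then` keeps it and also inherits errors thrown in the callback.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for a future holding an int. Hypothetical, for illustration only.
class ToyFuture {
 public:
  virtual ~ToyFuture() = default;

  void markCompleted(int v) { value_ = v; finish(); }
  void setError(std::exception_ptr e) { error_ = e; finish(); }

  bool hasError() const { return error_ != nullptr; }
  std::exception_ptr exception_ptr() const { return error_; }
  // Returns the value, or rethrows the stored error.
  int value() const {
    if (error_) std::rethrow_exception(error_);
    return value_;
  }
  virtual std::string kind() const { return "plain"; }

  // Runs cb once this future is done (immediately if it already is).
  void addCallback(std::function<void()> cb) {
    if (done_) cb(); else callbacks_.push_back(std::move(cb));
  }

  // Creates the child through a virtual factory, so the child has the
  // parent's dynamic type. Exceptions thrown by cb become the child's error.
  std::shared_ptr<ToyFuture> then(std::function<int()> cb) {
    auto child = createInstance();
    addCallback([child, cb = std::move(cb)]() {
      try {
        child->markCompleted(cb());
      } catch (...) {
        child->setError(std::current_exception());
      }
    });
    return child;
  }

 protected:
  virtual std::shared_ptr<ToyFuture> createInstance() const {
    return std::make_shared<ToyFuture>();
  }

 private:
  void finish() {
    done_ = true;
    auto cbs = std::move(callbacks_);
    callbacks_.clear();
    for (auto& cb : cbs) cb();
  }
  bool done_ = false;
  int value_ = 0;
  std::exception_ptr error_;
  std::vector<std::function<void()>> callbacks_;
};

// Toy stand-in for a future that would carry stream information.
class ToyStreamAwareFuture : public ToyFuture {
 public:
  std::string kind() const override { return "stream-aware"; }
 protected:
  std::shared_ptr<ToyFuture> createInstance() const override {
    return std::make_shared<ToyStreamAwareFuture>();
  }
};

// Old pattern: a separately constructed plain future, filled from a callback.
std::shared_ptr<ToyFuture> chainManually(const std::shared_ptr<ToyFuture>& parent) {
  auto result = std::make_shared<ToyFuture>();
  std::weak_ptr<ToyFuture> wp = parent;
  parent->addCallback([result, wp]() {
    auto f = wp.lock();
    if (f->hasError()) result->setError(f->exception_ptr());
    else result->markCompleted(f->value() + 1);
  });
  return result;
}

// New pattern: let the parent create the child via then().
std::shared_ptr<ToyFuture> chainWithThen(const std::shared_ptr<ToyFuture>& parent) {
  std::weak_ptr<ToyFuture> wp = parent;
  return parent->then([wp]() {
    auto f = wp.lock();
    return f->value() + 1;  // value() rethrows if the parent failed
  });
}
```

To try it, call `chainManually` and `chainWithThen` on a `std::make_shared<ToyStreamAwareFuture>()` from a `main` function, complete the parent, and compare `kind()` on the two results: only the `then` child reports `stream-aware`.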
mrshenli added a commit that referenced this pull request on Jan 22, 2021:
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
fixes #50881
fixes #50839
ghstack-source-id: 56c7900
Pull Request resolved: #50949
mrshenli added a commit that referenced this pull request on Jan 22, 2021:
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
fixes #50881
fixes #50839
ghstack-source-id: db34a57
Pull Request resolved: #50949
pritamdamania87 approved these changes on Jan 22, 2021
Comment on lines 141 to +151
   std::weak_ptr<JitFuture> wp = messageJitFuture;
-  messageJitFuture->addCallback(
-      at::wrapPropagateTLSState<void>([pyJitFuture, wp]() {
+  return messageJitFuture->then(
+      at::wrapPropagateTLSState<IValue>([wp]() {
         auto future = wp.lock();
         if (future->hasError()) {
-          pyJitFuture->setError(future->exception_ptr());
+          std::rethrow_exception(future->exception_ptr());
         } else {
-          pyJitFuture->markCompleted(
-              toPyIValue(*future->value().toCustomClass<Message>()));
+          return toPyIValue(*future->value().toCustomClass<Message>());
         }
-      }));
-
-  return pyJitFuture;
+      }),
+      PyObjectType::get());
pritamdamania87 (Contributor) commented:
If possible, it would be nice to add some unit tests that would consistently fail without this patch.
mrshenli (Contributor, Author) replied:
It's hard to make it fail consistently, but #50839 fails fairly frequently without this fix. I am not sure this is the only bug, but I ran it a few dozen times locally with the fix, and the error didn't occur.
lw approved these changes on Jan 22, 2021
lw (Contributor) left a comment:
Good catch, thanks for fixing!
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request on Apr 24, 2026:
Summary: Pull Request resolved: pytorch#50949
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
fixes pytorch#50881
fixes pytorch#50839
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D26020458
Pulled By: mrshenli
fbshipit-source-id: 25195fbc10b99f4c401ec3ed7a382128464b5f08
Stack from ghstack:
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
fixes #50881
fixes #50839
Differential Revision: D26020458