Skip to content

Fix CUDA RPC Stream Synchronization#50949

Closed
mrshenli wants to merge 4 commits intogh/mrshenli/289/basefrom
gh/mrshenli/289/head
Closed

Fix CUDA RPC Stream Synchronization#50949
mrshenli wants to merge 4 commits intogh/mrshenli/289/basefrom
gh/mrshenli/289/head

Conversation

@mrshenli
Copy link
Copy Markdown
Contributor

@mrshenli mrshenli commented Jan 22, 2021

Stack from ghstack:

When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling rpc_async(...).wait(). This commit
uses Future::then API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

Differential Revision: D26020458

When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

[ghstack-poisoned]
@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Jan 22, 2021

💊 CI failures summary and remediations

As of commit 01f04f7 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

[ghstack-poisoned]
When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

[ghstack-poisoned]
@mrshenli mrshenli requested a review from lw January 22, 2021 17:08
mrshenli added a commit that referenced this pull request Jan 22, 2021
When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

ghstack-source-id: 56c7900
Pull Request resolved: #50949
When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

Differential Revision: [D26020458](https://our.internmc.facebook.com/intern/diff/D26020458)

[ghstack-poisoned]
mrshenli added a commit that referenced this pull request Jan 22, 2021
When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

ghstack-source-id: db34a57
Pull Request resolved: #50949
Comment on lines 141 to +151
std::weak_ptr<JitFuture> wp = messageJitFuture;
messageJitFuture->addCallback(
at::wrapPropagateTLSState<void>([pyJitFuture, wp]() {
return messageJitFuture->then(
at::wrapPropagateTLSState<IValue>([wp]() {
auto future = wp.lock();
if (future->hasError()) {
pyJitFuture->setError(future->exception_ptr());
std::rethrow_exception(future->exception_ptr());
} else {
pyJitFuture->markCompleted(
toPyIValue(*future->value().toCustomClass<Message>()));
return toPyIValue(*future->value().toCustomClass<Message>());
}
}));

return pyJitFuture;
}),
PyObjectType::get());
Copy link
Copy Markdown
Contributor

@pritamdamania87 pritamdamania87 Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible, it would be nice to add some unit tests that would consistently fail without this patch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's hard to make it consistently fail, but #50839 fails pretty frequently without this fix. I am not sure if this is the only bug, but I tried a few tens of times locally, and the error didn't occur.

Copy link
Copy Markdown
Contributor

@lw lw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks for fixing!

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mrshenli merged this pull request in f0e72e5.

@facebook-github-bot facebook-github-bot deleted the gh/mrshenli/289/head branch January 26, 2021 15:21
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#50949

When converting RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses `Future::then` API to create the chained Future, which will
be creating a CUDAFuture if the existing Future is a CUDA one.

fixes pytorch#50881
fixes pytorch#50839

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D26020458

Pulled By: mrshenli

fbshipit-source-id: 25195fbc10b99f4c401ec3ed7a382128464b5f08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants