Restore thread_local states in continuation thread on RPC servers #38512
mrshenli wants to merge 9 commits into gh/mrshenli/179/base from
Conversation
As we gradually make the RPC processing non-blocking on the server side, the processing of the same request can yield and resume on different threads. Hence, we need to populate thread_local states (e.g., the dist autograd context id) in the continuation thread. Fixes #38439
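To make the failure mode concrete, here is a minimal, self-contained model of the pattern this PR applies (`CtxGuard` and `currentCtxId` are illustrative stand-ins, not the actual PyTorch types): snapshot the thread_local value into the lambda capture when the continuation is created, then restore it via an RAII guard on whichever thread eventually runs the continuation.

```cpp
#include <cstdint>
#include <iostream>
#include <thread>

// Illustrative stand-in for the dist autograd context id TLS.
thread_local std::int64_t currentCtxId = -1;

// RAII guard modeled after DistAutogradContextGuard: sets the id and
// restores the previous value when the continuation finishes.
struct CtxGuard {
  explicit CtxGuard(std::int64_t id) : prev_(currentCtxId) {
    currentCtxId = id;
  }
  ~CtxGuard() { currentCtxId = prev_; }
  std::int64_t prev_;
};

int main() {
  currentCtxId = 42; // state set while processing the original request

  // Snapshot the TLS value into the lambda capture, as the PR does with
  // `ctxId = autogradContext->contextId()`.
  auto continuation = [ctxId = currentCtxId]() {
    CtxGuard guard(ctxId); // restore on whatever thread runs this
    std::cout << "continuation sees ctx id " << currentCtxId << "\n";
  };

  // Without the capture + guard, a continuation running on another
  // thread would see the default value (-1) instead of 42.
  std::thread t(continuation);
  t.join();
}
```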
```python
        return t3

    @dist_init
    def test_thread_local_context_id(self):
```
I confirm that this test fails without this fix.
```diff
-     weak = std::weak_ptr<FutureMessage>(wrappedRpcResponseFuture)]() {
+     weak = std::weak_ptr<FutureMessage>(wrappedRpcResponseFuture),
+     threadLocalState = ThreadLocalState(),
+     ctxId = autogradContext->contextId()]() {
```
This is currently the only place on the server side where we have such a continuation, i.e., continuing processing with addCallback, correct?
There are more in other types of requests. But IIUC, this is the only place where we need to propagate the context id.
- In BACKWARD_AUTOGRAD_REQ, it does not need the context id when creating the PropagateGradientsResp obj.
- In SCRIPT_CALL and SCRIPT_REMOTE_CALL, the context id is fixed by #36395 ([DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait()).
- PYTHON_CALL and PYTHON_REMOTE_CALL currently do not yield. But when we add async user functions, we will also need to propagate the TLS there.
- SCRIPT_RREF_FETCH_CALL and PYTHON_RREF_FETCH_CALL do not need the context id either.
```diff
      fromWorkerId,
-     weak = std::weak_ptr<FutureMessage>(wrappedRpcResponseFuture)]() {
+     weak = std::weak_ptr<FutureMessage>(wrappedRpcResponseFuture),
+     threadLocalState = ThreadLocalState(),
```
Do we need TLS state or just the dist autograd context id for now? (I am planning to eventually use TLS state for distributed profiler work, but curious if this is already needed now)
Also, could we potentially reuse the approach taken in record_function_ops.cpp?
This is basically the same here, but there we declare the tls_state outside of the lambda capture and std::move it into the capture. Not sure if there's a difference perf wise.
also cc @ilia-cher, if you have any comments on the usage here.
> Do we need TLS state or just the dist autograd context id for now? (I am planning to eventually use TLS state for distributed profiler work, but curious if this is already needed now)
For now we only need the context id, I think. @xush6528 also pointed out that we will need this for the profiler later, so I added ThreadLocalState as well. I think we could also remove it in this PR and leave it to the profiler-related PR?
> This is basically the same here, but there we declare the tls_state outside of the lambda capture and std::move it into the capture. Not sure if there's a difference perf wise.
Should be the same I guess, as both are rvalue references?
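For the record, a self-contained comparison of the two capture styles (the `State` type here is an illustrative stand-in for ThreadLocalState, not the real class): init-capture from a prvalue can skip the move entirely under C++17 guaranteed elision, while declare-then-std::move costs exactly one move construction.

```cpp
#include <utility>

// Illustrative stand-in for ThreadLocalState: a move-constructible
// snapshot type.
struct State {
  State() = default;
  State(State&&) = default;
};

int main() {
  // Style used in this PR: init-capture from a prvalue. Since C++17,
  // guaranteed elision constructs the capture member in place, so no
  // move happens at all.
  auto cb1 = [state = State()]() { (void)state; };

  // Style used in record_function_ops.cpp: declare outside, then
  // std::move into the capture. This costs exactly one move
  // construction.
  State tls;
  auto cb2 = [state = std::move(tls)]() { (void)state; };

  cb1();
  cb2();
}
```

So the two forms should indeed perform the same within one move construction, consistent with the intuition above.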
You can remove it. I think the profiler state is not needed because no more tensor operations happen in this callback.
The autograd context id is needed because getMessageWithAutograd(..) needs it.
We will need to restore both the profiler state and the autograd context id for the Python async user function continuation.
```python
rref = rpc.remote(dst, DistAutogradTest._slow_add, args=(t1, t2))

with dist_autograd.context() as context_id:
    loss = rref.to_here().sum()
```
Let me write down a note for whoever, like me, is not super clear about FORWARD_AUTOGRAD_REQ.
The rref.to_here() here is a PYTHON_RREF_FETCH_CALL, which is always wrapped in a FORWARD_AUTOGRAD_REQ as long as it's called within an autograd context, because of forceGradRecording on this line:
pytorch/torch/csrc/distributed/rpc/rref_impl.cpp, lines 145 to 149 in dfcea82.
When the PYTHON_RREF_FETCH_RESPONSE is ready (as in this line), the response message is wrapped into a FORWARD_AUTOGRAD_RESP by this line:
pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp, lines 496 to 499 in dfcea82,
where the key call getMessageWithAutograd(...) adds a SendFunction to the current autograd context; the client, on receiving this FORWARD_AUTOGRAD_RESP, will then add a corresponding RecvFunction to its dist autograd context.
If the autograd context is not restored here, the SendFunction will be added to the wrong autograd context, or the call crashes because there is no active context. The subsequent sum() backward will then be propagated back to the server, but in the wrong autograd context.
Why can this test reproduce the thread switch?
By making the rpc.remote(..., _slow_add, ...) request slow, the rref.to_here() request is processed before the RRef value is set.
So the to_here() request callback will run on another thread. The thread that runs it is exactly the thread handling rpc.remote(...), because that thread marks the value as ready and is thus responsible for running the added callbacks.
Since rpc.remote(...) is called outside of a dist autograd context, getMessageWithAutograd(...) does not wrap the PYTHON_REMOTE_CALL into a FORWARD_AUTOGRAD_REQ, so no autograd context is available on this thread.
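To make the thread switch concrete, here is a minimal, self-contained model of the future semantics described above (a toy class, not the actual FutureMessage; the assumption, matching the explanation, is that callbacks queued before completion run inline on whichever thread calls markCompleted):

```cpp
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// Toy future: callbacks added before completion are queued and later
// run on whichever thread calls markCompleted().
struct ToyFuture {
  void addCallback(std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(mu_);
    if (completed_) { cb(); return; }
    cbs_.push_back(std::move(cb));
  }
  void markCompleted() {
    std::vector<std::function<void()>> cbs;
    {
      std::lock_guard<std::mutex> lock(mu_);
      completed_ = true;
      cbs.swap(cbs_);
    }
    for (auto& cb : cbs) cb(); // runs on the completing thread
  }
  std::mutex mu_;
  bool completed_ = false;
  std::vector<std::function<void()>> cbs_;
};

int main() {
  ToyFuture rrefValue;

  // "to_here()" arrives before the value is set, so its continuation is
  // queued...
  rrefValue.addCallback([] {
    std::cout << "to_here continuation ran on thread "
              << std::this_thread::get_id() << "\n";
  });

  // ...and later runs inline on the slow rpc.remote() worker thread,
  // which never set up a dist autograd context.
  std::thread remoteWorker([&] { rrefValue.markCompleted(); });
  remoteWorker.join();
}
```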
```cpp
  // thread_local states there.
  // TODO: Land on a general solution for RPC ThreadLocalState. See
  // https://github.com/pytorch/pytorch/issues/38510
  DistAutogradContextGuard ctxGuard(ctxId);
```
Shouldn't we do this in addCallback itself instead of fixing each callsite of addCallback?
Yes, eventually we should do that, I think. But for now we still have two Futures (utils and ivalue), and not every callback needs to capture ThreadLocalState, so it might be better to fix master first. Let's revisit this when we reach a consensus on how we should implement RPC ThreadLocalState.
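For reference, a hedged sketch of the eventual direction discussed here, folding TLS capture into addCallback itself so each callsite no longer has to do it by hand (the RpcThreadLocalState type and this API are hypothetical; the real design is tracked in #38510):

```cpp
#include <functional>
#include <utility>

// Hypothetical snapshot of whatever thread_local state RPC decides to
// propagate (dist autograd context id, profiler state, ...).
struct RpcThreadLocalState {
  RpcThreadLocalState() { /* capture TLS of the current thread here */ }
  void restoreAndRun(std::function<void()> cb) {
    // Install RAII guards for the captured state, then run the callback.
    cb();
  }
};

struct Future {
  // Every callsite gets TLS propagation for free: the state is captured
  // once, at registration time, and restored wherever the callback runs.
  void addCallback(std::function<void()> cb) {
    addCallbackRaw(
        [state = RpcThreadLocalState(), cb = std::move(cb)]() mutable {
          state.restoreAndRun(std::move(cb));
        });
  }
  void addCallbackRaw(std::function<void()> cb) {
    // Existing queueing logic elided in this sketch.
    (void)cb;
  }
};

int main() {
  Future fut;
  fut.addCallback([] { /* continuation body */ });
}
```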
If we are going to release v1.5.1, we should include this fix.
Restore thread_local states in continuation thread on RPC servers (#38512)
Summary: Pull Request resolved: #38512. As we gradually make the RPC processing non-blocking on the server side, the processing of the same request can yield and resume on different threads. Hence, we need to populate thread_local states (e.g., the dist autograd context id) in the continuation thread. Fixes #38439
Test Plan: Imported from OSS
Differential Revision: D21583642
Pulled By: mrshenli
fbshipit-source-id: a79bce1cb207fd11f1fa02b08465e49badda65fc
Stack from ghstack:
As we gradually make the RPC processing non-blocking on the server side,
the processing of the same request can yield and resume on different threads.
Hence, we need to populate thread_local states (e.g., the dist autograd
context id) in the continuation thread.
Fixes #38439
Differential Revision: D21583642