
[DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() #36395

Closed

xush6528 wants to merge 2 commits into master from export-D7857991
Conversation

xush6528 (Contributor) commented Apr 10, 2020

Stack:
    • #36395 [DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait()

As titled.

Differential Revision: D7857991
Differential Version: 101928130
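The fix amounts to capturing a piece of thread-local global state (the dist autograd current context id) on the calling thread before `future.wait()` hands execution to a different JIT thread, and restoring it on that thread. Below is a minimal Python sketch of that general capture-and-restore pattern; the `ctx_id` state and all function names are illustrative stand-ins, not the actual PyTorch C++ internals.

```python
import threading

# Toy stand-in for the dist autograd "current context id", which in the
# real code is thread-local state on the caller.
_state = threading.local()

def run_callback_on_new_thread(callback):
    # Simulates resuming a continuation on a different (JIT) thread.
    t = threading.Thread(target=callback)
    t.start()
    t.join()

def broken(results):
    # Bug: reads the thread-local inside the callback. The new thread
    # has no ctx_id, so the context is silently lost.
    def cb():
        results.append(getattr(_state, "ctx_id", None))
    run_callback_on_new_thread(cb)

def fixed(results):
    # Fix: capture the context id on the calling thread, then restore it
    # on the new thread before doing any work.
    captured = _state.ctx_id
    def cb():
        _state.ctx_id = captured
        results.append(_state.ctx_id)
    run_callback_on_new_thread(cb)

_state.ctx_id = 42
out = []
broken(out)
fixed(out)
print(out)  # [None, 42]
```

The key point is that the capture must happen before the thread switch; once the callback is already running on the other thread, the caller's thread-local state is unreachable.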
@xush6528 xush6528 requested a review from apaszke as a code owner April 10, 2020 16:55
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Apr 10, 2020
@xush6528 xush6528 changed the title [DistAutograd x JIT] Capture global state dist autograd current context id before thread switching triggered by JIT future.wait() [WIP] [DistAutograd x JIT] Capture global state dist autograd current context id before thread switching triggered by JIT future.wait() Apr 10, 2020
dr-ci bot commented Apr 10, 2020

💊 Build failures summary and remediations

As of commit 6e535e3 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)


@xush6528 xush6528 changed the title [WIP] [DistAutograd x JIT] Capture global state dist autograd current context id before thread switching triggered by JIT future.wait() [WIP] [DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() Apr 10, 2020
pritamdamania87 (Contributor) left a comment


Can we also add a unit test that would've failed without this fix?

@jjlilley

@xush6528 - thanks for fixing this!

Re: a test case - if it's straightforward to construct something, let's do it. My only concern is making this a hard requirement and delaying a straightforward correctness fix like this for a week. The existing c++ test coverage is frankly fairly weak - not that it shouldn't be fixed, but it's not unreasonable to do that out-of-band.

xush6528 (Contributor, Author) commented

@jjlilley

It's an easy test case: make two RPCs in TorchScript and check that gradients come back each time.

Without this fix, only the tensors passed in the first RPC get gradients back.
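The failure mode that test targets can be mimicked without torch: two sequential "RPCs" inside one script body, where the second runs only after execution has resumed on a different thread. A toy simulation follows; every name here is hypothetical (the real logic lives in the dist autograd / JIT C++ layer), and recording a context id stands in for recording gradients in the autograd context.

```python
import threading

# Thread-local stand-in for the dist autograd current context id.
_state = threading.local()

def script_body(recorded, restore_ctx):
    captured = getattr(_state, "ctx_id", None)
    # First "RPC": still on the calling thread, context id is visible.
    recorded.append(getattr(_state, "ctx_id", None))
    done = threading.Event()
    def resume_on_jit_thread():
        if restore_ctx:
            # The fix: restore the captured context before resuming.
            _state.ctx_id = captured
        # Second "RPC": after the thread switch. Without the restore,
        # the context id is gone and this call is not recorded.
        recorded.append(getattr(_state, "ctx_id", None))
        done.set()
    threading.Thread(target=resume_on_jit_thread).start()
    done.wait()

_state.ctx_id = 7
buggy, fixed = [], []
script_body(buggy, restore_ctx=False)
script_body(fixed, restore_ctx=True)
print(buggy, fixed)  # [7, None] [7, 7]
```

This mirrors the symptom described above: in the buggy variant only the first call sees the context (only the first RPC's tensors would get gradients), while the fixed variant sees it on both calls.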

@xush6528 xush6528 changed the title [WIP] [DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() [DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() Apr 10, 2020
Differential Revision: D7857991
Differential Version: 101948084
@xush6528 xush6528 added the module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer label Apr 10, 2020
facebook-github-bot (Contributor) commented

This pull request has been merged in ae452a8.

ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
[DistAutograd x JIT] Capture global state, dist autograd current context id, before thread switching triggered by JIT future.wait() (pytorch#36395)

Summary:
Pull Request resolved: pytorch#36395

As titled.

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork -- test_restore_context_after_swtich_to_jit_thread

buck build mode/dev-nosan //caffe2/test/distributed/rpc/jit:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/jit/dist_autograd_fork\#binary.par -r test_restore_context_after_swtich_to_jit_thread
```

```
buck test mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork -- test_backward_simple_script_call

buck build mode/dev-nosan //caffe2/test/distributed/rpc:dist_autograd_fork

buck-out/gen/caffe2/test/distributed/rpc/dist_autograd_fork\#binary.par -r test_backward_simple_script_call
```

Differential Revision: D7857991

fbshipit-source-id: 168e0e3846a50ea92d4f9450a30ccc6c13e2fcec
@facebook-github-bot facebook-github-bot deleted the export-D7857991 branch April 14, 2020 14:16
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026

Labels

Merged · module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer) · oncall: jit (Add this issue/PR to JIT oncall triage queue)

5 participants