Skip to content

Use Future's then() API to fix RPC profiling#38352

Closed
rohan-varma wants to merge 6 commits intogh/rohan-varma/127/basefrom
gh/rohan-varma/127/head
Closed

Use Future's then() API to fix RPC profiling#38352
rohan-varma wants to merge 6 commits intogh/rohan-varma/127/basefrom
gh/rohan-varma/127/head

Conversation

@rohan-varma
Copy link
Copy Markdown
Contributor

@rohan-varma rohan-varma commented May 12, 2020

Stack from ghstack:

Fixes the RPC profiling by using the then() API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a _set_profiling_future to set the profiling future, and _get_profiling_future to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: D21506940

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented May 12, 2020

💊 CI failures summary and remediations

As of commit c4dbe49 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

May 14 07:37:27 E AssertionError: False is not true
May 14 07:37:27      
May 14 07:37:27         def shutdown_fun(): 
May 14 07:37:27             workspace.FeedBlob('data', 'shutdown') 
May 14 07:37:27      
May 14 07:37:27         worker_coordinator = parallel_workers.init_workers( 
May 14 07:37:27             dummy_worker, shutdown_fun=shutdown_fun 
May 14 07:37:27         ) 
May 14 07:37:27         worker_coordinator.start() 
May 14 07:37:27      
May 14 07:37:27 >       self.assertTrue(worker_coordinator.stop()) 
May 14 07:37:27 E       AssertionError: False is not true 
May 14 07:37:27  
May 14 07:37:27 ../.local/lib/python3.6/site-packages/caffe2/python/parallel_workers_test.py:116: AssertionError 
May 14 07:37:27 ----------------------------- Captured stdout call ----------------------------- 
May 14 07:37:27 Wait for workers to die: train 
May 14 07:37:27 Worker <Thread(parallel_workers worker id 4, started daemon 139753781446400)> failed to close while waiting 
May 14 07:37:27 Worker <Thread(parallel_workers worker id 5, started daemon 139753789839104)> failed to close while waiting 
May 14 07:37:27 All workers terminated: False 
May 14 07:37:27 - generated xml file: /var/lib/jenkins/workspace/caffe2_tests/python/result.xml - 
May 14 07:37:27 ---------- onnx coverage: ---------- 
May 14 07:37:27 Operators (passed/loaded/total): 0/0/185 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 21 times.

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 12, 2020
Pull Request resolved: #38352

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 103957346

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)
Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 13, 2020
Pull Request resolved: #38352

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 103983281

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)
Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 13, 2020
Pull Request resolved: #38352

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104042479

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)
Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing! I added some minor comments.

auto& rec = getRecordFunctionFromTensor(handle);
rec._end();
});
return fut->constValue();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment to explain why we need to forward the return value here. IIUC, it is because _invoke_rpc actually returns this future?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we get the future which has the actual result from _invoke_rpc. Then, if profiling we run this, and we want to ensure that the correct value of the future from _invoke_rpc is propagated. Will add this to the comment

_call_end_callbacks_on_fut(tensor, fut);
auto profiledFut = _call_end_callbacks_on_fut(tensor, fut);
// return future that completes when profiling callbacks have run.
jit::push(stack, profiledFut);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, push takes rvalue reference. If so, shall we use std::move() on the profilerFut?

Comment thread torch/csrc/distributed/rpc/init.cpp Outdated
Comment on lines +340 to +358
.def(
"_get_profiling_future",
[](const PyRRef& self) {
return std::make_shared<jit::PythonFutureWrapper>(
self.getProfilingFuture());
},
py::call_guard<py::gil_scoped_release>(),
R"(
Returns future that completes when the profiling event corresponding
to the creation of this RRef on the remote node has been recorded.
)")
.def(
"_set_profiling_future",
[](PyRRef& self,
const std::shared_ptr<jit::PythonFutureWrapper>&
wrappedFuture) {
self.setProfilingFuture(wrappedFuture->fut);
},
py::call_guard<py::gil_scoped_release>(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two seems are very light functions, should I hold GIL instead to avoid ctx switch?

Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lint failure is real.

  {
    path: 'torch/testing/_internal/distributed/rpc/jit/rpc_test.py',
    start_line: 2,
    end_line: 2,
    start_column: 1,
    end_column: 1,
    annotation_level: 'failure',
    message: "[F401] 'unittest' imported but unused"
  }

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request May 14, 2020
Pull Request resolved: #38352

Fixes the RPC profiling by using the `then()` API added in #37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104086114

Differential Revision: [D21506940](https://our.internmc.facebook.com/intern/diff/D21506940/)
@rohan-varma rohan-varma requested a review from mrshenli May 14, 2020 07:07
Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Let's land!

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 4d4895a.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#38352

Fixes the RPC profiling by using the `then()` API added in pytorch#37311. Instead of adding a regular callback, we return a new future that completes when the profiling callback is finished. This is transparent to the user as the future still completes with the value of the original future (i.e. the RPC's return value)

To make this work for RRef, we add a `_set_profiling_future` to set the profiling future, and `_get_profiling_future` to retrieve this future and wait on it in the tests.

Re-enabled profiling tests and stress tested them 1000 times to verify the fix
ghstack-source-id: 104086114

Test Plan: Re-enabled profiling tests

Differential Revision: D21506940

fbshipit-source-id: 35cde22f0551c825c9bc98ddc24cca412878a63a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants