
[WIP] per-RPC device mapping #64901

Closed
pbelevich wants to merge 26 commits into gh/pbelevich/155/base from gh/pbelevich/155/head

Conversation

@pbelevich
Contributor

@pbelevich pbelevich commented Sep 13, 2021

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label Sep 13, 2021
pbelevich added a commit that referenced this pull request Sep 13, 2021
ghstack-source-id: 205d104
Pull Request resolved: #64901
@facebook-github-bot facebook-github-bot added cla signed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Sep 13, 2021
@facebook-github-bot
Contributor

facebook-github-bot commented Sep 13, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit a018e7d (more details on the Dr. CI page):


  • 11/11 failures introduced in this PR

🕵️ 8 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / build-docs (cpp) (1/8)

Step: "Unknown" (full log | diagnosis details | 🔁 rerun)

2021-10-20T18:25:19.6378111Z error: could not l...modules/third_party/zstd/config: Permission denied
2021-10-20T18:25:19.6244552Z http.https://github.com/.extraheader
2021-10-20T18:25:19.6255161Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config: Permission denied
2021-10-20T18:25:19.6267750Z Entering 'third_party/tensorpipe/third_party/pybind11'
2021-10-20T18:25:19.6284884Z http.https://github.com/.extraheader
2021-10-20T18:25:19.6295382Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config: Permission denied
2021-10-20T18:25:19.6308019Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang'
2021-10-20T18:25:19.6324758Z http.https://github.com/.extraheader
2021-10-20T18:25:19.6335517Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config: Permission denied
2021-10-20T18:25:19.6349904Z Entering 'third_party/zstd'
2021-10-20T18:25:19.6367246Z http.https://github.com/.extraheader
2021-10-20T18:25:19.6378111Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/zstd/config: Permission denied
2021-10-20T18:25:19.6438429Z Cleaning up orphan processes
2021-10-20T18:25:19.6673738Z Terminate orphan process: pid (9127) (docker)

See CircleCI build pytorch_macos_10_13_py3_test (2/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Oct 20 19:09:13 [E request_callback_no_python.c...yUniqueId(created_on=0, local_id=0) to be created.
Oct 20 19:08:59 ok (4.379s)
Oct 20 19:09:01   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTest) ... [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:01 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:01 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:01 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:08 ok (8.447s)
Oct 20 19:09:10   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTest) ... [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:10 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:10 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:10 [W tensorpipe_agent.cpp:180] Failed to look up the IP address for the hostname (EAI_NONAME: unknown node or service (this error originated at tensorpipe/transport/uv/utility.cc:97)), defaulting to 127.0.0.1
Oct 20 19:09:13 [E request_callback_no_python.cpp:573] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "../torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
Oct 20 19:09:13 Exception raised from getOwnerRRef at ../torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
Oct 20 19:09:13 frame #0: c10::Error::Error(c10::SourceLocation, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >) + 98 (0x115e4f662 in libc10.dylib)
Oct 20 19:09:13 frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 106 (0x115e4ddda in libc10.dylib)
Oct 20 19:09:13 frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&) + 64 (0x115e4e010 in libc10.dylib)
Oct 20 19:09:13 frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 1663 (0x11a051def in libtorch_cpu.dylib)
Oct 20 19:09:13 frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 86 (0x11a03af96 in libtorch_cpu.dylib)
Oct 20 19:09:13 frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 179 (0x11549ba43 in libtorch_python.dylib)
Oct 20 19:09:13 frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 617 (0x11a039859 in libtorch_cpu.dylib)
Oct 20 19:09:13 frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const + 74 (0x11549c9fa in libtorch_python.dylib)
Oct 20 19:09:13 frame #8: c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> > c10::ivalue::Future::thenAsync<torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1>(torch::distributed::rpc::RequestCallbackNoPython::processMessage(torch::distributed::rpc::Message&, std::__1::vector<c10::Stream, std::__1::allocator<c10::Stream> >) const::$_1, std::__1::shared_ptr<c10::Type>)::'lambda'(c10::ivalue::Future&)::operator()(c10::ivalue::Future&) + 223 (0x11a041e9f in libtorch_cpu.dylib)

See GitHub Actions build Lint / clang-tidy (3/8)

Step: "Check for warnings" (full log | diagnosis details | 🔁 rerun)

2021-10-20T18:08:19.8473359Z /__w/pytorch/pytor...t [performance-move-const-arg,-warnings-as-errors]
2021-10-20T18:08:19.8458379Z                          ^
2021-10-20T18:08:19.8459174Z /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:58:11: note: the use and move are unsequenced, i.e. there is no guarantee about the order in which they are evaluated
2021-10-20T18:08:19.8460067Z           prc.serializedPyObj(),
2021-10-20T18:08:19.8460659Z           ^
2021-10-20T18:08:19.8462698Z /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:191:43: error: std::move of the const variable 'reversed_dm' has no effect; remove std::move() or make the variable non-const [performance-move-const-arg,-warnings-as-errors]
2021-10-20T18:08:19.8464895Z             ScriptResp(jitFuture.value(), std::move(reversed_dm)).toMessage());
2021-10-20T18:08:19.8465522Z                                           ^~~~~~~~~~           ~
2021-10-20T18:08:19.8468327Z /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:208:52: error: std::move of the const variable 'reversed_dm' has no effect; remove std::move() or make the variable non-const [performance-move-const-arg,-warnings-as-errors]
2021-10-20T18:08:19.8470200Z                 serializePyObject(future.value()), std::move(reversed_dm))
2021-10-20T18:08:19.8471022Z                                                    ^~~~~~~~~~           ~
2021-10-20T18:08:19.8473359Z /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:260:48: error: std::move of the const variable 'reversed_dm' has no effect; remove std::move() or make the variable non-const [performance-move-const-arg,-warnings-as-errors]
2021-10-20T18:08:19.8475403Z                 std::move(result).toIValues(), std::move(reversed_dm))
2021-10-20T18:08:19.8475954Z                                                ^~~~~~~~~~           ~
2021-10-20T18:08:19.8476388Z Warnings detected!
2021-10-20T18:08:19.8477379Z Summary:
2021-10-20T18:08:19.8478521Z [clang-diagnostic-pessimizing-move] occurred 1 times
2021-10-20T18:08:19.8479690Z     /__w/pytorch/pytorch/torch/csrc/distributed/rpc/python_resp.cpp:24
2021-10-20T18:08:19.8480373Z 
2021-10-20T18:08:19.8481379Z [bugprone-use-after-move] occurred 2 times
2021-10-20T18:08:19.8482678Z     /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:51
2021-10-20T18:08:19.8483902Z     /__w/pytorch/pytorch/torch/csrc/distributed/rpc/request_callback_impl.cpp:58
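The performance-move-const-arg errors above all have the same cause: calling std::move on a const local. The move silently degrades to a copy, because a move constructor cannot bind to a const rvalue. A minimal sketch of the pattern and the fix, using a hypothetical DeviceMap alias and Resp struct rather than the PR's actual types:

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the DeviceMap used in the PR.
using DeviceMap = std::vector<int>;

struct Resp {
  std::string body;
  DeviceMap dm;
};

// std::move on a const object yields `const DeviceMap&&`, which the move
// constructor cannot bind to, so the copy constructor runs instead.
// clang-tidy's performance-move-const-arg flags exactly this pattern.
inline Resp build_resp_const(const DeviceMap reversed_dm) {
  return Resp{"ok", std::move(reversed_dm)};  // copies despite std::move
}

// Fix: make the parameter non-const so the move actually moves.
inline Resp build_resp_mut(DeviceMap reversed_dm) {
  return Resp{"ok", std::move(reversed_dm)};  // genuinely moves
}
```

Dropping the const (or dropping the std::move) silences the warning; either way the generated code no longer promises a move it cannot deliver.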

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / build-docs (python) (4/8)

Step: "Build python docs" (full log | diagnosis details | 🔁 rerun)

2021-10-20T18:24:59.7470150Z error: could not l...modules/third_party/zstd/config: Permission denied
2021-10-20T18:24:59.7333532Z http.https://github.com/.extraheader
2021-10-20T18:24:59.7343454Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config: Permission denied
2021-10-20T18:24:59.7357371Z Entering 'third_party/tensorpipe/third_party/pybind11'
2021-10-20T18:24:59.7375089Z http.https://github.com/.extraheader
2021-10-20T18:24:59.7385235Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config: Permission denied
2021-10-20T18:24:59.7397346Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang'
2021-10-20T18:24:59.7415961Z http.https://github.com/.extraheader
2021-10-20T18:24:59.7426756Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config: Permission denied
2021-10-20T18:24:59.7440960Z Entering 'third_party/zstd'
2021-10-20T18:24:59.7460086Z http.https://github.com/.extraheader
2021-10-20T18:24:59.7470150Z error: could not lock config file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/zstd/config: Permission denied
2021-10-20T18:24:59.7533910Z Cleaning up orphan processes

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (distributed, 1, 1, linux.2xlarge) (5/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-10-20T18:27:21.2609972Z test_udf_remote_...yUniqueId(created_on=0, local_id=0) to be created.
2021-10-20T18:26:42.5269218Z frame #12: c10::ThreadPool::main_loop(unsigned long) + 0x2a3 (0x7fed34ca1063 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-20T18:26:42.5270579Z frame #13: <unknown function> + 0xc92bd (0x7fed34bcf2bd in /opt/conda/lib/libstdc++.so.6)
2021-10-20T18:26:42.5272188Z frame #14: <unknown function> + 0x76ba (0x7fed4a0c56ba in /lib/x86_64-linux-gnu/libpthread.so.0)
2021-10-20T18:26:42.5273575Z frame #15: clone + 0x6d (0x7fed49dfb51d in /lib/x86_64-linux-gnu/libc.so.6)
2021-10-20T18:26:42.5274168Z 
2021-10-20T18:26:42.7634506Z ok (3.315s)
2021-10-20T18:26:57.5979637Z   test_rpc_builtin_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (14.834s)
2021-10-20T18:27:06.4224471Z   test_rpc_script_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (8.824s)
2021-10-20T18:27:09.7383217Z   test_rref_to_here_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (3.316s)
2021-10-20T18:27:17.1614164Z   test_udf_remote_message_delay_timeout (__main__.FaultyFaultyAgentRpcTest) ... ok (7.423s)
2021-10-20T18:27:21.2609972Z   test_udf_remote_message_delay_timeout_to_self (__main__.FaultyFaultyAgentRpcTest) ... [E request_callback_no_python.cpp:573] Received error while processing request type 261: falseINTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp":387, please report a bug to PyTorch. Expected OwnerRRef with id GloballyUniqueId(created_on=0, local_id=0) to be created.
2021-10-20T18:27:21.2612151Z Exception raised from getOwnerRRef at /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/rref_context.cpp:387 (most recent call first):
2021-10-20T18:27:21.2614428Z frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7fa7e8cc3229 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-20T18:27:21.2616286Z frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xd2 (0x7fa7e8cbf7d2 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-20T18:27:21.2618472Z frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4e (0x7fa7e8cc116e in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
2021-10-20T18:27:21.2620408Z frame #3: torch::distributed::rpc::RRefContext::getOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, bool) + 0x4cb (0x7fa7ecf8ad6b in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-20T18:27:21.2622805Z frame #4: torch::distributed::rpc::RequestCallbackNoPython::assignOwnerRRef(torch::distributed::rpc::GloballyUniqueId const&, torch::distributed::rpc::GloballyUniqueId const&, c10::intrusive_ptr<c10::ivalue::Future, c10::detail::intrusive_target_default_null_type<c10::ivalue::Future> >) const + 0x71 (0x7fa7ecf7add1 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-20T18:27:21.2625395Z frame #5: torch::distributed::rpc::RequestCallbackImpl::processPythonRemoteCall(torch::distributed::rpc::RpcCommandBase&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0xc8 (0x7fa7f5492a98 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-10-20T18:27:21.2627773Z frame #6: torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x194 (0x7fa7ecf7fa24 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
2021-10-20T18:27:21.2630447Z frame #7: torch::distributed::rpc::RequestCallbackImpl::processRpcWithErrors(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const + 0x65 (0x7fa7f54920a5 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
2021-10-20T18:27:21.2632140Z frame #8: <unknown function> + 0x407d57a (0x7fa7ecf7c57a in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (distributed, 1, 1, linux.8xlarge.nvidia.gpu) (6/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-10-20T19:50:29.0987391Z ERROR [1.517s]: te...or (__main__.TensorPipeTensorPipeAgentCudaRpcTest)
2021-10-20T19:50:29.0979335Z     fn()
2021-10-20T19:50:29.0980259Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 112, in wrapper
2021-10-20T19:50:29.0981093Z     return func(*args, **kwargs)
2021-10-20T19:50:29.0982309Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 6835, in test_cuda_future_can_extract_list_with_cuda_sparse_tensor
2021-10-20T19:50:29.0983330Z     self._test_cuda_future_extraction(
2021-10-20T19:50:29.0984652Z AttributeError: 'TensorPipeTensorPipeAgentCudaRpcTest' object has no attribute '_test_cuda_future_extraction'
2021-10-20T19:50:29.0985564Z 
2021-10-20T19:50:29.0985795Z 
2021-10-20T19:50:29.0986004Z 
2021-10-20T19:50:29.0986384Z ======================================================================
2021-10-20T19:50:29.0987391Z ERROR [1.517s]: test_cuda_future_can_extract_list_with_cuda_tensor (__main__.TensorPipeTensorPipeAgentCudaRpcTest)
2021-10-20T19:50:29.0988665Z ----------------------------------------------------------------------
2021-10-20T19:50:29.0989295Z Traceback (most recent call last):
2021-10-20T19:50:29.0990336Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 424, in wrapper
2021-10-20T19:50:29.0991223Z     self._join_processes(fn)
2021-10-20T19:50:29.0992283Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 643, in _join_processes
2021-10-20T19:50:29.0993191Z     self._check_return_codes(elapsed_time)
2021-10-20T19:50:29.0994323Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 688, in _check_return_codes
2021-10-20T19:50:29.0995201Z     raise RuntimeError(error)
2021-10-20T19:50:29.0996221Z RuntimeError: Process 0 exited with error code 10 and exception:
2021-10-20T19:50:29.0996896Z Traceback (most recent call last):

See GitHub Actions build linux-xenial-cuda11.3-py3.6-gcc7 / test (default, 1, 2, linux.8xlarge.nvidia.gpu) (7/8)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-10-20T20:00:42.3184978Z AssertionError: Fa...9612650210993), which occurred at index (1, 0, 3).
2021-10-20T20:00:42.3174752Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-10-20T20:00:42.3175655Z     result = test(self, **param_kwargs)
2021-10-20T20:00:42.3176820Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-10-20T20:00:42.3177701Z     return test(*args, **kwargs)
2021-10-20T20:00:42.3178244Z   File "test_ops.py", line 1065, in test_neg_view
2021-10-20T20:00:42.3178879Z     lambda x: not torch.is_complex(x))
2021-10-20T20:00:42.3179512Z   File "test_ops.py", line 1005, in _test_math_view
2021-10-20T20:00:42.3180432Z     self.assertEqual(expected_forward, forward_with_mathview)
2021-10-20T20:00:42.3181994Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1889, in assertEqual
2021-10-20T20:00:42.3183021Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-10-20T20:00:42.3184978Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1e-07 and atol=1e-07, found 20 element(s) (out of 45) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.4712163490720924 (-0.49827596172230343 vs. -0.027059612650210993), which occurred at index (1, 0, 3).
2021-10-20T20:00:42.3186335Z 		
2021-10-20T20:00:42.3186841Z ✅ 8314 Passed
2021-10-20T20:00:42.3187454Z 💨 3741 Skipped
2021-10-20T20:00:42.3187996Z 🚨 1 Failed
2021-10-20T20:00:42.3545168Z ##[group]Run # Remove any previous test reports if they exist
2021-10-20T20:00:42.3546016Z # Remove any previous test reports if they exist
2021-10-20T20:00:42.3546623Z rm -f test-reports-*.zip
2021-10-20T20:00:42.3547306Z zip -r "test-reports-${FILE_SUFFIX}.zip" test -i '*.xml'
2021-10-20T20:00:42.3560045Z shell: /usr/bin/bash -e {0}
2021-10-20T20:00:42.3560464Z env:

See GitHub Actions build Test tools / test (8/8)

Step: "Test tools" (full log | diagnosis details | 🔁 rerun)

2021-10-20T17:59:17.8774634Z AssertionError: 'x...ck (most recent call last):\n Fil[154 chars]\'\n'
2021-10-20T17:59:17.8764005Z   File "/opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/unittest/async_case.py", line 65, in _callTestMethod
2021-10-20T17:59:17.8765007Z     self._callMaybeAsync(method)
2021-10-20T17:59:17.8766037Z   File "/opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/unittest/async_case.py", line 88, in _callMaybeAsync
2021-10-20T17:59:17.8767172Z     return self._asyncioTestLoop.run_until_complete(fut)
2021-10-20T17:59:17.8768473Z   File "/opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/asyncio/base_events.py", line 641, in run_until_complete
2021-10-20T17:59:17.8769358Z     return future.result()
2021-10-20T17:59:17.8770354Z   File "/opt/hostedtoolcache/Python/3.10.0/x64/lib/python3.10/unittest/async_case.py", line 102, in _asyncioLoopRunner
2021-10-20T17:59:17.8771271Z     ret = await awaitable
2021-10-20T17:59:17.8772106Z   File "/home/runner/work/pytorch/pytorch/tools/test/test_actions_local_runner.py", line 187, in test_mypy
2021-10-20T17:59:17.8773065Z     self.assertEqual(expected, f.getvalue())
2021-10-20T17:59:17.8774634Z AssertionError: 'x my[29 chars]on)\ntorch/some_stubs.pyi:3:17: error: Incompa[788 chars]t]\n' != 'x my[29 chars]on)\nTraceback (most recent call last):\n  Fil[154 chars]\'\n'
2021-10-20T17:59:17.8775732Z   x mypy (skipped typestub generation)
2021-10-20T17:59:17.8776323Z + Traceback (most recent call last):
2021-10-20T17:59:17.8777073Z +   File "/home/runner/work/pytorch/pytorch/tools/linter/mypy_wrapper.py", line 27, in <module>
2021-10-20T17:59:17.8777779Z +     import mypy.api
2021-10-20T17:59:17.8778548Z + ModuleNotFoundError: No module named 'mypy'
2021-10-20T17:59:17.8779809Z - torch/some_stubs.pyi:3:17: error: Incompatible types in assignment (expression has type "None", variable has type "str")  [assignment]
2021-10-20T17:59:17.8781332Z - torch/some_stubs.pyi:4:17: error: Incompatible types in assignment (expression has type "float", variable has type "str")  [assignment]
2021-10-20T17:59:17.8782677Z - torch/some_cool_file.py:3:17: error: Incompatible types in assignment (expression has type "None", variable has type "str")  [assignment]
2021-10-20T17:59:17.8784012Z - torch/some_cool_file.py:4:17: error: Incompatible types in assignment (expression has type "float", variable has type "str")  [assignment]
2021-10-20T17:59:17.8785371Z - caffe2/some_cool_file.py:3:17: error: Incompatible types in assignment (expression has type "None", variable has type "str")  [assignment]

3 failures not recognized by patterns:

Job Step Action
GitHub Actions linux-xenial-py3.6-gcc5.4 / test (docs_test, 1, 1, linux.2xlarge) Test 🔁 rerun
GitHub Actions Lint / flake8-py3 Fail if there were any warnings 🔁 rerun
GitHub Actions Lint / mypy Run mypy 🔁 rerun

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse @SciPioneer @H-Huang cbalioglu gcramer23

[ghstack-poisoned]
pbelevich added a commit that referenced this pull request Sep 13, 2021
ghstack-source-id: 6e8ea14
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 15, 2021
ghstack-source-id: f2db8bb
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 15, 2021
ghstack-source-id: cc211d1
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 15, 2021
ghstack-source-id: e6e1c07
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 16, 2021
ghstack-source-id: 855040b
Pull Request resolved: #64901
Contributor

@mrshenli mrshenli left a comment


Just a minor comment: it might be easier to review and test if we split this into incremental PRs that each target one API at a time.

py::arg("device_map") = DeviceMap(),
py::arg("timeout") = py::cast(kUnsetRpcTimeout),
py::call_guard<py::gil_scoped_release>(),
R"(
Contributor

Add this new argument to the docstring?

.def(
"to_here",
&PyRRef::toHere,
py::arg("device_map") = DeviceMap(),
Contributor

What's the default behavior here? Will it inherit the global device map or fall back to CPU RPC?

: payload_(std::move(payload)),
tensors_(std::move(tensors)),
type_(type),
deviceMap_(deviceMap) {}
Contributor

do we need std::move here?

}

void Message::setDeviceMap(DeviceMap&& deviceMap) {
deviceMap_ = deviceMap;
Contributor

ditto?
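Both of these comments point at the same pattern: `deviceMap_(deviceMap)` in the initializer list and `deviceMap_ = deviceMap` in the setter copy the map, even when the parameter is an rvalue reference, because a named rvalue reference is itself an lvalue inside the function body. A minimal sketch with stand-in types (DeviceMap here is just a std::map alias, not the real one):

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-in for the real DeviceMap alias.
using DeviceMap = std::map<int, int>;

class Message {
 public:
  explicit Message(DeviceMap deviceMap)
      // Without std::move this initializer would copy; with it, the map's
      // internal storage is stolen from the by-value parameter.
      : deviceMap_(std::move(deviceMap)) {}

  // A DeviceMap&& parameter is a named rvalue reference, i.e. an lvalue
  // here, so assigning it without std::move would still copy.
  void setDeviceMap(DeviceMap&& deviceMap) {
    deviceMap_ = std::move(deviceMap);
  }

  const DeviceMap& deviceMap() const { return deviceMap_; }

 private:
  DeviceMap deviceMap_;
};
```

So the answer to both questions is yes: adding std::move in the initializer and in the setter body turns these silent copies into moves.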

pc.serializedPyObj(), pc.isAsyncExecution());
pc.serializedPyObj(),
std::move(pc).moveDeviceMap(),
pc.isAsyncExecution());
Contributor

using pc after move?
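The concern is that the call passes std::move(pc).moveDeviceMap() alongside pc.serializedPyObj() and pc.isAsyncExecution() in a single argument list, and C++ leaves function-argument evaluation order unspecified, so a read of pc may run after the move; this is what bugprone-use-after-move reports. A minimal sketch with a hypothetical Call type standing in for the real PythonRemoteCall (here only the deviceMap member is moved, so the bool read happens to stay valid either way, but the pattern is fragile):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

using DeviceMap = std::map<int, int>;

// Hypothetical stand-in for PythonRemoteCall ("pc" in the diff).
struct Call {
  std::string payload;
  DeviceMap deviceMap;
  bool async;

  // rvalue-qualified accessor, analogous to std::move(pc).moveDeviceMap()
  DeviceMap moveDeviceMap() && { return std::move(deviceMap); }
  bool isAsyncExecution() const { return async; }
};

struct Result {
  DeviceMap dm;
  bool async;
};

inline Result make(DeviceMap dm, bool async) { return {std::move(dm), async}; }

// Flagged pattern: the two arguments may be evaluated in either order, so
// pc.isAsyncExecution() can observe pc after moveDeviceMap() has run.
inline Result risky(Call pc) {
  return make(std::move(pc).moveDeviceMap(), pc.isAsyncExecution());
}

// Fix: read everything needed from pc before the move.
inline Result safe(Call pc) {
  const bool async = pc.isAsyncExecution();
  return make(std::move(pc).moveDeviceMap(), async);
}
```

Hoisting the reads into named locals before the move (as in `safe`) makes the sequencing explicit and silences the clang-tidy finding.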

namespace distributed {
namespace rpc {

// TODO(pbelevich)
Contributor

I assume the TODO here means consolidation? :)

Comment on lines +4 to +6
#include <torch/csrc/distributed/rpc/rpc_agent.h>
#include <torch/csrc/distributed/rpc/types.h>
#include <torch/csrc/jit/serialization/pickle.h>
Contributor

I see these two headers are also included in the .cpp files of the derived classes. Do we still need them here? Will this create a circular dependency?


c10::intrusive_ptr<Message> ScriptRRefFetchRet::toMessageImpl() && {
auto res = fromIValues(values_, type_);
res->setDeviceMap(std::move(deviceMap_));
Contributor

Curious: why do we set this after creating the message, instead of letting fromIValues handle deviceMap_ as well?

RRefFetchRet(
std::vector<at::IValue> values,
MessageType type,
DeviceMap&& deviceMap)
Contributor

does the RRefFetchRet need a deviceMap?

Contributor Author

rref.to_here(device_map) propagates device_map to ScriptRRefFetchRet/PythonRRefFetchRet so that the result is placed on the target device

public:
RRefMessageBase(const RRefId& rrefId, MessageType type)
: rrefId_(rrefId), type_(type) {}
RRefMessageBase(const RRefId& rrefId, MessageType type, DeviceMap&& deviceMap)
Contributor

Not 100% sure, but it looks like only Script/PythonRRefFetchCall needs this device map. Does it make sense to add it only to those two messages?

Contributor Author

agree

# C++ (see python_rpc_handler.cpp).
class RRefProxy:
def __init__(self, rref, rpc_api, timeout=UNSET_RPC_TIMEOUT):
def __init__(self, rref, rpc_api, device_map, timeout=UNSET_RPC_TIMEOUT):
Contributor

Curious: why do we need device_map in RRefProxy, instead of letting the RPCs on this proxy (_invoke_rpc above) handle that?

pbelevich added a commit that referenced this pull request Sep 21, 2021
ghstack-source-id: 6605f8d
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 22, 2021
ghstack-source-id: e055ed6
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 22, 2021
ghstack-source-id: 22da032
Pull Request resolved: #64901
@pytorch-probot
pytorch-probot bot commented Sep 23, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/a018e7d46a5b31c2a6384f6557a5b9779e0703db/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflow | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-dynamic | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-code-analysis | ciflow/all, ciflow/linux, ciflow/mobile | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
```
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
```

For more information, please take a look at the CI Flow Wiki.

pbelevich added a commit that referenced this pull request Sep 23, 2021
ghstack-source-id: 2421139
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Sep 28, 2021
ghstack-source-id: 3d235ca
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Oct 5, 2021
ghstack-source-id: 0edf9d6
Pull Request resolved: #64901
pbelevich added a commit that referenced this pull request Oct 20, 2021
ghstack-source-id: 37a05b6
Pull Request resolved: #64901
@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 21, 2022
@cbalioglu cbalioglu removed request for cbalioglu and wayi1 June 9, 2022 16:29
@github-actions github-actions bot closed this Jul 9, 2022
@facebook-github-bot facebook-github-bot deleted the gh/pbelevich/155/head branch August 9, 2022 14:19

Labels

- cla signed
- oncall: distributed (Add this issue/PR to distributed oncall triage queue)
- oncall: jit (Add this issue/PR to JIT oncall triage queue)
- Stale
