May 02 07:52:38 ======================================================================
May 02 07:52:38 ERROR [1.222s]: test_clean_context_during_backward (__main__.DistAutogradTestWithSpawn)
May 02 07:52:38 ----------------------------------------------------------------------
May 02 07:52:38 Traceback (most recent call last):
May 02 07:52:38 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 184, in wrapper
May 02 07:52:38 self._join_processes(fn)
May 02 07:52:38 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in _join_processes
May 02 07:52:38 self._check_return_codes(elapsed_time)
May 02 07:52:38 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 319, in _check_return_codes
May 02 07:52:38 raise RuntimeError(error)
May 02 07:52:38 RuntimeError: Processes 3 exited with error code 10
May 02 07:52:38
May 02 07:52:38 ----------------------------------------------------------------------
May 02 07:51:57 test_clean_context_during_backward (__main__.DistAutogradTestWithSpawn) ... ERROR:root:Caught exception:
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57 fn()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57 return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1566, in test_clean_context_during_backward
May 02 07:51:57 t1 = rpc.rpc_sync(worker_name(dst), torch.add, args=(t1, t1))
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 76, in wrapper
May 02 07:51:57 return func(*args, **kwargs)
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 600, in rpc_sync
May 02 07:51:57 return fut.wait()
May 02 07:51:57 RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: tensor does not have a device (device at /var/lib/jenkins/workspace/c10/core/TensorImpl.h:465)
May 02 07:51:57 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fb5d66e4fbc in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 02 07:51:57 frame #1: torch::distributed::rpc::wireSerialize[abi:cxx11](std::vector<char, std::allocator<char> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x644 (0x7fb5d18deff4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 02 07:51:57 frame #2: torch::distributed::rpc::ProcessGroupAgent::handleSend(torch::distributed::rpc::SendWork const&) + 0x59 (0x7fb5d7619439 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 02 07:51:57 frame #3: <unknown function> + 0x994f81 (0x7fb5d7619f81 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 02 07:51:57 frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x2b3 (0x7fb5d66d43b3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 02 07:51:57 frame #5: <unknown function> + 0xc819d (0x7fb5d6bd919d in /opt/conda/lib/libstdc++.so.6)
May 02 07:51:57 frame #6: <unknown function> + 0x76ba (0x7fb5dfa2c6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
May 02 07:51:57 frame #7: clone + 0x6d (0x7fb5df76241d in /lib/x86_64-linux-gnu/libc.so.6)
May 02 07:51:57 on node: 3
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception:
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57 fn()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57 return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57 dist.barrier()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57 work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:429
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception:
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57 fn()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57 return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57 dist.barrier()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57 work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:58551
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception:
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57 fn()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57 return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57 dist.barrier()
May 02 07:51:57 File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57 work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:34229
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR (1.222s)
https://app.circleci.com/pipelines/github/pytorch/pytorch/163274/workflows/1553b86b-c315-499f-9f69-a4ab3aadad3b/jobs/5327226/steps
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar
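The three `Connection closed by peer` tracebacks above are secondary failures: node 3 crashes first inside `ProcessGroupAgent::enqueueSend` with `tensor does not have a device`, and the remaining workers, blocked in `dist.barrier()`, then see their TCP connection to the dead peer drop. A minimal stdlib sketch (hypothetical, no PyTorch involved) of why a peer blocked on a socket read observes this kind of error when the other side exits abruptly:

```python
# Sketch of the failure cascade: one "worker" exits without sending anything,
# and its peer's blocking recv() sees EOF -- analogous to gloo reporting
# "Connection closed by peer" from a rank stuck in dist.barrier().
import socket
import threading

def crashing_worker(port):
    # Connects to the peer, then drops the connection immediately,
    # mimicking node 3 dying mid-collective.
    s = socket.create_connection(("127.0.0.1", port))
    s.close()

# The surviving "worker": listens, accepts, and blocks on recv(),
# much like work.wait() inside barrier().
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

t = threading.Thread(target=crashing_worker, args=(port,))
t.start()
conn, _ = srv.accept()
data = conn.recv(1024)  # returns b"" once the peer has closed the connection
t.join()
conn.close()
srv.close()

assert data == b""  # EOF from the dead peer, not real payload
```

This is only an illustration of the transport-level symptom; the actual bug to chase is the primary `tensor does not have a device` error raised in `wireSerialize` on node 3.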