DISABLED test_clean_context_during_backward (__main__.DistAutogradTestWithSpawn) #37765

@mrshenli

Description

Example failing CI job: https://app.circleci.com/pipelines/github/pytorch/pytorch/163274/workflows/1553b86b-c315-499f-9f69-a4ab3aadad3b/jobs/5327226/steps

May 02 07:52:38 ======================================================================
May 02 07:52:38 ERROR [1.222s]: test_clean_context_during_backward (__main__.DistAutogradTestWithSpawn)
May 02 07:52:38 ----------------------------------------------------------------------
May 02 07:52:38 Traceback (most recent call last):
May 02 07:52:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 184, in wrapper
May 02 07:52:38     self._join_processes(fn)
May 02 07:52:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in _join_processes
May 02 07:52:38     self._check_return_codes(elapsed_time)
May 02 07:52:38   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 319, in _check_return_codes
May 02 07:52:38     raise RuntimeError(error)
May 02 07:52:38 RuntimeError: Processes 3 exited with error code 10
May 02 07:52:38 
May 02 07:52:38 ----------------------------------------------------------------------
May 02 07:51:57   test_clean_context_during_backward (__main__.DistAutogradTestWithSpawn) ... ERROR:root:Caught exception: 
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57     fn()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57     return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1566, in test_clean_context_during_backward
May 02 07:51:57     t1 = rpc.rpc_sync(worker_name(dst), torch.add, args=(t1, t1))
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 76, in wrapper
May 02 07:51:57     return func(*args, **kwargs)
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/api.py", line 600, in rpc_sync
May 02 07:51:57     return fut.wait()
May 02 07:51:57 RuntimeError: Encountered exception in ProcessGroupAgent::enqueueSend: tensor does not have a device (device at /var/lib/jenkins/workspace/c10/core/TensorImpl.h:465)
May 02 07:51:57 frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7fb5d66e4fbc in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 02 07:51:57 frame #1: torch::distributed::rpc::wireSerialize[abi:cxx11](std::vector<char, std::allocator<char> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x644 (0x7fb5d18deff4 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so)
May 02 07:51:57 frame #2: torch::distributed::rpc::ProcessGroupAgent::handleSend(torch::distributed::rpc::SendWork const&) + 0x59 (0x7fb5d7619439 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 02 07:51:57 frame #3: <unknown function> + 0x994f81 (0x7fb5d7619f81 in /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
May 02 07:51:57 frame #4: c10::ThreadPool::main_loop(unsigned long) + 0x2b3 (0x7fb5d66d43b3 in /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so)
May 02 07:51:57 frame #5: <unknown function> + 0xc819d (0x7fb5d6bd919d in /opt/conda/lib/libstdc++.so.6)
May 02 07:51:57 frame #6: <unknown function> + 0x76ba (0x7fb5dfa2c6ba in /lib/x86_64-linux-gnu/libpthread.so.0)
May 02 07:51:57 frame #7: clone + 0x6d (0x7fb5df76241d in /lib/x86_64-linux-gnu/libc.so.6)
May 02 07:51:57  on node: 3
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception: 
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57     fn()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57     return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57     dist.barrier()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57     work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:429
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception: 
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57     fn()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57     return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57     dist.barrier()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57     work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:58551
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR:root:Caught exception: 
May 02 07:51:57 Traceback (most recent call last):
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 02 07:51:57     fn()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 02 07:51:57     return_value = old_test_method(self, *arg, **kwargs)
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/dist_autograd_test.py", line 1587, in test_clean_context_during_backward
May 02 07:51:57     dist.barrier()
May 02 07:51:57   File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1717, in barrier
May 02 07:51:57     work.wait()
May 02 07:51:57 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [172.17.0.2]:34229
May 02 07:51:57 exiting process with exit code: 10
May 02 07:51:57 ERROR (1.222s)
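
For reproduction context, the failing call in the traceback above is a plain rpc_sync that ships a tensor to a peer; serializing that tensor for the send (wireSerialize, called from ProcessGroupAgent::enqueueSend) raised "tensor does not have a device". Below is a minimal self-contained sketch of that call pattern. It is illustrative only: the worker count matches the 4-process test harness, but the port, tensor shape, and destination choice are assumptions, and the real test additionally tears down a distributed autograd context during backward, which this sketch omits.

```python
# Minimal sketch of the rpc_sync pattern from the traceback above.
# Illustrative only -- not the body of test_clean_context_during_backward.
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

WORLD_SIZE = 4  # the test harness spawns 4 processes


def worker_name(rank):
    # Mirrors the helper used in torch/testing/_internal/dist_utils.py.
    return f"worker{rank}"


def run(rank):
    # MASTER_ADDR/MASTER_PORT are required by init_rpc's rendezvous;
    # the port here is an arbitrary choice.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(worker_name(rank), rank=rank, world_size=WORLD_SIZE)

    t1 = torch.rand((3, 3), requires_grad=True)
    dst = (rank + 1) % WORLD_SIZE
    # The call that failed on node 3: the agent serializes t1 for the
    # send, and wireSerialize hit "tensor does not have a device".
    t1 = rpc.rpc_sync(worker_name(dst), torch.add, args=(t1, t1))

    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, nprocs=WORLD_SIZE)
```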

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar

Labels

high priority
module: flaky-tests (Problem is a flaky test in CI)
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
triage review
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
