DISABLED test_multi_rpc (__main__.RpcTestWithSpawn) #37795

@mrshenli

Another weird test error; this might be related to #37765.

https://app.circleci.com/pipelines/github/pytorch/pytorch/163815/workflows/2122858f-c503-4760-81a6-61452cff82c5/jobs/5338885/steps

May 04 21:29:26 ======================================================================
May 04 21:29:26 ERROR [0.816s]: test_multi_rpc (__main__.RpcTestWithSpawn)
May 04 21:29:26 ----------------------------------------------------------------------
May 04 21:29:26 Traceback (most recent call last):
May 04 21:29:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 184, in wrapper
May 04 21:29:26     self._join_processes(fn)
May 04 21:29:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 286, in _join_processes
May 04 21:29:26     self._check_return_codes(elapsed_time)
May 04 21:29:26   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 319, in _check_return_codes
May 04 21:29:26     raise RuntimeError(error)
May 04 21:29:26 RuntimeError: Processes 2 exited with error code 10
May 04 21:28:05   test_multi_rpc (__main__.RpcTestWithSpawn) ... ERROR:root:Caught exception: 
May 04 21:28:05 Traceback (most recent call last):
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 187, in wrapper
May 04 21:28:05     fn()
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/dist_utils.py", line 85, in new_test_method
May 04 21:28:05     return_value = old_test_method(self, *arg, **kwargs)
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 677, in test_multi_rpc
May 04 21:28:05     self.assertEqual(ret, torch.ones(n, n) * 2)
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 839, in assertEqual
May 04 21:28:05     (rtol, atol) = self.get_default_tolerance(x, y)
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 825, in get_default_tolerance
May 04 21:28:05     a_tol = self.get_default_tolerance(a)
May 04 21:28:05   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 811, in get_default_tolerance
May 04 21:28:05     dtype = a.dtype
May 04 21:28:05 RuntimeError: unsupported scalarType
May 04 21:28:05 exiting process with exit code: 10
May 04 21:28:05 Some process exited badly, terminating rest.
May 04 21:28:05 ERROR (0.816s)
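
For readers following the log: the first traceback comes from the parent test process, which only reports that a spawned child exited with code 10; the second comes from the child, where `assertEqual` hit `RuntimeError: unsupported scalarType`. Below is a minimal, stdlib-only sketch of that parent-side pattern. It is not the actual `common_distributed.py` code: `run_worker` and `check_return_codes` are hypothetical names, and treating exit code 10 as "the test body raised" is an assumption taken from the log line `exiting process with exit code: 10`.

```python
import subprocess
import sys

# Assumption: 10 is the exit code a child uses when its test body raised,
# as suggested by "exiting process with exit code: 10" in the log.
TEST_ERROR_EXIT_CODE = 10


def run_worker(fail):
    """Start a child interpreter standing in for one spawned test process."""
    exit_code = TEST_ERROR_EXIT_CODE if fail else 0
    code = f"import sys; sys.exit({exit_code})"
    return subprocess.Popen([sys.executable, "-c", code])


def check_return_codes(procs):
    """Wait for all children and raise if any exited with a non-zero code."""
    for proc in procs:
        proc.wait()
    failed = [(rank, p.returncode) for rank, p in enumerate(procs) if p.returncode != 0]
    if failed:
        ranks = " ".join(str(rank) for rank, _ in failed)
        # Message shape mimics the one in the log; the real format may differ.
        raise RuntimeError(f"Processes {ranks} exited with error code {failed[0][1]}")


if __name__ == "__main__":
    # Four workers, with rank 2 failing, as in the report.
    workers = [run_worker(fail=(rank == 2)) for rank in range(4)]
    try:
        check_return_codes(workers)
    except RuntimeError as exc:
        print(exc)
```

The point of the sketch is that the parent's `RuntimeError` carries no detail about the root cause; the child's own traceback (the `unsupported scalarType` error) is the one to debug.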

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @xush6528 @jjlilley @osalpekar

Labels: high priority, module: flaky-tests, module: rpc, triage review, triaged
