distributed.comm.core.CommClosedError loop failing to Terminate worker pods #3488

@scottyhq

Description

We recently encountered an issue on binderhub where a dask pod failed to terminate, resulting in a node running for hours:

[ec2-user@ip-192-168-60-131 ~]$ kubectl get pods -A
NAMESPACE        NAME                                                            READY   STATUS      RESTARTS   AGE
binder-staging   autohttps-77465ddf6f-8mgcn                                      2/2     Running     0          9d
binder-staging   binder-58b4fbccd5-7r2qz                                         1/1     Running     0          9d
binder-staging   binder-staging-dind-fppcs                                       1/1     Running     0          9d
binder-staging   binder-staging-image-cleaner-lmg2n                              2/2     Running     0          9d
binder-staging   binder-staging-kube-lego-84c9c589cf-zwjzc                       1/1     Running     0          9d
binder-staging   binder-staging-nginx-ingress-controller-84464555db-psrdm        1/1     Running     0          9d
binder-staging   binder-staging-nginx-ingress-controller-84464555db-st8np        1/1     Running     0          9d
binder-staging   binder-staging-nginx-ingress-default-backend-6cd8b44c86-scn29   1/1     Running     0          9d
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a2v8wz        0/1     Completed   0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a6sfbh        1/1     Running     0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a8s7x5        0/1     Completed   0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a9csxq        0/1     Completed   0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-ajsm8h        0/1     Completed   0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-aldrmw        0/1     Completed   0          3h24m
binder-staging   dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-avv8qw        0/1     Completed   0          3h24m
binder-staging   hub-57b965856b-9nfc8                                            1/1     Running     3          9d
binder-staging   proxy-67f46bb5d-9khxh                                           1/1     Running     0          9d
binder-staging   user-scheduler-6589468f65-5ntcq                                 1/1     Running     0          9d
binder-staging   user-scheduler-6589468f65-ml4dt                                 1/1     Running     0          9d
kube-system      aws-node-mv8p4                                                  1/1     Running     0          9d
kube-system      aws-node-w2hg5                                                  1/1     Running     0          3h23m
kube-system      cluster-autoscaler-78fb96cfd5-hp2pd                             1/1     Running     0          9d
kube-system      coredns-74d48d5d5b-fgnk5                                        1/1     Running     0          9d
kube-system      coredns-74d48d5d5b-gqqlr                                        1/1     Running     0          9d
kube-system      kube-proxy-k4nfj                                                1/1     Running     0          3h23m
kube-system      kube-proxy-sh5dl                                                1/1     Running     0          9d

The pod listed as Running had a log showing an infinite loop of CommClosedError exceptions:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 867, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/batched.py", line 117, in send
    raise CommClosedError
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback <function Worker._register_with_scheduler.<locals>.<lambda> at 0x7f972cbad7a0>
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 867, in <lambda>
    lambda: self.batched_stream.send({"op": "keep-alive"}),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/batched.py", line 117, in send
    raise CommClosedError
distributed.comm.core.CommClosedError
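To illustrate the failure mode: the keep-alive callback fires on a timer, `BatchedSend.send` raises `CommClosedError` because the comm to the scheduler is already closed, tornado logs the exception, and the timer simply fires again, forever. A minimal sketch of the loop and of a guard that would break it (the class and function names here are hypothetical stand-ins, not the actual distributed API):

```python
# Hypothetical sketch of the stuck keep-alive loop. CommClosedError and
# BatchedStream are stand-ins mirroring the traceback above, not the
# real distributed classes.

class CommClosedError(Exception):
    """Stand-in for distributed.comm.core.CommClosedError."""

class BatchedStream:
    def __init__(self):
        # The comm to the scheduler has already gone away.
        self.closed = True

    def send(self, msg):
        if self.closed:
            raise CommClosedError

def run_keep_alive_loop(stream, max_failures=3):
    """Fire the keep-alive callback repeatedly, as tornado's
    PeriodicCallback would, but give up after repeated comm failures
    instead of looping forever."""
    failures = 0
    while True:
        try:
            stream.send({"op": "keep-alive"})
            failures = 0  # reset on any successful send
        except CommClosedError:
            failures += 1
            if failures >= max_failures:
                # Comm is gone for good: stop the loop so the worker
                # can shut down and the pod can terminate.
                return failures
```

In the real worker, tornado's `PeriodicCallback` catches and logs the exception from the callback, then keeps rescheduling it, which is consistent with the repeating traceback above; a cap like the one sketched here would instead let the worker exit.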

And the pods listed as Completed had the following traceback in their logs:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 1941, in gather_dep
    self.rpc, deps, worker, who=self.address
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 3195, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
    operation=operation,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
    return await coro()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 3182, in _get_data
    max_connections=max_connections,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
    convert_stream_closed_error(self, e)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
    raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fe06bd55590>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:881> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
    n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
    ret = callback()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
    future.result()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 918, in heartbeat
    raise e
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 891, in heartbeat
    metrics=await self.get_metrics(),
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
    operation=operation,
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
    return await coro()
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
    convert_stream_closed_error(self, e)
  File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
    raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

pinging @jhamman - I'm not sure whether this has come up on the BinderHub running on GCE

Versions:

dask                      2.10.1                     py_0    conda-forge
dask-core                 2.10.1                     py_0    conda-forge
dask-gateway              0.6.1                    py37_0    conda-forge
dask-kubernetes           0.10.1                     py_0    conda-forge
dask-labextension         1.1.0                      py_0    conda-forge
distributed               2.10.0                     py_0    conda-forge
