We recently encountered an issue on binderhub where a dask pod failed to terminate, resulting in a node running for hours:
[ec2-user@ip-192-168-60-131 ~]$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
binder-staging autohttps-77465ddf6f-8mgcn 2/2 Running 0 9d
binder-staging binder-58b4fbccd5-7r2qz 1/1 Running 0 9d
binder-staging binder-staging-dind-fppcs 1/1 Running 0 9d
binder-staging binder-staging-image-cleaner-lmg2n 2/2 Running 0 9d
binder-staging binder-staging-kube-lego-84c9c589cf-zwjzc 1/1 Running 0 9d
binder-staging binder-staging-nginx-ingress-controller-84464555db-psrdm 1/1 Running 0 9d
binder-staging binder-staging-nginx-ingress-controller-84464555db-st8np 1/1 Running 0 9d
binder-staging binder-staging-nginx-ingress-default-backend-6cd8b44c86-scn29 1/1 Running 0 9d
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a2v8wz 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a6sfbh 1/1 Running 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a8s7x5 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a9csxq 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-ajsm8h 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-aldrmw 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-avv8qw 0/1 Completed 0 3h24m
binder-staging hub-57b965856b-9nfc8 1/1 Running 3 9d
binder-staging proxy-67f46bb5d-9khxh 1/1 Running 0 9d
binder-staging user-scheduler-6589468f65-5ntcq 1/1 Running 0 9d
binder-staging user-scheduler-6589468f65-ml4dt 1/1 Running 0 9d
kube-system aws-node-mv8p4 1/1 Running 0 9d
kube-system aws-node-w2hg5 1/1 Running 0 3h23m
kube-system cluster-autoscaler-78fb96cfd5-hp2pd 1/1 Running 0 9d
kube-system coredns-74d48d5d5b-fgnk5 1/1 Running 0 9d
kube-system coredns-74d48d5d5b-gqqlr 1/1 Running 0 9d
kube-system kube-proxy-k4nfj 1/1 Running 0 3h23m
kube-system kube-proxy-sh5dl 1/1 Running 0 9d
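The stuck workers can be picked out of a listing like this mechanically. A minimal sketch (plain string filtering over `kubectl get pods -A` output pasted above — not a kubectl feature, and the helper name is made up for illustration):

```python
# Sketch: filter Completed dask worker pods from `kubectl get pods -A` output.
# The sample lines are taken from the listing above; in practice the text
# would come from running kubectl via subprocess.

sample = """\
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a2v8wz 0/1 Completed 0 3h24m
binder-staging dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a6sfbh 1/1 Running 0 3h24m
binder-staging hub-57b965856b-9nfc8 1/1 Running 3 9d
"""

def completed_dask_pods(listing: str) -> list:
    """Return names of dask-* pods whose STATUS column reads Completed."""
    pods = []
    for line in listing.splitlines():
        fields = line.split()
        # columns: NAMESPACE NAME READY STATUS RESTARTS AGE
        if len(fields) >= 4 and fields[1].startswith("dask-") and fields[3] == "Completed":
            pods.append(fields[1])
    return pods

print(completed_dask_pods(sample))
# ['dask-cgentemann-osm2020tutorial-9fm90iqn-d58bc39d-a2v8wz']
```

The names returned could then be fed to `kubectl delete pod -n binder-staging` to clean up by hand while the underlying bug is investigated.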
The pod listed as Running had a log showing an infinite loop of CommClosedErrors:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 867, in <lambda>
lambda: self.batched_stream.send({"op": "keep-alive"}),
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/batched.py", line 117, in send
raise CommClosedError
distributed.comm.core.CommClosedError
tornado.application - ERROR - Exception in callback <function Worker._register_with_scheduler.<locals>.<lambda> at 0x7f972cbad7a0>
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 867, in <lambda>
lambda: self.batched_stream.send({"op": "keep-alive"}),
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/batched.py", line 117, in send
raise CommClosedError
distributed.comm.core.CommClosedError

And the pods listed as Completed had the following traceback in their logs:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 1941, in gather_dep
self.rpc, deps, worker, who=self.address
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 3195, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 3182, in _get_data
max_connections=max_connections,
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7fe06bd55590>>, <Task finished coro=<Worker.heartbeat() done, defined at /srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py:881> exception=CommClosedError('in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer')>)
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 743, in _run_callback
ret = callback()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/ioloop.py", line 767, in _discard_future_result
future.result()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 918, in heartbeat
raise e
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/worker.py", line 891, in heartbeat
metrics=await self.get_metrics(),
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 391, in retry_operation
operation=operation,
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils_comm.py", line 379, in retry
return await coro()
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 757, in send_recv_from_rpc
result = await send_recv(comm=comm, op=key, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/core.py", line 540, in send_recv
response = await comm.read(deserializers=deserializers)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 208, in read
convert_stream_closed_error(self, e)
File "/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error
raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc))
distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

Pinging @jhamman - not sure this has come up on the binderhub running on GCE.
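The endlessly repeating log from the Running pod is consistent with the keep-alive PeriodicCallback continuing to fire after the worker's comm to the scheduler has already closed, so every tick raises and logs CommClosedError without ever stopping the worker. A minimal sketch of that failure mode (mock classes, not distributed's actual implementation):

```python
# Sketch of the suspected failure mode: a periodic keep-alive callback
# keeps firing against a closed stream, raising CommClosedError forever,
# but nothing tears the worker down, so the pod never terminates.

class CommClosedError(Exception):
    pass

class BatchedStream:
    """Stand-in for distributed's BatchedSend after the connection reset."""
    def __init__(self):
        self.closed = True  # scheduler connection was reset by peer

    def send(self, msg):
        if self.closed:
            raise CommClosedError

def keep_alive_tick(stream, log):
    # stands in for tornado's PeriodicCallback invoking
    # lambda: self.batched_stream.send({"op": "keep-alive"})
    try:
        stream.send({"op": "keep-alive"})
    except CommClosedError:
        log.append("CommClosedError")  # the exception is logged, not fatal

stream = BatchedStream()
log = []
for _ in range(3):  # the real callback fires indefinitely
    keep_alive_tick(stream, log)
print(log)
# ['CommClosedError', 'CommClosedError', 'CommClosedError']
```

If that reading is right, a worker-side timeout that exits the process when the scheduler is unreachable (e.g. the worker `death-timeout` option) might be a mitigation, though I haven't verified that here.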
Versions:
dask 2.10.1 py_0 conda-forge
dask-core 2.10.1 py_0 conda-forge
dask-gateway 0.6.1 py37_0 conda-forge
dask-kubernetes 0.10.1 py_0 conda-forge
dask-labextension 1.1.0 py_0 conda-forge
distributed 2.10.0 py_0 conda-forge