KubeCluster fails occasionally because it falls back to port forwarding #401
Closed
Description
What happened:
I've been creating a KubeCluster with:

from dask_kubernetes import KubeCluster, make_pod_spec

worker_spec = make_pod_spec(
    ...
)
scheduler_spec = make_pod_spec(
    ...
)
KubeCluster(
    pod_template=worker_spec,
    scheduler_pod_template=scheduler_spec,
    namespace=...,
    idle_timeout=600,
)

The code above is executed in a pod running in a Kubernetes cluster. The pod uses a service account with the required permissions.
Most of the time, this works without any issues. But occasionally it fails with the following stack trace (sensitive parts omitted):
Traceback (most recent call last):
[...]
[...]
File [...] line 98, in _setup_dask_cluster
self.cluster = KubeCluster(
File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 466, in __init__
super().__init__(**self.kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 275, in __init__
self.sync(self._start)
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 220, in sync
return sync(self.loop, func, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 327, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 310, in f
result[0] = yield future
File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 595, in _start
await super()._start()
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 304, in _start
self.scheduler = await self.scheduler
File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
await self.start()
File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 200, in start
self.external_address = await get_external_address_for_scheduler_service(
File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 66, in get_external_address_for_scheduler_service
port = await port_forward_service(
File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 106, in port_forward_service
raise ConnectionError("kubectl port forward failed")
ConnectionError: kubectl port forward failed
Here's what I was able to conclude so far:
- Typically (when no failure occurs, which is most of the time), the service name resolves and startup continues without any problem.
- I suspect that occasionally the service name resolution fails and it falls back to connecting to the scheduler via port forwarding with kubectl, which is not available in the pod (and is not expected to be).
- I'm not 100% sure what causes the occasional failure, and I may be biased, but it seems to happen most often when resource demand in the cluster is high and the cluster either takes longer to start the scheduler pod, or raises an insufficient-node warning and scales up to new nodes (which in turn delays the scheduler pod start).
My guess
I'm guessing that this would be solved if there were some retry logic here, with a timeout or similar.
Does my reasoning make sense? Happy to make a PR but would need some guidance.
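To make the retry idea concrete, here is a rough sketch of what I have in mind. It is not the dask-kubernetes implementation: `retry_until_timeout` and `resolve_service` are hypothetical helpers I made up for illustration, and the default timeout and interval are arbitrary. The assumption is that the failing step is DNS resolution of the scheduler service name, which raises `socket.gaierror` (a subclass of `OSError`) until the service is ready.

```python
import asyncio
import time


async def retry_until_timeout(attempt, timeout=60.0, interval=1.0, retry_on=(OSError,)):
    """Await `attempt()` repeatedly until it succeeds or `timeout` seconds elapse.

    `attempt` is a zero-argument coroutine function (hypothetical helper,
    not part of dask-kubernetes or distributed).
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return await attempt()
        except retry_on:
            # Re-raise the last error once another sleep would pass the deadline.
            if time.monotonic() + interval > deadline:
                raise
            await asyncio.sleep(interval)


async def resolve_service(host, port):
    """Resolve a service DNS name without blocking the event loop.

    Raises socket.gaierror (an OSError) while the name is not resolvable.
    """
    loop = asyncio.get_running_loop()
    return await loop.getaddrinfo(host, port)


# Example usage (service name and port are placeholders):
# asyncio.run(
#     retry_until_timeout(
#         lambda: resolve_service("my-scheduler.my-namespace", 8786),
#         timeout=120.0,
#     )
# )
```

With something like this, the fallback to port forwarding would only kick in after the service has had a reasonable amount of time to become resolvable, instead of on the first failed attempt.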
Environment:
- Dask version: dask-kubernetes==2021.10.0, dask[complete]==2021.10.0
- Python version: 3.8.10
- Operating System: Ubuntu 16.04
- Install method (conda, pip, source): pip