KubeCluster fails occasionally because it falls back to port forwarding #401

@erdnaavlis

Description

What happened:

I've been creating KubeCluster with:

worker_spec = make_pod_spec(
    ...
)

scheduler_spec = make_pod_spec(
    ...
)

KubeCluster(
    pod_template=worker_spec,
    scheduler_pod_template=scheduler_spec,
    namespace=...,
    idle_timeout=600,
)

The code above is executed in a pod running in a Kubernetes cluster. The pod uses a service account with the required permissions.

Most of the time this works without any issues, but occasionally it fails with the following stack trace (sensitive parts omitted):

Traceback (most recent call last):
  [...]
  [...]
  File [...] line 98, in _setup_dask_cluster
    self.cluster = KubeCluster(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 466, in __init__
    super().__init__(**self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 275, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 220, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 327, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 310, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 595, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 304, in _start
    self.scheduler = await self.scheduler
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 200, in start
    self.external_address = await get_external_address_for_scheduler_service(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 66, in get_external_address_for_scheduler_service
    port = await port_forward_service(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 106, in port_forward_service
    raise ConnectionError("kubectl port forward failed")
ConnectionError: kubectl port forward failed

Here's what I've been able to conclude so far:

  • Typically (when no failure occurs, which is most of the time) it resolves the service name and continues without any problem.
  • I suspect the service resolution occasionally fails and it falls back to trying to connect to the scheduler via port forwarding using kubectl, which is not available (and is expected not to be available).
  • I'm not 100% sure what causes the occasional failure and I may be biased, but it seems to happen most often when there is higher resource demand in the cluster and it either takes longer to start the scheduler pod, or raises an insufficient-node warning and scales out to new nodes (which in turn makes the scheduler pod take longer to start).
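One way to sanity-check the service-resolution theory from inside the pod would be a small DNS probe with retries, along these lines (the hostname shown in the docstring is a hypothetical example of a scheduler Service name, not taken from my setup):

```python
import socket
import time


def can_resolve(host: str, retries: int = 5, delay: float = 2.0) -> bool:
    """Return True if `host` eventually resolves, retrying on failure.

    `host` would be something like
    "dask-scheduler.my-namespace.svc.cluster.local" (hypothetical name).
    """
    for _ in range(retries):
        try:
            # getaddrinfo performs the same DNS lookup a client connect would
            socket.getaddrinfo(host, 8786)
            return True
        except socket.gaierror:
            time.sleep(delay)
    return False
```

If this intermittently returns False around cluster scale-up events, that would support the theory that the fallback to port forwarding is triggered by transient DNS failures rather than by anything kubectl-related.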

My guess

I'm guessing that if there were some retry logic here, with a timeout or similar, this would be solved.
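As a rough sketch of what I mean (this is not the dask-kubernetes API, just a generic async retry wrapper that could conceptually wrap the service-resolution step before falling back to port forwarding):

```python
import asyncio


async def with_retries(coro_factory, attempts: int = 5, delay: float = 1.0):
    """Retry an async operation a few times before giving up.

    `coro_factory` is a zero-argument callable returning a fresh coroutine,
    e.g. lambda: resolve_scheduler_address(...)  # hypothetical call
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return await coro_factory()
        except ConnectionError as exc:
            # remember the failure and wait before the next attempt
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```

The idea being that a few seconds of retries would ride out the window where the scheduler pod (or its Service's DNS record) isn't ready yet, instead of immediately falling back to kubectl port forwarding.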

Does my reasoning make sense? Happy to make a PR but would need some guidance.

Environment:

  • Dask version: dask-kubernetes==2021.10.0, dask[complete]==2021.10.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 16.04
  • Install method (conda, pip, source): pip
