KubeCluster fails occasionally because it falls back to port forwarding #401

@erdnaavlis

Description

What happened:

I've been creating KubeCluster with:

worker_spec = make_pod_spec(
    ...
)

scheduler_spec = make_pod_spec(
    ...
)

KubeCluster(
    pod_template=worker_spec,
    scheduler_pod_template=scheduler_spec,
    namespace=...,
    idle_timeout=600,
)

The code above is executed in a pod running in a Kubernetes cluster. The pod uses a service account with the required permissions.

Most of the time this works without any issues, but occasionally it fails with the following stack trace (sensitive parts omitted):

Traceback (most recent call last):
  [...]
  [...]
  File [...] line 98, in _setup_dask_cluster
    self.cluster = KubeCluster(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 466, in __init__
    super().__init__(**self.kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 275, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 220, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 327, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 310, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 595, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 304, in _start
    self.scheduler = await self.scheduler
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 59, in _
    await self.start()
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/core.py", line 200, in start
    self.external_address = await get_external_address_for_scheduler_service(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 66, in get_external_address_for_scheduler_service
    port = await port_forward_service(
  File "/usr/local/lib/python3.8/site-packages/dask_kubernetes/utils.py", line 106, in port_forward_service
    raise ConnectionError("kubectl port forward failed")
ConnectionError: kubectl port forward failed

Here's what I've been able to conclude so far:

  • Typically (when no failure occurs, which is most of the time) it resolves the service name and continues without any problem.
  • I suspect the service resolution occasionally fails and it falls back to trying to connect to the scheduler via port forwarding using kubectl, which is not available (and is expected not to be available).
  • I'm not 100% sure what causes the occasional failure and I may be biased, but it seems to happen most often when there is higher resource demand in the cluster and it either takes longer to start the scheduler pod, or raises an insufficient-node warning and scales out to new nodes (which in turn makes the scheduler pod take longer to start).
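One way to sanity-check the service-resolution theory from inside the pod would be a small DNS probe with retries, along these lines (the hostname shown in the docstring is a hypothetical example of a scheduler Service name, not taken from my setup):

```python
import socket
import time


def can_resolve(host: str, retries: int = 5, delay: float = 2.0) -> bool:
    """Return True if `host` eventually resolves, retrying on failure.

    `host` would be something like
    "dask-scheduler.my-namespace.svc.cluster.local" (hypothetical name).
    """
    for _ in range(retries):
        try:
            # getaddrinfo performs the same DNS lookup a client connect would
            socket.getaddrinfo(host, 8786)
            return True
        except socket.gaierror:
            time.sleep(delay)
    return False
```

If this intermittently returns False around cluster scale-up events, that would support the theory that the fallback to port forwarding is triggered by transient DNS failures rather than by anything kubectl-related.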

My guess

I'm guessing that if there were some retry logic here, with a timeout or similar, this would be solved.
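As a rough sketch of what I mean (this is not the dask-kubernetes API, just a generic async retry wrapper that could conceptually wrap the service-resolution step before falling back to port forwarding):

```python
import asyncio


async def with_retries(coro_factory, attempts: int = 5, delay: float = 1.0):
    """Retry an async operation a few times before giving up.

    `coro_factory` is a zero-argument callable returning a fresh coroutine,
    e.g. lambda: resolve_scheduler_address(...)  # hypothetical call
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return await coro_factory()
        except ConnectionError as exc:
            # remember the failure and wait before the next attempt
            last_exc = exc
            await asyncio.sleep(delay)
    raise last_exc
```

The idea being that a few seconds of retries would ride out the window where the scheduler pod (or its Service's DNS record) isn't ready yet, instead of immediately falling back to kubectl port forwarding.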

Does my reasoning make sense? Happy to make a PR but would need some guidance.

Environment:

  • Dask version: dask-kubernetes==2021.10.0, dask[complete]==2021.10.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 16.04
  • Install method (conda, pip, source): pip
