Implement DaskJob resource #483

@jacobtomlinson

Description

In Kubeflow there are many operators that allow users to run training jobs as a one-shot task. For example, Kubeflow provides a PyTorchJob CRD which creates a cluster of Pods to run a training task, runs them to completion, and then cleans up when it is done.

We should add something similar to the operator here so that folks will have familiar tools. We can reuse the DaskCluster CRD within the DaskJob (nesting this will be trivial thanks to the work by @samdyzon in #452).

I've thought about a few approaches (see alternatives in the details below) but this is my preferred option:

  • User creates DaskJob with nested Job spec and DaskCluster spec.
  • Operator creates Job resource that runs the client code and a DaskCluster resource that will be leveraged by the client code.
  • When the Job creates its Pod (this is done by the kubelet) the operator adopts the DaskCluster to the Pod so that it will be cascade deleted on completion of the Job.

This approach will only support non-parallel Jobs (a single client Pod, i.e. `parallelism` and `completions` of 1).
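To make the shape concrete, here is a hypothetical sketch of what a DaskJob manifest might look like, expressed as a Python dict. The field names under `spec` (`job`, `cluster`) and all values are illustrative assumptions, not a finalized schema:

```python
# Hypothetical DaskJob manifest: a nested Job pod spec for the client code
# plus a nested DaskCluster spec. Field names are assumptions for discussion,
# not a finalized CRD schema.
dask_job = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskJob",
    "metadata": {"name": "example-job"},
    "spec": {
        # Pod template for the one-shot client code, run via a Job resource.
        "job": {
            "spec": {
                "containers": [
                    {
                        "name": "client",
                        "image": "ghcr.io/dask/dask:latest",
                        "args": ["python", "/scripts/train.py"],
                    }
                ],
                "restartPolicy": "Never",
            }
        },
        # Nested DaskCluster spec, reusing the existing DaskCluster CRD.
        "cluster": {
            "spec": {
                "worker": {"replicas": 2},
            }
        },
    },
}
```

The operator would split this into a Job resource and a DaskCluster resource, both owned by the DaskJob.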

Alternative approaches that I have discounted

One way to implement this would be to reuse as many existing resources and behaviours as possible. But it would require the operator to have access to the Jobs API, and the client code would need to be resilient while waiting for the cluster to start.

  • User creates DaskJob with nested Job spec and DaskCluster spec.
  • Operator creates Job resource that runs the client code.
  • When the Job creates a Pod the operator creates a DaskCluster resource and adopts it to the Pod.
  • When the Job finishes and the Pod is removed the DaskCluster will be cascade deleted automatically.
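The cascade-delete step in both this variant and the preferred option rests on Kubernetes garbage collection: if the DaskCluster carries an `ownerReference` pointing at the Job's Pod, deleting the Pod deletes the cluster automatically. A minimal sketch of the adoption step as pure dict manipulation (in the real operator this patch would be applied via the Kubernetes API, or with a helper such as kopf's adoption support):

```python
def adopt(child: dict, owner: dict) -> dict:
    """Attach an ownerReference so `child` is cascade deleted with `owner`.

    Kubernetes garbage collection removes any object whose owners are all
    gone, so adopting the DaskCluster to the Job's Pod ties their lifetimes
    together.
    """
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
    }
    refs = child.setdefault("metadata", {}).setdefault("ownerReferences", [])
    refs.append(ref)
    return child


# Example: adopt a DaskCluster to the Pod created by the Job.
# Names and the uid below are made up for illustration.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-job-abc12", "uid": "1234-5678"},
}
cluster = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskCluster",
    "metadata": {"name": "example-cluster"},
}
adopt(cluster, pod)
```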

We could also create the DaskCluster and Job at the same time or the DaskCluster first then the Job.

Alternatively, we could reimplement some of what a Job does in the operator, which would give us a little more control over the startup order and require fewer permissions on the Kubernetes API.

  • User creates DaskJob with nested Pod spec and DaskCluster spec.
  • Operator creates the DaskCluster resource.
  • When the cluster is running the operator creates a Pod from the spec that runs the client code.
  • The operator polls the Pod to wait for it to finish.
  • When the Pod is done the Pod and DaskCluster resources are deleted.
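The polling step in this variant amounts to a small state machine over the Pod phase. A rough sketch of the decision logic, using the standard Kubernetes Pod phases (the surrounding reconcile loop and cleanup calls are assumed, not shown):

```python
def next_action(pod_phase: str) -> str:
    """Decide the operator's next step from the client Pod's phase.

    Pod phases are the standard Kubernetes values: Pending, Running,
    Succeeded, Failed, Unknown.
    """
    if pod_phase in ("Pending", "Running"):
        # Client code has not finished yet; poll again later.
        return "wait"
    if pod_phase in ("Succeeded", "Failed"):
        # Done either way: delete the Pod and the DaskCluster resources.
        return "cleanup"
    # Unknown phase (e.g. node unreachable): retry with backoff.
    return "requeue"
```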

We also probably want to support a nested autoscaler resource in the DaskJob, so this will need #451 to be merged first.
