Implement DaskJob resource #483
Description
In Kubeflow there are many operators that allow users to run training jobs as one-shot tasks. For example, PyTorch has a PyTorchJob CRD which creates a cluster of pods to run a training task, runs them to completion, and then cleans up when done.
We should add something similar to the operator here so that folks will have familiar tools. We can reuse the DaskCluster CRD within the DaskJob (nesting this will be trivial thanks to the work by @samdyzon in #452).
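To make the nesting concrete, here is a sketch of what a DaskJob resource might look like, written as a Python dict in the shape of the YAML a user would apply. The `job` and `cluster` field names (and the worker spec) are illustrative assumptions, not a finalized schema:

```python
# Hypothetical DaskJob custom resource nesting a Job spec and a
# DaskCluster spec. Field names under "spec" are assumptions for
# illustration only.
dask_job = {
    "apiVersion": "kubernetes.dask.org/v1",
    "kind": "DaskJob",
    "metadata": {"name": "example-job"},
    "spec": {
        # Nested Job spec: runs the user's client code to completion.
        "job": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "client",
                            "image": "ghcr.io/dask/dask:latest",
                            "args": ["python", "/scripts/run.py"],
                        }
                    ],
                    "restartPolicy": "Never",
                }
            }
        },
        # Nested DaskCluster spec: the cluster the client code uses.
        "cluster": {
            "worker": {"replicas": 3},
        },
    },
}
```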
I've thought about a few approaches (see alternatives in the details below) but this is my preferred option:
- User creates a `DaskJob` with a nested `Job` spec and a `DaskCluster` spec.
- Operator creates a `Job` resource that runs the client code and a `DaskCluster` resource that will be leveraged by the client code.
- When the `Job` creates its `Pod` (this is done by the Job controller) the operator adopts the `DaskCluster` to the `Pod` so that it will be cascade deleted on completion of the `Job`.
This approach will only support non-parallel jobs.
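The adoption step above boils down to patching an `ownerReferences` entry onto the DaskCluster that points at the Job's Pod, so Kubernetes garbage collection cascade-deletes the cluster when the Pod goes away. A minimal sketch of building that patch (pure data; a real operator would apply it through the Kubernetes API):

```python
def adopt(child: dict, owner: dict) -> dict:
    """Return a metadata patch that makes `owner` (e.g. the Job's Pod)
    the controlling owner of `child` (e.g. the DaskCluster), so that
    deleting the owner cascade-deletes the child."""
    return {
        "metadata": {
            "ownerReferences": [
                {
                    "apiVersion": owner["apiVersion"],
                    "kind": owner["kind"],
                    "name": owner["metadata"]["name"],
                    "uid": owner["metadata"]["uid"],
                    # The garbage collector deletes the child when this
                    # controlling owner is deleted.
                    "controller": True,
                    "blockOwnerDeletion": True,
                }
            ]
        }
    }

# Example: adopt a DaskCluster to the Job's Pod (names/uid made up).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "example-job-abc12", "uid": "1234-5678"},
}
patch = adopt({"kind": "DaskCluster"}, pod)
```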
Alternative approaches that I have discounted
One way to implement this would be to reuse as many existing resources and behaviours as possible. But it would require access to the Jobs API and for the client code to be resilient to waiting for the cluster to start.
- User creates a `DaskJob` with a nested `Job` spec and a `DaskCluster` spec.
- Operator creates a `Job` resource that runs the client code.
- When the `Job` creates a `Pod` the operator creates a `DaskCluster` resource and adopts it to the `Pod`.
- When the `Job` finishes and the `Pod` is removed the `DaskCluster` will be cascade deleted automatically.
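The resilience requirement in this alternative amounts to the client code retrying until the scheduler is reachable. A generic stdlib-only retry sketch (in practice `connect` would construct something like a `distributed.Client`; the helper name and parameters are my own):

```python
import time


def wait_for(connect, attempts=30, delay=2.0):
    """Call `connect` until it succeeds or attempts are exhausted.

    In a DaskJob's client container, `connect` would try to reach the
    scheduler; here it is any callable that raises until the cluster
    is up, then returns a connection.
    """
    last_exc = None
    for _ in range(attempts):
        try:
            return connect()
        except Exception as exc:  # scheduler not ready yet
            last_exc = exc
            time.sleep(delay)
    raise TimeoutError("cluster never became ready") from last_exc
```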
We could also create the DaskCluster and Job at the same time or the DaskCluster first then the Job.
Alternatively, we could reimplement some of what a Job does in the operator, which would give us a little more control over the startup order and require less of the API.
- User creates a `DaskJob` with a nested `Pod` spec and a `DaskCluster` spec.
- Operator creates the `DaskCluster` resource.
- When the cluster is running the operator creates a `Pod` from the spec that runs the client code.
- The operator polls the `Pod` to wait for it to finish.
- When the `Pod` is done the `Pod` and `DaskCluster` resources are deleted.
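The polling step in this variant can be sketched as a loop over the Pod's phase. The status getter below stands in for a real Kubernetes API read of `status.phase`; the function name and timeouts are illustrative:

```python
import time

# A Pod's status.phase is terminal once it is Succeeded or Failed.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_pod_completion(get_phase, interval=5.0, timeout=3600.0):
    """Poll `get_phase` (a stand-in for reading the Pod's status.phase
    from the Kubernetes API) until the Pod reaches a terminal phase,
    then return that phase so the operator can delete the Pod and
    DaskCluster resources."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            return phase
        time.sleep(interval)
    raise TimeoutError("Pod did not finish before timeout")
```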
We probably want to support a nested autoscaler resource in the DaskJob too, so we will need #451 to be merged first.