IMPORTANT This CRD is no longer being actively maintained. We encourage you to look at TorchX to launch PyTorch jobs on Kubernetes. The two relevant sections are:
- TorchX-Kubernetes setup: https://pytorch.org/torchx/latest/schedulers/kubernetes.html
- TorchX dist.ddp builtin (uses torchelastic under the hood): https://pytorch.org/torchx/latest/components/distributed.html
TorchElastic Controller for Kubernetes manages the Kubernetes custom resource `ElasticJob` and makes it easy to run TorchElastic workloads on Kubernetes.
NOTE:
- (recommended) create a cluster with GPU instances, as some examples (e.g. imagenet) only work on GPU.
- If you provision instances with a single GPU, you will only be able to run a single worker per node.
- Our examples assume 1 GPU per node. If you want to run multiple workers per container, adjust `--nproc_per_node` to equal the number of CUDA devices on the instance you are using.
Here we provide instructions to create an Amazon EKS, Microsoft AKS, or Google GKE cluster. If you are not using AWS, Azure, or GCP, please refer to your cloud/infrastructure provider's manual to set up a Kubernetes cluster.

NOTE: EKS/AKS/GKE is not required to run this controller; you can use any other Kubernetes cluster.
Use `eksctl` to create an Amazon EKS cluster. This process takes ~15 minutes.

```shell
eksctl create cluster \
    --name=torchelastic \
    --node-type=p3.2xlarge \
    --region=us-west-2 \
    --version=1.15 \
    --ssh-access \
    --ssh-public-key=~/.ssh/id_rsa.pub \
    --nodes=2
```

Use `az aks` to create a Microsoft AKS cluster. This process takes ~4 minutes.
```shell
az aks create \
    --resource-group myResourceGroup \
    --name torchelastic \
    --node-vm-size Standard_NC6 \
    --node-count 3 \
    --location westus2 \
    --kubernetes-version 1.15 \
    --generate-ssh-keys
```

Use `gcloud` to create a Google GKE cluster. This process takes ~3 minutes.
```shell
gcloud container clusters create torchelastic \
    --accelerator type=nvidia-tesla-v100,count=1 \
    --machine-type=n1-standard-4 \
    --zone=us-west1-b \
    --cluster-version=1.15 \
    --num-nodes=2
```

Add `--preemptible` to use preemptible instances.

Deploy the following DaemonSet (the NVIDIA device plugin, which makes GPUs schedulable):

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
```

Then clone this repository and deploy the controller:

```shell
git clone https://github.com/pytorch/elastic.git
cd elastic/kubernetes
kubectl apply -k config/default
# or
# kustomize build config/default | kubectl apply -f -
```

You will see logs like the following:
```
namespace/elastic-job created
customresourcedefinition.apiextensions.k8s.io/elasticjobs.elastic.pytorch.org created
role.rbac.authorization.k8s.io/leader-election-role created
clusterrole.rbac.authorization.k8s.io/manager-role created
rolebinding.rbac.authorization.k8s.io/leader-election-rolebinding created
clusterrolebinding.rbac.authorization.k8s.io/elastic-job-k8s-controller-rolebinding created
deployment.apps/elastic-job-k8s-controller created
```

Verify that the `ElasticJob` custom resource is installed:
```shell
kubectl get crd
```

The output should include `elasticjobs.elastic.pytorch.org`:

```
NAME                                       CREATED AT
...
elasticjobs.elastic.pytorch.org            2020-03-18T07:40:53Z
...
```
Verify the controller is ready:

```shell
kubectl get pods -n elastic-job

NAME                                          READY   STATUS    RESTARTS   AGE
elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          15s
```

Check the controller logs:

```shell
kubectl logs -f elastic-job-k8s-controller-6d4884c75b-z22cm -n elastic-job
```

```
2020-03-19T10:13:43.532Z  INFO  controller-runtime.metrics  metrics server is starting to listen  {"addr": ":8080"}
2020-03-19T10:13:43.534Z  INFO  controller-runtime.controller  Starting EventSource  {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z  INFO  controller-runtime.controller  Starting EventSource  {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z  INFO  controller-runtime.controller  Starting EventSource  {"controller": "elasticjob", "source": "kind source: /, Kind="}
2020-03-19T10:13:43.534Z  INFO  setup  starting manager
2020-03-19T10:13:43.534Z  INFO  controller-runtime.manager  starting metrics server  {"path": "/metrics"}
2020-03-19T10:13:43.822Z  DEBUG  controller-runtime.manager.events  Normal  {"object": {"kind":"ConfigMap","namespace":"elastic-job","name":"controller-leader-election-helper","uid":"50269b8b-69ca-11ea-b995-0653198c16be","apiVersion":"v1","resourceVersion":"2107564"}, "reason": "LeaderElection", "message": "elastic-job-k8s-controller-6d4884c75b-z22cm_4cf549b7-3289-4285-8e64-647d067178bf became leader"}
2020-03-19T10:13:44.021Z  INFO  controller-runtime.controller  Starting Controller  {"controller": "elasticjob"}
2020-03-19T10:13:44.121Z  INFO  controller-runtime.controller  Starting workers  {"controller": "elasticjob", "worker count": 1}
```
Deploy an etcd server. This will expose a Kubernetes service `etcd-service` with port `2379`:

```shell
kubectl apply -f config/samples/etcd.yaml
```

Get the etcd server endpoint:

```shell
$ kubectl get svc -n elastic-job

NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
etcd-service   ClusterIP   10.100.104.168   <none>        2379/TCP   5m5s
```
Update `config/samples/imagenet.yaml`:

- set `rdzvEndpoint` (e.g. `10.100.104.168:2379`) to the etcd server you just provisioned.
- set `minReplicas` and `maxReplicas` to the desired min and max number of nodes (max should not exceed your cluster capacity).
- set `Worker.replicas` to the number of nodes to start with (you may modify this later to scale the job in/out).
- set the correct `--nproc_per_node` in `container.args` based on the instance you are running on.

NOTE: the `ENTRYPOINT` of `torchelastic/examples` is `python -m torchelastic.distributed.launch <args...>`. Notice that you do not have to specify certain `launch` options such as `--rdzv_endpoint` and `--rdzv_id`; these are set automatically by the controller.

IMPORTANT: a `Worker` in the context of Kubernetes refers to a `Node` in `torchelastic.distributed.launch`. Each Kubernetes `Worker` can run multiple trainer processes (a.k.a. `worker` in `torchelastic.distributed.launch`).
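Putting the settings above together, here is a sketch of the relevant fields in `imagenet.yaml`. The endpoint IP, replica counts, and image tag are illustrative values, and the `apiVersion`/field layout follow the sample spec shipped in `config/samples` at the time of writing; consult the sample file itself for the authoritative spec.

```yaml
apiVersion: elastic.pytorch.org/v1alpha1
kind: ElasticJob
metadata:
  name: imagenet
  namespace: elastic-job
spec:
  rdzvEndpoint: "10.100.104.168:2379"   # the etcd service you provisioned
  minReplicas: 1                        # lower bound for elastic scaling
  maxReplicas: 3                        # upper bound; must not exceed cluster capacity
  replicaSpecs:
    Worker:
      replicas: 2                       # number of nodes to start with
      template:
        spec:
          containers:
            - name: elasticjob-worker
              image: torchelastic/examples:0.2.0
              args:
                - "--nproc_per_node=1"  # set to the number of CUDA devices per node
```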
Submit the training job:

```shell
kubectl apply -f config/samples/imagenet.yaml
```

As you can see, the training pods and headless services have been created:

```shell
$ kubectl get pods -n elastic-job

NAME                                          READY   STATUS    RESTARTS   AGE
elastic-job-k8s-controller-6d4884c75b-z22cm   1/1     Running   0          11m
imagenet-worker-0                             1/1     Running   0          5s
imagenet-worker-1                             1/1     Running   0          5s
```
You can scale the number of nodes by adjusting `.spec.replicaSpecs[Worker].replicas` and applying the change:

```shell
kubectl apply -f config/samples/imagenet.yaml
```

NOTE: since you are scaling the containers, you will be scaling in increments of `nproc_per_node` trainers. In our case `--nproc_per_node=1`. For better performance, consider using an instance with multiple GPUs and setting `--nproc_per_node=$NUM_CUDA_DEVICES`.

WARNING: the name of the job is used as `rdzv_id`, which uniquely identifies a job run instance. Hence, to run multiple parallel jobs with the same spec, you need to change `.metadata.name` to give each one a unique run id (e.g. `imagenet-run-0`). Otherwise the new nodes will attempt to join the membership of a different run.
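For instance, to launch a second, parallel run of the same job spec, only the job name needs to change. This is a sketch: `imagenet-run-0` is an illustrative name, and Kubernetes object names must be valid DNS-1123 labels (lowercase alphanumerics and hyphens, no underscores).

```yaml
metadata:
  name: imagenet-run-0   # unique per run; the controller uses it as the rdzv_id
  namespace: elastic-job
```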
You can describe the job to check its status and related events. In the following example, the imagenet job was created in the elastic-job namespace; substitute your own job name and namespace in the command.
```shell
$ kubectl describe elasticjob imagenet -n elastic-job

Name:         imagenet
Namespace:    elastic-job
<... OMITTED ...>
Status:
  Conditions:
    Last Transition Time:  2020-03-19T10:30:55Z
    Last Update Time:      2020-03-19T10:30:55Z
    Message:               ElasticJob imagenet is running.
    Reason:                ElasticJobRunning
    Status:                True
    Type:                  Running
<... OMITTED ...>
Events:
  Type    Reason               Age   From                    Message
  ----    ------               ----  ----                    -------
  Normal  SuccessfulCreatePod  13s   elastic-job-controller  Created pod: imagenet-worker-0
```
Tail the logs of a worker:

```shell
kubectl logs -f -n elastic-job imagenet-worker-0
```
We have included other sample job specs in the `config/samples` directory (e.g. `config/samples/classy_vision.yaml`). Try them out by replacing `imagenet.yaml` with the appropriate spec filename in the instructions above.
To use your own script, build a docker image containing your script. You can use `torchelastic/examples` as your base image. Then point your job spec at your container by editing `Worker.template.spec.containers.image`.
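As a sketch, the relevant part of the job spec would then look like the following; `my-registry/my-trainer:latest` is a placeholder for your own image, and the container name follows the sample spec.

```yaml
replicaSpecs:
  Worker:
    template:
      spec:
        containers:
          - name: elasticjob-worker
            image: my-registry/my-trainer:latest   # placeholder for your own image
```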
Our examples save checkpoints and models inside the container, so the trained model and checkpoints are not accessible after the job completes. In your own scripts, write checkpoints and models to a persistent store such as AWS S3 or Azure Blob Storage.
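One hedged sketch of wiring this up, assuming your training script accepts a checkpoint destination flag (both the `--checkpoint-dir` flag name and the bucket path below are hypothetical and specific to your script):

```yaml
containers:
  - name: elasticjob-worker
    args:
      - "--nproc_per_node=1"
      - "--checkpoint-dir=s3://my-bucket/imagenet"   # hypothetical flag; point your script at durable storage
```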
Please check TROUBLESHOOTING.md