To install Grove, you can choose one of the following options:
- Install Grove from the published Helm charts under the GitHub packages section.
- Build from source and install Grove using the make targets we provide as a part of the repository.
You can directly install Grove in your cluster using the published grove-charts Helm packages.
Locate the release tag to install.
Set the KUBECONFIG in your shell session, and run the following:
helm upgrade -i grove oci://ghcr.io/ai-dynamo/grove/grove-charts --version <version>
You can build and deploy Grove to your local kind cluster or remote cluster using the provided make targets in the repository.
All Grove operator make targets are located in the Operator Makefile.
In case you wish to develop Grove using a local kind cluster or are following along with our tutorials on your local machine, please do the following:
- Navigate to the operator directory:
cd operator
- Set up a KIND cluster with local docker registry:
make kind-up
- Optional: To create a KIND cluster with fake nodes for testing at scale, specify the number of fake nodes:
# Create a cluster with 20 fake nodes
make kind-up FAKE_NODES=20
This will automatically install KWOK (Kubernetes WithOut Kubelet) and create the specified number of fake nodes. These fake nodes are tainted with fake-node=true:NoSchedule, so you'll need to add the following toleration to your pod specs to schedule on them:
  tolerations:
    - key: fake-node
      operator: Exists
      effect: NoSchedule
- Specify the KUBECONFIG environment variable in your shell session to the path printed out at the end of the previous step:
# You would see something like `export KUBECONFIG=/path-to-your-grove-clone/grove/operator/hack/kind/kubeconfig` printed.
# If you are already in `/path-to-your-grove-clone/grove/operator`, then you can simply:
export KUBECONFIG=./hack/kind/kubeconfig
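The toleration from the optional fake-node step is easier to read in the context of a complete manifest. The following is a minimal sketch only: the helper name `print_fake_node_pod`, the pod name `fake-node-demo`, and the container image are illustrative assumptions, not values taken from the Grove repository.

```shell
# print_fake_node_pod (hypothetical helper): prints a Pod manifest that
# tolerates the fake-node=true:NoSchedule taint described above.
# The pod name and image are placeholders chosen for this example.
print_fake_node_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fake-node-demo
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
  tolerations:
    - key: fake-node
      operator: Exists
      effect: NoSchedule
EOF
}
```

With KUBECONFIG exported as in the last step, `print_fake_node_pod | kubectl apply -f -` would submit the pod. The toleration only permits scheduling on the fake nodes; it does not force the pod onto them.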
If you wish to use your own Kubernetes cluster instead of the local KIND cluster, follow these steps:
- Set the KUBECONFIG environment variable to point to your Kubernetes cluster configuration:
# Set KUBECONFIG to use your Kubernetes cluster kubeconfig
export KUBECONFIG=/path/to/your/kubernetes/kubeconfig
- Set the CONTAINER_REGISTRY environment variable to specify your container registry:
# Set a container registry to push your images to
export CONTAINER_REGISTRY=your-container-registry
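Before running any make target against a remote cluster, it can help to confirm that both variables from the steps above are actually set. The sketch below assumes a helper named `check_remote_env` (hypothetical, not part of the repository) and assumes KUBECONFIG holds a single file path rather than a colon-separated list.

```shell
# check_remote_env (hypothetical helper): verifies the two variables from
# the steps above. Returns non-zero and prints a reason if either is unusable.
# Assumption: KUBECONFIG contains one path, not a colon-separated list.
check_remote_env() {
  status=0
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set" >&2
    status=1
  elif [ ! -f "$KUBECONFIG" ]; then
    echo "KUBECONFIG points to a missing file: $KUBECONFIG" >&2
    status=1
  fi
  if [ -z "${CONTAINER_REGISTRY:-}" ]; then
    echo "CONTAINER_REGISTRY is not set" >&2
    status=1
  fi
  return "$status"
}
```

A typical use would be `check_remote_env && make deploy`, so the deploy is skipped when the environment is incomplete.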
Important: All commands in this section must be run from the operator/ directory.
# Navigate to the operator directory (if not already there)
cd operator
# Optional: Deploy to a custom namespace
export NAMESPACE=custom-ns
# Deploy Grove operator and all resources
make deploy
This make target installs all relevant CRDs, builds grove-operator and grove-initc, and deploys the operator to the cluster.
You can configure the Grove operator by modifying the values.yaml.
This make target leverages Grove Helm charts and Skaffold to install the following resources to the cluster:
- CRDs:
  - Grove operator CRDs - podcliquesets.grove.io, podcliques.grove.io and podcliquescalinggroups.grove.io.
  - Grove Scheduler CRDs - podgangs.scheduler.grove.io.
- All Grove operator resources defined as a part of Grove Helm chart templates.
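One way to confirm that the CRDs listed above reached the cluster is to query each one by name. This is a sketch under the assumption that `kubectl` is on the PATH and KUBECONFIG points at the target cluster; the helper name `check_grove_crds` is hypothetical.

```shell
# CRD names exactly as listed in this guide.
GROVE_CRDS="podcliquesets.grove.io podcliques.grove.io podcliquescalinggroups.grove.io podgangs.scheduler.grove.io"

# check_grove_crds (hypothetical helper): prints one status line per CRD
# and returns non-zero if any of them is missing from the cluster.
check_grove_crds() {
  missing=0
  for crd in $GROVE_CRDS; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "ok       $crd"
    else
      echo "missing  $crd"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_grove_crds` after `make deploy` should print four `ok` lines on a healthy install.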
By default, Grove automatically generates and manages TLS certificates for its webhook server. For production environments, you may want to use certificates from your organization's PKI or a certificate manager like cert-manager.
See the Certificate Management Guide for detailed configuration options.
On clusters with NVIDIA MNNVL support, you can enable automatic Multi-Node NVLink for GPU workloads by setting config.network.autoMNNVLEnabled: true in the operator configuration (e.g. via Helm --set config.network.autoMNNVLEnabled=true). See the Auto MNNVL user guide for prerequisites, enabling the feature, and usage.
Follow the instructions in the quickstart guide to deploy a PodCliqueSet and validate your installation.
Grove does not provide automatic migration for existing ClusterTopology
resources.
If ClusterTopology resources already exist in the cluster:
- Re-create them manually as ClusterTopologyBinding resources.
- Delete the old ClusterTopology instances.
- Delete the old ClusterTopology CRD.
Example:
# 1. Verify any old ClusterTopology instances that still exist
kubectl get clustertopologies.grove.io
# 2. Re-create any ClusterTopology resources you want to keep as
# ClusterTopologyBinding resources
# 3. Delete the old ClusterTopology instances
kubectl delete clustertopologies.grove.io --all
# 4. Delete the old ClusterTopology CRD
kubectl delete crd clustertopologies.grove.io
This is expected to be a low-impact change because Grove has not yet had a release containing the update that allowed administrators to create ClusterTopology resources freely.
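Because the old instances are deleted and must be re-created by hand, it may be worth saving a copy of them first. The following sketch is not part of the documented procedure: the helper name `backup_cluster_topologies` and the default file name are assumptions made for illustration.

```shell
# backup_cluster_topologies (hypothetical helper): saves every existing
# ClusterTopology to a YAML file so it can be consulted when writing the
# replacement ClusterTopologyBinding resources by hand.
# $1 is the output file; the default name is a placeholder.
backup_cluster_topologies() {
  out="${1:-clustertopologies-backup.yaml}"
  if kubectl get clustertopologies.grove.io -o yaml > "$out"; then
    echo "saved ClusterTopology resources to $out"
  else
    rm -f "$out"
    echo "could not read ClusterTopology resources; nothing saved" >&2
    return 1
  fi
}
```

Run it before step 3 of the example above. The saved file is a reference only; the guide states that no automatic conversion to ClusterTopologyBinding is provided.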
helm template does not render the chart's crds/ directory unless
--include-crds is passed, so installs via helm template | kubectl apply,
ArgoCD, Flux, or Kustomize fail with missing CRDs by default. Set
crdInstaller.enabled=true to install and upgrade CRDs from an init
container instead.
In the same workflows you should also set webhookServerSecret.enabled=false:
the chart otherwise renders an empty grove-webhook-server-cert Secret that
overwrites the auto-generated TLS material on every re-apply or GitOps sync,
breaking the webhook. With it disabled, the operator creates and manages the
Secret itself.
helm template grove oci://ghcr.io/ai-dynamo/grove/grove-charts \
--version <version> \
--set crdInstaller.enabled=true \
--set webhookServerSecret.enabled=false \
| kubectl apply -f -
See GREP-436 for the CRD design details and the alternative --include-crds workflow.
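To check whether a rendered manifest still contains the empty webhook Secret described above, the output of `helm template` can be written to a file and scanned before applying it. This is a heuristic sketch: the helper name `has_webhook_secret` is hypothetical, and it assumes Helm renders `kind:` at the top level and the metadata `name:` with two-space indentation.

```shell
# has_webhook_secret (hypothetical helper): exits 0 if the rendered manifest
# in file $1 contains a Secret named grove-webhook-server-cert, else exits 1.
# Heuristic: matches a top-level "kind: Secret" line and a two-space-indented
# "name:" line within the same YAML document (documents are split on "---").
has_webhook_secret() {
  awk '
    function finish_doc() {
      if (is_secret && has_name) found = 1
      is_secret = 0; has_name = 0
    }
    /^---/ { finish_doc(); next }
    /^kind: *Secret *$/ { is_secret = 1 }
    /^  name: *grove-webhook-server-cert *$/ { has_name = 1 }
    END { finish_doc(); exit(found ? 0 : 1) }
  ' "$1"
}
```

For example, redirect the `helm template` command above to `rendered.yaml` instead of piping it to kubectl, then run `has_webhook_secret rendered.yaml`. A zero exit status suggests that webhookServerSecret.enabled=false was not applied.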
Cause: You're running the command from the wrong directory.
Solution: Ensure you're in the operator/ directory:
cd operator
make deploy
Cause: The KUBECONFIG environment variable is not set correctly.
Solution: Export the kubeconfig for your kind cluster:
kind get kubeconfig --name grove-test-cluster > hack/kind/kubeconfig
export KUBECONFIG=$(pwd)/hack/kind/kubeconfig
make deploy
Cause: Check the operator logs for specific errors.
Solution:
kubectl logs -l app.kubernetes.io/name=grove-operator
Cause: The operator ConfigMap is rendered with immutable: true by design and cannot be edited in place.
Solution: Change configuration via helm upgrade; a new ConfigMap is created and the operator rolls automatically.
helm upgrade grove oci://ghcr.io/ai-dynamo/grove/grove-charts \
--version <version> \
--set config.network.autoMNNVLEnabled=false
Cause: Gang scheduling requirements might not be met, or there aren't enough resources.
Solution:
- Check PodGang status:
kubectl get pg -o yaml
- Check if MinAvailable requirements can be satisfied by your cluster resources
- Check node resources:
kubectl describe nodes
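The three checks above can be gathered into one pass over a namespace. This is a rough sketch: the helper name `triage_pending` is hypothetical, the `default` namespace fallback is an assumption, and the eight lines of context printed per node are an arbitrary choice.

```shell
# triage_pending (hypothetical helper): runs the checks above for one
# namespace ($1, defaulting to "default").
triage_pending() {
  ns="${1:-default}"
  echo "== PodGangs in $ns =="
  kubectl get pg -n "$ns" -o wide
  echo "== Pending pods in $ns =="
  kubectl get pods -n "$ns" --field-selector=status.phase=Pending
  echo "== Allocated resources per node =="
  # Print only the allocation summary from each node description.
  kubectl describe nodes | grep -A 8 "Allocated resources" \
    || echo "no 'Allocated resources' section found"
}
```

Comparing the pending pods' resource requests with the per-node allocation summary gives a first indication of whether the MinAvailable requirement can be met.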
Cause: The resource name might be incorrect.
Solution: List the actual resource names first:
# For PodCliqueScalingGroups
kubectl get pcsg
# For PodCliqueSets
kubectl get pcs
Then use the exact name from the output.
Cause: HPA might not be created or metrics-server might be missing.
Solution:
- Verify HPA exists:
kubectl get hpa
- Check if metrics-server is running (required for HPA):
kubectl get deployment metrics-server -n kube-system
- For kind clusters, you may need to install metrics-server separately (choose one of the following methods):
  - Use the operator/Makefile target:
make deploy-addons
  - Manual setup:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
If you encounter issues not covered here:
- Check the GitHub Issues for similar problems
- Join the Grove mailing list
- Start a discussion thread
Currently the following schedulers support gang scheduling of PodGangs created by the Grove operator:
- kai-scheduler/kai-scheduler
- Topology Aware Scheduling (TAS) requires v0.14.0+