The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
- Kernel Version: 4.18.0-513.24.1.el8.9
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd 1.6.31
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): 1.25.16
- GPU Operator Version: 23.6.2
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
When using a precompiled driver and all GPU nodes are not ready, gpu-operator repeatedly deletes and recreates nvidia-driver-daemonset.
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
- Create a k8s cluster with one GPU node.
- Install gpu-operator and configure driver.usePrecompiled = true.
- Log in to the GPU node and make it not ready by running systemctl stop kubelet.

nvidia-driver-daemonset will be deleted and recreated, and this cycle continues until the GPU node is Ready.
When the node is not ready, it has the following taints:
Taints: node.kubernetes.io/unreachable:NoExecute
        node.kubernetes.io/unreachable:NoSchedule
However, the nvidia-driver-daemonset pod has these tolerations:
tolerations:
- effect: NoSchedule
  key: nvidia.com/gpu
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/disk-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/memory-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/pid-pressure
  operator: Exists
- effect: NoSchedule
  key: node.kubernetes.io/unschedulable
  operator: Exists
The node taint node.kubernetes.io/unreachable:NoSchedule is not tolerated, so .status.desiredNumberScheduled of nvidia-driver-daemonset is 0.
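The mismatch can be checked with a small program. This is a minimal sketch, not the scheduler's actual code: the Taint and Toleration structs are simplified stand-ins for the Kubernetes core/v1 types, and tolerates and untolerated are hypothetical helpers that approximate the documented matching rule (the effect must match or be empty, the key must match or be empty, and the operator must be Exists or Equal with the same value).

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes
// core/v1 types (assumption: only the fields used in this issue).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

type Toleration struct {
	Key      string
	Operator string // "Exists" or "Equal" (empty is treated as "Equal")
	Value    string
	Effect   string // empty matches every effect
}

// tolerates reports whether one toleration matches one taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true
	case "", "Equal":
		return tol.Value == t.Value
	}
	return false
}

// untolerated returns the taints that no toleration in tols matches.
func untolerated(tols []Toleration, taints []Taint) []Taint {
	var out []Taint
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// Taints on the node after kubelet is stopped, as listed in this issue.
	taints := []Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unreachable", Effect: "NoSchedule"},
	}
	// Tolerations of the nvidia-driver-daemonset pod, as listed in this issue.
	tols := []Toleration{
		{Key: "nvidia.com/gpu", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/disk-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/memory-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/pid-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/unschedulable", Operator: "Exists", Effect: "NoSchedule"},
	}
	for _, t := range untolerated(tols, taints) {
		fmt.Printf("not tolerated: %s:%s\n", t.Key, t.Effect)
	}
	// prints: not tolerated: node.kubernetes.io/unreachable:NoSchedule
}
```

The unreachable key is tolerated only for the NoExecute effect, so the NoSchedule taint with the same key is left unmatched.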
Following the logic of cleanupStalePrecompiledDaemonsets (gpu-operator/controllers/object_controls.go, lines 3689 to 3728 in a9e6a94), nvidia-driver-daemonset is deleted and then created again because the cluster still has GPU nodes.
// cleanupStalePrecompiledDaemonsets deletes stale driver daemonsets which can happen
// 1. If all nodes upgraded to the latest kernel
// 2. no GPU nodes are present
func (n ClusterPolicyController) cleanupStalePrecompiledDaemonsets(ctx context.Context) error {
	opts := []client.ListOption{
		client.MatchingLabels{
			precompiledIdentificationLabelKey: precompiledIdentificationLabelValue,
		},
	}
	list := &appsv1.DaemonSetList{}
	err := n.client.List(ctx, list, opts...)
	if err != nil {
		n.logger.Error(err, "could not get daemonset list")
		return err
	}

	for idx := range list.Items {
		name := list.Items[idx].ObjectMeta.Name
		desiredNumberScheduled := list.Items[idx].Status.DesiredNumberScheduled

		n.logger.V(1).Info("Driver DaemonSet found",
			"Name", name,
			"desiredNumberScheduled", desiredNumberScheduled)

		if desiredNumberScheduled != 0 {
			n.logger.Info("Driver DaemonSet active, keep it.",
				"Name", name, "Status.DesiredNumberScheduled", desiredNumberScheduled)
			continue
		}

		n.logger.Info("Delete Driver DaemonSet", "Name", name)

		err = n.client.Delete(ctx, &list.Items[idx])
		if err != nil {
			n.logger.Info("ERROR: Could not get delete DaemonSet",
				"Name", name, "Error", err)
		}
	}
	return nil
}
This does not appear to be normal behavior.
The temporary solution is to add the following configuration when installing gpu-operator:
daemonsets:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoSchedule
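4. Information to attach (optional if deemed irrelevant)
- kubectl get pods -n OPERATOR_NAMESPACE
- kubectl get ds -n OPERATOR_NAMESPACE
- kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):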
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com