When using a precompiled driver and all GPU nodes are not ready, gpu-operator repeatedly deletes and recreates `nvidia-driver-daemonset`

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._

_**Important Note:  NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case [here](https://enterprise-support.nvidia.com/s/create-case)**._


### 1. Quick Debug Information
* OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.9
* Kernel Version: 4.18.0-513.24.1.el8.9
* Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd 1.6.31
* K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): 1.25.16
* GPU Operator Version: 23.6.2


### 2. Issue or feature description
_Briefly explain the issue in terms of expected behavior and current behavior._

When using a precompiled driver and all GPU nodes are not ready, gpu-operator repeatedly deletes and recreates `nvidia-driver-daemonset`. The expected behavior is that the DaemonSet is left in place while the nodes are not ready.

### 3. Steps to reproduce the issue
_Detailed steps to reproduce the issue._

1. Create a k8s cluster with one GPU node.
2. Install gpu-operator and configure `driver.usePrecompiled = true`.
3. Log in to the GPU node and make it NotReady by running `systemctl stop kubelet`.
4. `nvidia-driver-daemonset` is deleted and recreated, and this cycle repeats until the GPU node is Ready again.

When the node is not ready, its taints look like this:
```
Taints:             node.kubernetes.io/unreachable:NoExecute
                    node.kubernetes.io/unreachable:NoSchedule
```
But the `nvidia-driver-daemonset` pod tolerations are:
```
  tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/pid-pressure
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    operator: Exists
```

The node taint `node.kubernetes.io/unreachable:NoSchedule` is not tolerated, so the `.status.desiredNumberScheduled` of `nvidia-driver-daemonset` is 0.
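To make the mismatch concrete, here is a minimal sketch that checks the taints above against the tolerations above. The `Taint` and `Toleration` types and the `tolerates`/`untolerated` helpers are simplified stand-ins written for this illustration, not the Kubernetes `core/v1` types or scheduler code, and only the `Exists` operator is modeled.

```go
package main

import "fmt"

// Taint and Toleration are simplified, hypothetical stand-ins for the
// Kubernetes core/v1 types; only the fields used in this report are modeled.
type Taint struct {
	Key    string
	Effect string
}

type Toleration struct {
	Key      string
	Operator string // only "Exists" is handled in this sketch
	Effect   string // an empty effect matches every effect
}

// tolerates reports whether one toleration covers one taint.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Operator != "Exists" {
		return false
	}
	keyOK := tol.Key == "" || tol.Key == t.Key
	effectOK := tol.Effect == "" || tol.Effect == t.Effect
	return keyOK && effectOK
}

// untolerated returns the taints that no toleration in tols covers.
func untolerated(taints []Taint, tols []Toleration) []Taint {
	var out []Taint
	for _, t := range taints {
		covered := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				covered = true
				break
			}
		}
		if !covered {
			out = append(out, t)
		}
	}
	return out
}

// nodeTaints returns the taints shown on the NotReady node in this report.
func nodeTaints() []Taint {
	return []Taint{
		{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unreachable", Effect: "NoSchedule"},
	}
}

// driverTolerations returns the tolerations listed on the driver pod.
func driverTolerations() []Toleration {
	return []Toleration{
		{Key: "nvidia.com/gpu", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute"},
		{Key: "node.kubernetes.io/disk-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/memory-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/pid-pressure", Operator: "Exists", Effect: "NoSchedule"},
		{Key: "node.kubernetes.io/unschedulable", Operator: "Exists", Effect: "NoSchedule"},
	}
}

func main() {
	for _, t := range untolerated(nodeTaints(), driverTolerations()) {
		fmt.Printf("not tolerated: %s:%s\n", t.Key, t.Effect)
	}
}
```

Under these assumptions the only taint left without a matching toleration is `node.kubernetes.io/unreachable:NoSchedule`, which is consistent with the DaemonSet having no node it can be scheduled on.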

Following the logic of `cleanupStalePrecompiledDaemonsets`, `nvidia-driver-daemonset` will be deleted and then created again because the cluster still has GPU nodes.

https://github.com/NVIDIA/gpu-operator/blob/a9e6a947216518e5940c21523c2400a2f8f4def5/controllers/object_controls.go#L3689-L3728
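The resulting churn can be illustrated with a small simulation. This is only a sketch of the behavior described in this report: the `cluster` and `daemonSet` types, the `reconcile` method, and the ordering of its steps are hypothetical simplifications, not the operator's actual reconcile code.

```go
package main

import "fmt"

// daemonSet is a hypothetical stand-in that keeps only the status field the
// cleanup logic inspects.
type daemonSet struct {
	desiredNumberScheduled int
}

// cluster is a hypothetical model of the state relevant to this report.
type cluster struct {
	gpuNodes         int        // GPU nodes present in the cluster
	schedulableNodes int        // GPU nodes whose taints the driver pod tolerates
	ds               *daemonSet // nil when no precompiled driver DaemonSet exists
}

// reconcile models one pass and returns the actions it took.
func (c *cluster) reconcile() []string {
	var events []string

	// Assumption: the DaemonSet status reflects current schedulability.
	if c.ds != nil {
		c.ds.desiredNumberScheduled = c.schedulableNodes
	}

	// Cleanup step: a DaemonSet with DesiredNumberScheduled == 0 is treated
	// as stale and deleted.
	if c.ds != nil && c.ds.desiredNumberScheduled == 0 {
		c.ds = nil
		events = append(events, "delete")
	}

	// Create step: GPU nodes still exist, so the DaemonSet is created again.
	if c.ds == nil && c.gpuNodes > 0 {
		c.ds = &daemonSet{desiredNumberScheduled: c.schedulableNodes}
		events = append(events, "create")
	}
	return events
}

func main() {
	// One GPU node that has become unreachable, with an existing DaemonSet.
	c := &cluster{gpuNodes: 1, schedulableNodes: 0, ds: &daemonSet{desiredNumberScheduled: 1}}
	for pass := 1; pass <= 3; pass++ {
		fmt.Printf("pass %d (node unreachable): %v\n", pass, c.reconcile())
	}

	// The node becomes Ready again, so the DaemonSet has a schedulable node.
	c.schedulableNodes = 1
	fmt.Printf("pass 4 (node ready): %v\n", c.reconcile())
}
```

In this model every pass while the node is unreachable produces a delete followed by a create, and the churn stops only once the node is schedulable again, which matches the behavior observed in step 4.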

This does not appear to be normal behavior.

A temporary workaround is to add the following configuration when installing gpu-operator, so that the driver pods also tolerate the `NoSchedule` variant of the unreachable taint:
```
daemonsets:
  tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoSchedule
```

### 4. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [ ] kubernetes pods status: `kubectl get pods -n OPERATOR_NAMESPACE`
 - [ ] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
 - [ ] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
 - [ ] If a pod/ds is in an error state or pending state `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
 - [ ] Output from running `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
 - [ ] containerd logs `journalctl -u containerd > containerd.log`


Collecting full debug bundle (optional):

```
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh
```
**NOTE**: please refer to the [must-gather](https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh) script for debug data collected.

This bundle can be submitted to us via email: **operator_feedback@nvidia.com**


	// cleanupStalePrecompiledDaemonsets deletes stale driver daemonsets which can happen
	// 1. If all nodes upgraded to the latest kernel
	// 2. no GPU nodes are present
	func (n ClusterPolicyController) cleanupStalePrecompiledDaemonsets(ctx context.Context) error {
		opts := []client.ListOption{
			client.MatchingLabels{
				precompiledIdentificationLabelKey: precompiledIdentificationLabelValue,
			},
		}
		list := &appsv1.DaemonSetList{}
		err := n.client.List(ctx, list, opts...)
		if err != nil {
			n.logger.Error(err, "could not get daemonset list")
			return err
		}

		for idx := range list.Items {
			name := list.Items[idx].ObjectMeta.Name
			desiredNumberScheduled := list.Items[idx].Status.DesiredNumberScheduled

			n.logger.V(1).Info("Driver DaemonSet found",
				"Name", name,
				"desiredNumberScheduled", desiredNumberScheduled)

			if desiredNumberScheduled != 0 {
				n.logger.Info("Driver DaemonSet active, keep it.",
					"Name", name, "Status.DesiredNumberScheduled", desiredNumberScheduled)
				continue
			}

			n.logger.Info("Delete Driver DaemonSet", "Name", name)

			err = n.client.Delete(ctx, &list.Items[idx])
			if err != nil {
				n.logger.Info("ERROR: Could not get delete DaemonSet",
					"Name", name, "Error", err)
			}
		}
		return nil
	}
