{"level":"info","ts":1743073635.4404998,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1743073635.4405034,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1743073635.4405065,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1743073635.4405098,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1743073635.4405127,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1743073635.4405162,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1743073635.44052,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1743073635.440523,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1743073635.4405265,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1743073635.4405298,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1743073635.440533,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1743073635.4405365,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1743073635.4405396,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1743073635.440543,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1743073635.4405465,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
{"level":"info","ts":1743073635.442196,"logger":"state.state-driver","msg":"Object is ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Kind:":"RoleBinding","Name":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.442217,"logger":"state.state-driver","msg":"Checking object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Kind:":"ClusterRoleBinding","Name":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.4422336,"logger":"state.state-driver","msg":"Get Object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Namespace:":"","Name:":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.4450245,"logger":"state.state-driver","msg":"Object is ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Kind:":"ClusterRoleBinding","Name":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.445047,"logger":"state.state-driver","msg":"Checking object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Kind:":"DaemonSet","Name":"nvidia-gpu-driver-ubuntu20.04-5df58685dd"}
{"level":"info","ts":1743073635.4451218,"logger":"state.state-driver","msg":"Get Object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Namespace:":"gpu-operator","Name:":"nvidia-gpu-driver-ubuntu20.04-5df58685dd"}
{"level":"debug","ts":1743073635.4489393,"logger":"state.state-driver","msg":"Check daemonset state","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","DesiredNodes:":0,"CurrentNodes:":0,"PodsAvailable:":0,"PodsUnavailable:":0,"UpdatedPodsScheduled":0,"PodsReady:":0,"Conditions:":null}
{"level":"info","ts":1743073635.4489691,"logger":"state.state-driver","msg":"Object is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754","Kind:":"DaemonSet","Name":"nvidia-gpu-driver-ubuntu20.04-5df58685dd"}
{"level":"info","ts":1743073635.4489768,"msg":"Sync not Done for custom resource","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754"}
{"level":"info","ts":1743073635.449008,"msg":"NVIDIADriver instance is not ready","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"6ddf94ad-328a-4459-91c0-cb49e800b754"}
{"level":"info","ts":1743073635.4575124,"msg":"Reconciling NVIDIADriver","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f"}
{"level":"info","ts":1743073635.4576921,"msg":"Syncing system state","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f"}
{"level":"info","ts":1743073635.4577005,"msg":"Sync State","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","Name":"state-driver","Description":"NVIDIA driver deployed in the cluster"}
{"level":"info","ts":1743073635.4577074,"logger":"state.state-driver","msg":"Cleaning up stale driver DaemonSets","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f"}
{"level":"info","ts":1743073635.4577606,"logger":"state.state-driver","msg":"Deleting inactive driver DaemonSet","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","Name":"nvidia-gpu-driver-ubuntu20.04-5df58685dd"}
{"level":"info","ts":1743073635.4634998,"logger":"controllers.Upgrade","msg":"Reconciling Upgrade","upgrade":{"name":"cluster-policy"}}
{"level":"info","ts":1743073635.4635444,"logger":"controllers.Upgrade","msg":"Using label selector","upgrade":{"name":"cluster-policy"},"key":"app.kubernetes.io/component","value":"nvidia-driver"}
{"level":"info","ts":1743073635.4635508,"logger":"controllers.Upgrade","msg":"Building state"}
{"level":"info","ts":1743073635.4650958,"logger":"state.state-driver","msg":"Detected new node pool","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","NodePool":{}}
{"level":"debug","ts":1743073635.467486,"logger":"controllers.Upgrade","msg":"Got driver DaemonSets","length":0}
{"level":"info","ts":1743073635.471859,"logger":"state.state-driver","msg":"Rendering manifests for node pool","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","NodePool":"ubuntu20.04"}
{"level":"debug","ts":1743073635.471877,"logger":"state.state-driver","msg":"Rendering objects","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","data:":{"Driver":{"Spec":{"driverType":"gpu","usePrecompiled":false,"kernelModuleType":"auto","startupProbe":{"initialDelaySeconds":60,"timeoutSeconds":60,"periodSeconds":10,"failureThreshold":120},"rdma":{"enabled":false,"useHostMofed":false},"gdrcopy":{"enabled":false,"repository":"nvcr.io/nvidia/cloud-native","image":"gdrdrv","version":"v2.4.4","imagePullPolicy":"IfNotPresent"},"repository":"nvcr.io/nvidia","image":"driver","version":"570.124.06","manager":{"repository":"nvcr.io/nvidia/cloud-native","image":"k8s-driver-manager","version":"v0.8.0","imagePullPolicy":"IfNotPresent","env":[{"name":"ENABLE_GPU_POD_EVICTION","value":"true"},{"name":"ENABLE_AUTO_DRAIN","value":"false"},{"name":"DRAIN_USE_FORCE","value":"false"},{"name":"DRAIN_POD_SELECTOR_LABEL"},{"name":"DRAIN_TIMEOUT_SECONDS","value":"0s"},{"name":"DRAIN_DELETE_EMPTYDIR_DATA","value":"false"}]},"nodeSelector":{"feature.node.kubernetes.io/system-os_release.ID":"ubuntu","feature.node.kubernetes.io/system-os_release.VERSION_ID":"20.04","nvidia.com/gpu.present":"true"},"tolerations":[{"operator":"Exists"}]},"AppName":"nvidia-gpu-driver-ubuntu20.04-5df58685dd","Name":"nvidia-gpu-driver-default-ubuntu20.04","ImagePath":"nvcr.io/nvidia/driver:570.124.06-ubuntu20.04","ManagerImagePath":"nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0","OCPToolkitEnabled":false,"OSVersion":"ubuntu20.04"},"GDS":null,"GPUDirectRDMA":{"enabled":false,"useHostMofed":false},"GDRCopy":null,"Runtime":{"Namespace":"gpu-operator","KubernetesVersion":"v1.26.7","OpenshiftVersion":"","OpenshiftDriverToolkitEnabled":false,"OpenshiftDriverToolkitImages":null,"OpenshiftProxySpec":null,"NodePools":[{}]},"Openshift":null,"Precompiled":null,"AdditionalConfigs":{"VolumeMounts":null,"Volumes":nu
ll},"HostRoot":"/"}}
{"level":"debug","ts":1743073635.4764528,"logger":"state.state-driver","msg":"Rendered","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","objects:":[{"apiVersion":"v1","kind":"ServiceAccount","metadata":{"name":"nvidia-gpu-driver-default-ubuntu20.04","namespace":"gpu-operator"}},{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"name":"nvidia-gpu-driver-default-ubuntu20.04","namespace":"gpu-operator"},"rules":[{"apiGroups":["security.openshift.io"],"resourceNames":["privileged"],"resources":["securitycontextconstraints"],"verbs":["use"]}]},{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRole","metadata":{"name":"nvidia-gpu-driver-default-ubuntu20.04"},"rules":[{"apiGroups":["config.openshift.io"],"resources":["clusterversions"],"verbs":["get","list"]},{"apiGroups":[""],"resources":["nodes"],"verbs":["get","list","patch","update","watch"]},{"apiGroups":[""],"resources":["pods"],"verbs":["get","list","watch"]},{"apiGroups":[""],"resources":["pods/eviction"],"verbs":["create"]},{"apiGroups":["apps"],"resources":["daemonsets"],"verbs":["get"]}]},{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","metadata":{"name":"nvidia-gpu-driver-default-ubuntu20.04","namespace":"gpu-operator"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"Role","name":"nvidia-gpu-driver-default-ubuntu20.04"},"subjects":[{"kind":"ServiceAccount","name":"nvidia-gpu-driver-default-ubuntu20.04","namespace":"gpu-operator"}]},{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"ClusterRoleBinding","metadata":{"name":"nvidia-gpu-driver-default-ubuntu20.04"},"roleRef":{"apiGroup":"rbac.authorization.k8s.io","kind":"ClusterRole","name":"nvidia-gpu-driver-default-ubuntu20.04"},"subjects":[{"kind":"ServiceAccount","name":"nvidia-gpu-driver-default-ubuntu20.04","namespace":"gpu-operator"}]},{"apiVersion":"apps/v1","kind":"DaemonSet","metadata":{"anno
tations":{"openshift.io/scc":"nvidia-gpu-driver-default-ubuntu20.04"},"labels":{"app":"nvidia-gpu-driver-ubuntu20.04-5df58685dd","app.kubernetes.io/component":"nvidia-driver","nvidia.com/node.os-version":"ubuntu20.04","nvidia.com/precompiled":"false"},"name":"nvidia-gpu-driver-ubuntu20.04-5df58685dd","namespace":"gpu-operator"},"spec":{"selector":{"matchLabels":{"app":"nvidia-gpu-driver-ubuntu20.04-5df58685dd"}},"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/default-container":"nvidia-driver-ctr"},"labels":{"app":"nvidia-gpu-driver-ubuntu20.04-5df58685dd","app.kubernetes.io/component":"nvidia-driver","nvidia.com/node.os-version":"ubuntu20.04","nvidia.com/precompiled":"false"}},"spec":{"affinity":{"podAntiAffinity":{"requiredDuringSchedulingIgnoredDuringExecution":[{"labelSelector":{"matchExpressions":[{"key":"app.kubernetes.io/component","operator":"In","values":["nvidia-driver","nvidia-vgpu-manager"]}]},"topologyKey":"kubernetes.io/hostname"}]}},"containers":[{"args":["init"],"command":["nvidia-driver"],"env":[{"name":"NVIDIA_VISIBLE_DEVICES","value":"void"},{"name":"NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}},{"name":"NODE_IP","valueFrom":{"fieldRef":{"fieldPath":"status.hostIP"}}},{"name":"KERNEL_MODULE_TYPE","value":"auto"}],"image":"nvcr.io/nvidia/driver:570.124.06-ubuntu20.04","imagePullPolicy":"IfNotPresent","lifecycle":{"preStop":{"exec":{"command":["/bin/sh","-c","rm -f /run/nvidia/validations/.driver-ctr-ready"]}}},"name":"nvidia-driver-ctr","securityContext":{"privileged":true,"seLinuxOptions":{"level":"s0"}},"startupProbe":{"exec":{"command":["sh","-c","nvidia-smi \u0026\u0026 touch 
/run/nvidia/validations/.driver-ctr-ready"]},"failureThreshold":120,"initialDelaySeconds":60,"periodSeconds":10,"successThreshold":0,"timeoutSeconds":60},"volumeMounts":[{"mountPath":"/run/nvidia","mountPropagation":"Bidirectional","name":"run-nvidia"},{"mountPath":"/run/nvidia-fabricmanager","name":"run-nvidia-fabricmanager"},{"mountPath":"/run/nvidia-topologyd","name":"run-nvidia-topologyd"},{"mountPath":"/var/log","name":"var-log"},{"mountPath":"/dev/log","name":"dev-log"},{"mountPath":"/host-etc/os-release","name":"host-os-release","readOnly":true},{"mountPath":"/run/mellanox/drivers/usr/src","mountPropagation":"HostToContainer","name":"mlnx-ofed-usr-src"},{"mountPath":"/run/mellanox/drivers","mountPropagation":"HostToContainer","name":"run-mellanox-drivers"},{"mountPath":"/sys/module/firmware_class/parameters/path","name":"firmware-search-path"},{"mountPath":"/sys/devices/system/memory/auto_online_blocks","name":"sysfs-memory-online"},{"mountPath":"/lib/firmware","name":"nv-firmware"}]}],"hostPID":true,"initContainers":[{"args":["uninstall_driver"],"command":["driver-manager"],"env":[{"name":"NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}},{"name":"NVIDIA_VISIBLE_DEVICES","value":"void"},{"name":"ENABLE_GPU_POD_EVICTION","value":"true"},{"name":"ENABLE_AUTO_DRAIN","value":"true"},{"name":"DRAIN_USE_FORCE","value":"false"},{"name":"DRAIN_POD_SELECTOR_LABEL","value":""},{"name":"DRAIN_TIMEOUT_SECONDS","value":"0s"},{"name":"DRAIN_DELETE_EMPTYDIR_DATA","value":"false"},{"name":"OPERATOR_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}},{"name":"ENABLE_GPU_POD_EVICTION","value":"true"},{"name":"ENABLE_AUTO_DRAIN","value":"false"},{"name":"DRAIN_USE_FORCE","value":"false"},{"name":"DRAIN_POD_SELECTOR_LABEL","value":""},{"name":"DRAIN_TIMEOUT_SECONDS","value":"0s"},{"name":"DRAIN_DELETE_EMPTYDIR_DATA","value":"false"}],"image":"nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.8.0","imagePullPolicy":"IfNotPresent","name":"k8s-
driver-manager","securityContext":{"privileged":true},"volumeMounts":[{"mountPath":"/run/nvidia","mountPropagation":"Bidirectional","name":"run-nvidia"},{"mountPath":"/host","mountPropagation":"HostToContainer","name":"host-root","readOnly":true},{"mountPath":"/sys","name":"host-sys"},{"mountPath":"/run/mellanox/drivers","mountPropagation":"HostToContainer","name":"run-mellanox-drivers"}]}],"nodeSelector":{"feature.node.kubernetes.io/system-os_release.ID":"ubuntu","feature.node.kubernetes.io/system-os_release.VERSION_ID":"20.04","nvidia.com/gpu.deploy.driver":"true","nvidia.com/gpu.present":"true"},"priorityClassName":"system-node-critical","serviceAccountName":"nvidia-gpu-driver-default-ubuntu20.04","tolerations":[{"effect":"NoSchedule","key":"nvidia.com/gpu","operator":"Exists"},{"operator":"Exists"}],"volumes":[{"hostPath":{"path":"/run/nvidia","type":"DirectoryOrCreate"},"name":"run-nvidia"},{"hostPath":{"path":"/var/log"},"name":"var-log"},{"hostPath":{"path":"/dev/log"},"name":"dev-log"},{"hostPath":{"path":"/etc/os-release"},"name":"host-os-release"},{"hostPath":{"path":"/run/nvidia-fabricmanager","type":"DirectoryOrCreate"},"name":"run-nvidia-fabricmanager"},{"hostPath":{"path":"/run/nvidia-topologyd","type":"DirectoryOrCreate"},"name":"run-nvidia-topologyd"},{"hostPath":{"path":"/run/mellanox/drivers/usr/src","type":"DirectoryOrCreate"},"name":"mlnx-ofed-usr-src"},{"hostPath":{"path":"/run/mellanox/drivers","type":"DirectoryOrCreate"},"name":"run-mellanox-drivers"},{"hostPath":{"path":"/run/nvidia/validations","type":"DirectoryOrCreate"},"name":"run-nvidia-validations"},{"hostPath":{"path":"/"},"name":"host-root"},{"hostPath":{"path":"/sys","type":"Directory"},"name":"host-sys"},{"hostPath":{"path":"/sys/module/firmware_class/parameters/path"},"name":"firmware-search-path"},{"hostPath":{"path":"/sys/devices/system/memory/auto_online_blocks"},"name":"sysfs-memory-online"},{"hostPath":{"path":"/run/nvidia/driver/lib/firmware","type":"DirectoryOrCreate"},"name
":"nv-firmware"}]}},"updateStrategy":{"type":"OnDelete"}}}]}
{"level":"info","ts":1743073635.4767663,"logger":"state.state-driver","msg":"Handling manifest object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","Kind:":"ServiceAccount","Name":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.4767883,"logger":"state.state-driver","msg":"Creating Object","controller":"nvidia-driver-controller","object":{"name":"default"},"namespace":"","name":"default","reconcileID":"49589689-b089-4491-9060-5061d14da31f","Namespace:":"gpu-operator","Name:":"nvidia-gpu-driver-default-ubuntu20.04"}
{"level":"info","ts":1743073635.477187,"logger":"controllers.Upgrade","msg":"Total orphaned Pods found:","count":0}
{"level":"info","ts":1743073635.4772024,"logger":"controllers.Upgrade","msg":"Propagate state to state manager","upgrade":{"name":"cluster-policy"}}
{"level":"debug","ts":1743073635.477206,"logger":"controllers.Upgrade","msg":"Current cluster upgrade state","upgrade":{"name":"cluster-policy"},"state":{"NodeStates":{}}}
{"level":"info","ts":1743073635.477216,"logger":"controllers.Upgrade","msg":"State Manager, got state update"}
{"level":"info","ts":1743073635.4772196,"logger":"controllers.Upgrade","msg":"Node states:","Unknown":0,"upgrade-done":0,"upgrade-required":0,"cordon-required":0,"wait-for-jobs-required":0,"pod-deletion-required":0,"upgrade-failed":0,"drain-required":0,"pod-restart-required":0,"validation-required":0,"uncordon-required":0}
{"level":"info","ts":1743073635.4772274,"logger":"controllers.Upgrade","msg":"Upgrades in progress","currently in progress":0,"max parallel upgrades":1,"upgrade slots available":0,"currently unavailable nodes":0,"total number of nodes":0,"maximum nodes that can be unavailable":0}
{"level":"info","ts":1743073635.477233,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1743073635.4772358,"logger":"controllers.Upgrade","msg":"ProcessDoneOrUnknownNodes"}
{"level":"info","ts":1743073635.4772382,"logger":"controllers.Upgrade","msg":"ProcessUpgradeRequiredNodes"}
{"level":"info","ts":1743073635.477241,"logger":"controllers.Upgrade","msg":"ProcessCordonRequiredNodes"}
{"level":"info","ts":1743073635.4772437,"logger":"controllers.Upgrade","msg":"ProcessWaitForJobsRequiredNodes"}
{"level":"info","ts":1743073635.4772465,"logger":"controllers.Upgrade","msg":"ProcessPodDeletionRequiredNodes"}
{"level":"info","ts":1743073635.4772494,"logger":"controllers.Upgrade","msg":"ProcessDrainNodes"}
{"level":"info","ts":1743073635.4772515,"logger":"controllers.Upgrade","msg":"Node drain is disabled by policy, skipping this step"}
{"level":"info","ts":1743073635.4772546,"logger":"controllers.Upgrade","msg":"ProcessPodRestartNodes"}
{"level":"info","ts":1743073635.477258,"logger":"controllers.Upgrade","msg":"Starting Pod Delete"}
{"level":"info","ts":1743073635.4772606,"logger":"controllers.Upgrade","msg":"No pods scheduled to restart"}
{"level":"info","ts":1743073635.4772635,"logger":"controllers.Upgrade","msg":"ProcessUpgradeFailedNodes"}
{"level":"info","ts":1743073635.4772658,"logger":"controllers.Upgrade","msg":"ProcessValidationRequiredNodes"}
{"level":"info","ts":1743073635.4772687,"logger":"controllers.Upgrade","msg":"ProcessUncordonRequiredNodes"}
{"level":"info","ts":1743073635.4772716,"logger":"controllers.Upgrade","msg":"State Manager, finished processing"}
Hello,
We have been using the GPU Operator in our Kubernetes cluster with the ClusterPolicy CRD to manage the NVIDIA driver DaemonSets, and everything has worked fine for a long time. We would now like to use the NVIDIADriver CRD to manage the NVIDIA drivers instead. I tried deploying the GPU Operator with the following parameters:
My goal was to make sure that the NVIDIADriver resource scheme works with the default Helm chart settings. The NVIDIADriver resource was created, and the old nvidia-gpu-driver pods were terminated. The operator then began trying to create new nvidia-gpu-driver pods under the control of the NVIDIADriver resource. However, each new nvidia-gpu-driver pod was terminated as soon as it was created. There were several attempts per second to create a new pod, and each one was deleted instantly. I could see this in the OpenLens window (please see the attachment).
In the operator logs, I saw this endless loop:
The operator could remain in this state for a long time. Sometimes the driver pods would eventually start successfully after several hours, and sometimes they would not. I tried restarting the GPU Operator pod and completely deleting all operator resources from the cluster (including the entire GPU Operator namespace and all CRDs), but none of these actions changed the behavior.
I have tested this with the latest release versions (25.3.0, 24.9.2, 24.9.1, 24.6.2, and 24.6.1), and the behavior was the same across all versions.
I searched the GitHub issues for similar problems but found nothing. Could you please help with this problem?