After upgrading the NVIDIA driver from 560.35.05 to 570.124.06, containers suddenly started getting stuck in the "pending" state on the nodes running the NVIDIA driver and the GPU Operator. Non-GPU nodes were unaffected.
It was suggested that this triggers an issue in Kubernetes where a call to a device plugin's gRPC server hangs forever, as described here: kubernetes/kubernetes#130855 (comment)
The problem appears to have gone away after downgrading the driver to 560.35.05. Another user suggested that switching off MIG also makes the problem go away on the 570 driver: kubernetes/kubernetes#130855 (comment)
Note: our setup also used the driver installed on the host, not the driver deployed by the GPU Operator.
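
To illustrate the failure mode suggested above (a caller blocking forever on a call that never returns), here is a minimal Python sketch. It is not kubelet or device plugin code: `hypothetical_hanging_plugin_call` and `call_with_deadline` are made-up names for illustration, and the hang is simulated with an event that is never set. It only shows why a call without a deadline can stall its caller indefinitely, and how bounding the wait lets the caller recover.

```python
import threading


def hypothetical_hanging_plugin_call():
    """Stand-in for a device plugin RPC that never answers (assumption)."""
    threading.Event().wait()  # the event is never set, so this blocks forever


def call_with_deadline(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Returns (True, value) if fn finished in time, (False, None) otherwise.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    # A daemon thread does not keep the process alive if it never finishes.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        return False, None  # deadline exceeded, the call is still stuck
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["value"]


if __name__ == "__main__":
    print(call_with_deadline(lambda: "allocated", 1.0))
    print(call_with_deadline(hypothetical_hanging_plugin_call, 0.5))
```

Without the bounded `join`, the second call would never return, which matches the symptom of work stuck behind a device plugin that stops responding. A real gRPC client would normally set the deadline on the call itself rather than wrap it in a thread.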
Software setup:
Host OS Ubuntu 22.04
Kubernetes v1.31.7 (official APT install)
Containerd 1.7.25 bcc810d6b9066471b0b6fa75f557a15a1cbf31bb
CUDA 12.8 Update 1 (Driver 570.124.06)
NVIDIA GPU Operator 24.9.2 (Helm)
Longhorn Storage 1.7.4 (Helm)
Hardware setup:
CPU: 2x AMD (64 cores each)
RAM: 2TB
Disk: 5x 30TB NVMe (four for Longhorn, one for Containerd), 2x 500GB NVMe for the system (software RAID1)
GPU: 8x A100 80GB
Network: 10Gbps Ethernet