gpu-operator install fails with driver pod errors 'Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms''

### 1. Quick Debug Information
* OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.6
* Kernel Version: 4.18.0-372.9.1.el8.x86_64
* Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): cri-o://1.26.4
* K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8S 1.27.1
* GPU Operator Version: 23.9.x


### 2. Issue or feature description
I am trying to install the GPU Operator using Helm. During installation, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below.
The pod logs follow; I omitted the initial part and included only the error lines.
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories.  Ensure that the cluster can access the proper networks.
+ echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
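
For anyone trying to reproduce this, here is a minimal sketch of how the failing repo id could be pulled out of the driver container logs. `failed_repo` is a hypothetical helper name I made up for illustration; it is not part of the GPU Operator or the driver image. It only matches the `Failed to download metadata for repo '...'` message shown above.

```shell
# Hypothetical helper (not part of the GPU Operator): read driver container
# logs on stdin and print the id of each repo whose metadata download failed.
failed_repo() {
  sed -n "s/.*Failed to download metadata for repo '\([^']*\)'.*/\1/p"
}

# Assumed usage against the pod from this report. --previous reads the logs of
# the last terminated container instance, which is needed here because the
# container keeps restarting:
# kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr --previous | failed_repo
```

If the helper prints nothing, the logs contain no message of that form, which would suggest the container is failing at a different step.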


 - [ ] kubernetes pods status: `kubectl get pods -n gpu-operator`
 gpu-feature-discovery-zqm9h                                       0/1     Init:0/1           0                86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9       1/1     Running            0                93m
gpu-operator-68d85f45d-v97fz                                      1/1     Running            0                93m
nvidia-container-toolkit-daemonset-kqmtx                          0/1     Init:0/1           0                86m
nvidia-dcgm-exporter-5ncg7                                        0/1     Init:0/1           0                86m
nvidia-device-plugin-daemonset-qmvhc                              0/1     Init:0/1           0                86m
nvidia-driver-daemonset-fwcvl                                     0/1     CrashLoopBackOff   19 (3m20s ago)   87m
nvidia-operator-validator-vcztn                                   0/1     Init:0/4           0                86m

 - [ ] kubernetes daemonset status: `kubectl get ds -n OPERATOR_NAMESPACE`
 NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   94m
gpu-operator-1700756391-node-feature-discovery-worker   4         4         4       4            4           <none>                                             94m
nvidia-container-toolkit-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       94m
nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           94m
nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           94m
nvidia-driver-daemonset                                 1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  94m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             94m
nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      94m
 - [ ] If a pod/ds is in an error state or pending state `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
 k describe po nvidia-driver-daemonset-fwcvl
Name:                 nvidia-driver-daemonset-fwcvl
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 lab-worker-4/172.21.1.70
Start Time:           Thu, 23 Nov 2023 11:26:21 -0500
Labels:               app=nvidia-driver-daemonset
                      app.kubernetes.io/component=nvidia-driver
                      app.kubernetes.io/managed-by=gpu-operator
                      controller-revision-hash=5954d75477
                      helm.sh/chart=gpu-operator-v23.9.0
                      nvidia.com/precompiled=false
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
                      cni.projectcalico.org/podIP: 192.168.148.114/32
                      cni.projectcalico.org/podIPs: 192.168.148.114/32
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status:               Running
IP:                   192.168.148.114
IPs:
  IP:           192.168.148.114
Controlled By:  DaemonSet/nvidia-driver-daemonset
Init Containers:
  k8s-driver-manager:
    Container ID:  cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 23 Nov 2023 11:26:22 -0500
      Finished:     Thu, 23 Nov 2023 11:26:54 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           false
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
    Image:         nvcr.io/nvidia/driver:525.125.06-rhel8.6
    Image ID:      nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      init
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 23 Nov 2023 12:49:50 -0500
      Finished:     Thu, 23 Nov 2023 12:50:24 -0500
    Ready:          False
    Restart Count:  19
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  kube-api-access-qphz2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)

Any help with this issue would be very much appreciated.
