
gpu-operator install fails with driver pod errors 'Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms'' #616

@aneesh786


1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.6
  • Kernel Version: 4.18.0-372.9.1.el8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): cri-o://1.26.4
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8S 1.27.1
  • GPU Operator Version: 23.9.x

2. Issue or feature description

I am trying to install the GPU Operator using Helm. During the install, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below.
Below are the pod logs. I omitted the initial part and included only the error lines.

    + '[' '' '!=' builtin ']'
    Updating the package cache...
    + echo 'Updating the package cache...'
    + yum -q makecache
    Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
    FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
    + echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
  • kubernetes pods status: kubectl get pods -n gpu-operator
    gpu-feature-discovery-zqm9h                                      0/1  Init:0/1          0               86m
    gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2  1/1  Running           0               93m
    gpu-operator-1700756391-node-feature-discovery-master-79796bzcb  1/1  Running           0               93m
    gpu-operator-1700756391-node-feature-discovery-worker-6ddld      1/1  Running           0               93m
    gpu-operator-1700756391-node-feature-discovery-worker-8c2k4      1/1  Running           0               93m
    gpu-operator-1700756391-node-feature-discovery-worker-nzd7b      1/1  Running           0               93m
    gpu-operator-1700756391-node-feature-discovery-worker-x8nx9      1/1  Running           0               93m
    gpu-operator-68d85f45d-v97fz                                     1/1  Running           0               93m
    nvidia-container-toolkit-daemonset-kqmtx                         0/1  Init:0/1          0               86m
    nvidia-dcgm-exporter-5ncg7                                       0/1  Init:0/1          0               86m
    nvidia-device-plugin-daemonset-qmvhc                             0/1  Init:0/1          0               86m
    nvidia-driver-daemonset-fwcvl                                    0/1  CrashLoopBackOff  19 (3m20s ago)  87m
    nvidia-operator-validator-vcztn                                  0/1  Init:0/4          0               86m

  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
    NAME                                                   DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR                                     AGE
    gpu-feature-discovery                                  1        1        0      1           0          nvidia.com/gpu.deploy.gpu-feature-discovery=true  94m
    gpu-operator-1700756391-node-feature-discovery-worker  4        4        4      4           4          <none>                                            94m
    nvidia-container-toolkit-daemonset                     1        1        0      1           0          nvidia.com/gpu.deploy.container-toolkit=true      94m
    nvidia-dcgm-exporter                                   1        1        0      1           0          nvidia.com/gpu.deploy.dcgm-exporter=true          94m
    nvidia-device-plugin-daemonset                         1        1        0      1           0          nvidia.com/gpu.deploy.device-plugin=true          94m
    nvidia-driver-daemonset                                1        1        0      1           0          nvidia.com/gpu.deploy.driver=true                 94m
    nvidia-mig-manager                                     0        0        0      0           0          nvidia.com/gpu.deploy.mig-manager=true            94m
    nvidia-operator-validator                              1        1        0      1           0          nvidia.com/gpu.deploy.operator-validator=true     94m

  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
    k describe po nvidia-driver-daemonset-fwcvl
    Name: nvidia-driver-daemonset-fwcvl
    Namespace: gpu-operator
    Priority: 2000001000
    Priority Class Name: system-node-critical
    Service Account: nvidia-driver
    Node: lab-worker-4/172.21.1.70
    Start Time: Thu, 23 Nov 2023 11:26:21 -0500
    Labels: app=nvidia-driver-daemonset
    app.kubernetes.io/component=nvidia-driver
    app.kubernetes.io/managed-by=gpu-operator
    controller-revision-hash=5954d75477
    helm.sh/chart=gpu-operator-v23.9.0
    nvidia.com/precompiled=false
    pod-template-generation=1
    Annotations: cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
    cni.projectcalico.org/podIP: 192.168.148.114/32
    cni.projectcalico.org/podIPs: 192.168.148.114/32
    kubectl.kubernetes.io/default-container: nvidia-driver-ctr
    Status: Running
    IP: 192.168.148.114
    IPs:
    IP: 192.168.148.114
    Controlled By: DaemonSet/nvidia-driver-daemonset
    Init Containers:
    k8s-driver-manager:
    Container ID: cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
    Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
    Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port: <none>
    Host Port: <none>
    Command:
    driver-manager
    Args:
    uninstall_driver
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Thu, 23 Nov 2023 11:26:22 -0500
    Finished: Thu, 23 Nov 2023 11:26:54 -0500
    Ready: True
    Restart Count: 0
    Environment:
    NODE_NAME: (v1:spec.nodeName)
    NVIDIA_VISIBLE_DEVICES: void
    ENABLE_GPU_POD_EVICTION: true
    ENABLE_AUTO_DRAIN: false
    DRAIN_USE_FORCE: false
    DRAIN_POD_SELECTOR_LABEL:
    DRAIN_TIMEOUT_SECONDS: 0s
    DRAIN_DELETE_EMPTYDIR_DATA: false
    OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
    Mounts:
    /host from host-root (ro)
    /run/nvidia from run-nvidia (rw)
    /sys from host-sys (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Containers:
    nvidia-driver-ctr:
    Container ID: cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
    Image: nvcr.io/nvidia/driver:525.125.06-rhel8.6
    Image ID: nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
    Port: <none>
    Host Port: <none>
    Command:
    nvidia-driver
    Args:
    init
    State: Waiting
    Reason: CrashLoopBackOff
    Last State: Terminated
    Reason: Error
    Exit Code: 1
    Started: Thu, 23 Nov 2023 12:49:50 -0500
    Finished: Thu, 23 Nov 2023 12:50:24 -0500
    Ready: False
    Restart Count: 19
    Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment: <none>
    Mounts:
    /dev/log from dev-log (rw)
    /host-etc/os-release from host-os-release (ro)
    /run/mellanox/drivers from run-mellanox-drivers (rw)
    /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
    /run/nvidia from run-nvidia (rw)
    /run/nvidia-topologyd from run-nvidia-topologyd (rw)
    /var/log from var-log (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Conditions:
    Type Status
    Initialized True
    Ready False
    ContainersReady False
    PodScheduled True
    Volumes:
    run-nvidia:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia
    HostPathType: DirectoryOrCreate
    var-log:
    Type: HostPath (bare host directory volume)
    Path: /var/log
    HostPathType:
    dev-log:
    Type: HostPath (bare host directory volume)
    Path: /dev/log
    HostPathType:
    host-os-release:
    Type: HostPath (bare host directory volume)
    Path: /etc/os-release
    HostPathType:
    run-nvidia-topologyd:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia-topologyd
    HostPathType: DirectoryOrCreate
    mlnx-ofed-usr-src:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers/usr/src
    HostPathType: DirectoryOrCreate
    run-mellanox-drivers:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers
    HostPathType: DirectoryOrCreate
    run-nvidia-validations:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia/validations
    HostPathType: DirectoryOrCreate
    host-root:
    Type: HostPath (bare host directory volume)
    Path: /
    HostPathType:
    host-sys:
    Type: HostPath (bare host directory volume)
    Path: /sys
    HostPathType: Directory
    kube-api-access-qphz2:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
    QoS Class: BestEffort
    Node-Selectors: nvidia.com/gpu.deploy.driver=true
    Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
    node.kubernetes.io/not-ready:NoExecute op=Exists
    node.kubernetes.io/pid-pressure:NoSchedule op=Exists
    node.kubernetes.io/unreachable:NoExecute op=Exists
    node.kubernetes.io/unschedulable:NoSchedule op=Exists
    nvidia.com/gpu:NoSchedule op=Exists
    Events:
    Type     Reason   Age                    From     Message
    Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
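
The driver container log above names the repo that `yum -q makecache` could not reach. When several repos fail, it can help to pull the repo IDs out of the log mechanically. The following is only a minimal sketch, assuming a POSIX shell with `sed` and `sort`; the function name `failed_repos` is hypothetical and is not part of the GPU Operator or of yum.

```shell
#!/bin/sh
# failed_repos (hypothetical helper): read log text on stdin and print each
# repo ID that yum/dnf reported as unreachable, one per line, de-duplicated.
failed_repos() {
    sed -n "s/.*Failed to download metadata for repo '\([^']*\)'.*/\1/p" | sort -u
}

# Example using the error line from the log above.
printf '%s\n' \
    "Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml" \
    | failed_repos
```

Against the live pod, the same function could be fed with `kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr | failed_repos`. Running `yum -q makecache` directly on the node (lab-worker-4) would then show whether the node itself can reach that repo.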

Any help with this issue would be very much appreciated.
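
For anyone triaging a similar report, the pod listing can be reduced to just the pods that are not fully ready. This is a rough sketch, assuming rows in the format printed by `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); the function name `not_ready_pods` is hypothetical.

```shell
#!/bin/sh
# not_ready_pods (hypothetical helper): read `kubectl get pods --no-headers`
# rows on stdin and print NAME and STATUS for every pod whose READY column
# (ready/total) shows fewer ready containers than total containers.
# Note: pods that finished normally (e.g. 0/1 Completed) are listed as well.
not_ready_pods() {
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}

# Example using three rows from the listing in this report.
not_ready_pods <<'EOF'
gpu-operator-68d85f45d-v97fz 1/1 Running 0 93m
nvidia-driver-daemonset-fwcvl 0/1 CrashLoopBackOff 19 (3m20s ago) 87m
nvidia-operator-validator-vcztn 0/1 Init:0/4 0 86m
EOF
```

With cluster access, the equivalent would be `kubectl get pods -n gpu-operator --no-headers | not_ready_pods`. In this report, the driver pod is the only one in CrashLoopBackOff; the other not-ready pods are still in their init phase.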
