repoConfig is not mounted into GDS container #608

@age9990


1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.x
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.0

2. Issue or feature description

  1. The code shows that repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premise repository, which leaves the container in the CrashLoopBackOff state.
    The nvidia-fs-ctr container in the following manifest should include the block below:
    https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml
    {{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
    {{- range .AdditionalConfigs.VolumeMounts }}

  2. In addition, the GDS image name should have the OS info appended, as is already done for the NVIDIA driver pod.
    With the default values.yaml, the image pull fails (ImagePullBackOff) because the image tag is incorrect: the OS is missing, and the tag should be 2.16.1-ubuntu20.04.
    gds:
      version: "2.16.1"
    In the code, the OS is not used to construct imagePath:

    func getGDSSpec(spec *nvidiav1alpha1.NVIDIADriverSpec) (*gdsDriverSpec, error) {

    The driver image path, by contrast, does reference the OS:
    https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
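To illustrate the second point, here is a minimal sketch of how the GDS image path could append the OS tag, roughly mirroring what the issue describes for the driver image. This is not the operator's actual code: the helper name gdsImagePath, its parameters, and the repository and image values used in main are assumptions made for illustration only.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// gdsImagePath is a hypothetical helper (not part of gpu-operator).
// It builds "<repository>/<image>:<version>-<osTag>" so that the tag
// carries the OS suffix, e.g. "2.16.1-ubuntu20.04".
func gdsImagePath(repository, image, version, osTag string) (string, error) {
	if image == "" || version == "" {
		return "", errors.New("image and version must be set")
	}

	name := image
	if repository != "" {
		name = strings.TrimSuffix(repository, "/") + "/" + image
	}

	// A digest pins an exact image, so no OS suffix is appended.
	if strings.HasPrefix(version, "sha256:") {
		return name + "@" + version, nil
	}

	tag := version
	// Append the OS only if the user has not already included it.
	if osTag != "" && !strings.HasSuffix(version, "-"+osTag) {
		tag = version + "-" + osTag
	}
	return name + ":" + tag, nil
}

func main() {
	// Repository and image name are illustrative values, not taken from the issue.
	path, err := gdsImagePath("nvcr.io/nvidia/cloud-native", "nvidia-fs", "2.16.1", "ubuntu20.04")
	if err != nil {
		panic(err)
	}
	fmt.Println(path) // nvcr.io/nvidia/cloud-native/nvidia-fs:2.16.1-ubuntu20.04
}
```

With a helper along these lines, the default version "2.16.1" would resolve to the tag 2.16.1-ubuntu20.04 on an Ubuntu 20.04 node, while a version that already carries the OS suffix, or a digest, would be left unchanged.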

3. Steps to reproduce the issue

Enable GDS; both issues above then reproduce.

@shivamerla Please help resolve these issues so that GDS can be used properly.