nvidia-vgpu-driver daemonset always mounts the configmap named licensing-config even if licensingConfig.name is configured #672

@Levi080513

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.8
  • Kernel Version: 4.18.0-477.27.1.el8_8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd v1.6.22
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): v1.25.15
  • GPU Operator Version: 23.9.1

2. Issue or feature description

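The NVIDIADriver CR has licensingConfig.name configured, but the nvidia-vgpu-driver daemonset always mounts the configmap named licensing-config. The expected behavior is that the daemonset mounts the configmap named in licensingConfig.name.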

3. Steps to reproduce the issue

  1. Install GPU Operator with parameter driver.nvidiaDriverCRD.enabled=true.
  2. Apply the following NVIDIADriver CR:
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: vgpu-test
spec:
  driverType: vgpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  nodeSelector:
    cape.infrastructure.cluster.x-k8s.io/node-group: workergroup1
  repository: 10.255.128.144/sks/nvidia
  version: "525.105.17-grid"
  usePrecompiled: true
  licensingConfig:
    name: licensing-config-d41d8cd98f00b204e9800998ecf8427e
    nlsEnabled: true
  3. The nvidia-vgpu-driver daemonset still mounts the configmap named licensing-config instead of licensing-config-d41d8cd98f00b204e9800998ecf8427e.
  4. The driver pod reports the following events:
Events:
  Type     Reason       Age                   From     Message
  ----     ------       ----                  ----     -------
  Warning  FailedMount  7m37s (x33 over 68m)  kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[licensing-config], unattached volumes=[run-nvidia-topologyd k8tz nv-firmware var-log licensing-config host-root mlnx-ofed-usr-src run-nvidia firmware-search-path sysfs-memory-online host-sys kube-api-access-9grcv dev-log host-os-release run-mellanox-drivers]: timed out waiting for the condition
  Warning  FailedMount  3m26s (x44 over 88m)  kubelet  MountVolume.SetUp failed for volume "licensing-config" : configmap "licensing-config" not found

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

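It seems that there is a problem with the driver daemonset manifest template. The configMap name is rendered from {{ .Name }}, which is the volume name, so the name configured in licensingConfig.name appears never to be used: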

{{- if and .AdditionalConfigs .AdditionalConfigs.Volumes }}
  {{- range .AdditionalConfigs.Volumes }}
  - name: {{ .Name }}
    configMap:
      name: {{ .Name }}
      items:
      {{- range .ConfigMap.Items }}
      - key: {{ .Key }}
        path: {{ .Path }}
        {{- if .Mode }}
        mode: {{ .Mode }}
        {{- end }}
      {{- end }}
  {{- end }}
{{- end }}
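
To illustrate the suspected cause, here is a minimal Go sketch that renders a reduced version of the volume template with text/template. The Volume, ConfigMapSource, and KeyToPath types are hypothetical stand-ins, not the operator's actual types, and the "fixed" variant assumes the data passed to the template exposes the ConfigMap name as .ConfigMap.Name. It is a possible direction for a fix, not a confirmed one.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// KeyToPath, ConfigMapSource and Volume are hypothetical stand-ins that only
// model the fields the template touches; they are not the operator's types.
type KeyToPath struct {
	Key  string
	Path string
}

type ConfigMapSource struct {
	Name  string // name of the ConfigMap to mount (assumed to be available)
	Items []KeyToPath
}

type Volume struct {
	Name      string // name of the pod volume
	ConfigMap ConfigMapSource
}

// buggyTmpl mirrors the reported template: the ConfigMap name is taken from
// the volume name.
const buggyTmpl = `{{ range . -}}
- name: {{ .Name }}
  configMap:
    name: {{ .Name }}
{{ end }}`

// fixedTmpl is a possible fix (an assumption): use the ConfigMap's own name.
const fixedTmpl = `{{ range . -}}
- name: {{ .Name }}
  configMap:
    name: {{ .ConfigMap.Name }}
{{ end }}`

// render executes a template string against the given volumes.
func render(tmpl string, vols []Volume) string {
	t := template.Must(template.New("volumes").Parse(tmpl))
	var sb strings.Builder
	if err := t.Execute(&sb, vols); err != nil {
		panic(err)
	}
	return sb.String()
}

func main() {
	vols := []Volume{{
		Name: "licensing-config",
		ConfigMap: ConfigMapSource{
			Name: "licensing-config-d41d8cd98f00b204e9800998ecf8427e",
		},
	}}
	fmt.Print("buggy:\n", render(buggyTmpl, vols))
	fmt.Print("fixed:\n", render(fixedTmpl, vols))
}
```

With these stand-in types, the first rendering references licensing-config as the configmap, which matches the FailedMount event above, while the second references licensing-config-d41d8cd98f00b204e9800998ecf8427e.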

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

Metadata

Labels

bug: Issue/PR to expose/discuss/fix a bug
