The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL8.8
- Kernel Version: 4.18.0-477.27.1.el8_8.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd v1.6.22
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): v1.25.15
- GPU Operator Version: 23.9.1
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
The NVIDIADriver CR sets licensingConfig.name, but the nvidia-vgpu-driver daemonset always mounts a configmap named licensing-config instead of the configured one.
3. Steps to reproduce the issue
Detailed steps to reproduce the issue.
- Install GPU Operator with parameter
driver.nvidiaDriverCRD.enabled=true.
- Apply NVIDIADriver CR
apiVersion: nvidia.com/v1alpha1
kind: NVIDIADriver
metadata:
  name: vgpu-test
spec:
  driverType: vgpu
  env: []
  image: driver
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  manager: {}
  nodeSelector:
    cape.infrastructure.cluster.x-k8s.io/node-group: workergroup1
  repository: 10.255.128.144/sks/nvidia
  version: "525.105.17-grid"
  usePrecompiled: true
  licensingConfig:
    name: licensing-config-d41d8cd98f00b204e9800998ecf8427e
    nlsEnabled: true
- The daemonset for nvidia-vgpu-driver still mounts the configmap named
licensing-config instead of mounting licensing-config-d41d8cd98f00b204e9800998ecf8427e.
- The driver pod reports the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 7m37s (x33 over 68m) kubelet (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[licensing-config], unattached volumes=[run-nvidia-topologyd k8tz nv-firmware var-log licensing-config host-root mlnx-ofed-usr-src run-nvidia firmware-search-path sysfs-memory-online host-sys kube-api-access-9grcv dev-log host-os-release run-mellanox-drivers]: timed out waiting for the condition
Warning FailedMount 3m26s (x44 over 88m) kubelet MountVolume.SetUp failed for volume "licensing-config" : configmap "licensing-config" not found
4. Information to attach (optional if deemed irrelevant)
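It seems that there is a problem with the driver daemonset manifest template. In the excerpt below, configMap.name is rendered from the same {{ .Name }} value as the volume name, which would explain why licensing-config is always used.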
{{- if and .AdditionalConfigs .AdditionalConfigs.Volumes }}
{{- range .AdditionalConfigs.Volumes }}
- name: {{ .Name }}
  configMap:
    name: {{ .Name }}
    items:
    {{- range .ConfigMap.Items }}
    - key: {{ .Key }}
      path: {{ .Path }}
      {{- if .Mode }}
      mode: {{ .Mode }}
      {{- end }}
    {{- end }}
{{- end }}
{{- end }}
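To illustrate the suspected cause, here is a minimal Go sketch that renders a cut-down version of the excerpt with text/template. It assumes the entries in .AdditionalConfigs.Volumes look like Kubernetes volume objects, where .Name is the volume name and .ConfigMap.Name is the ConfigMap name. The Volume and ConfigMapSource types, the sampleVolumes helper, and the "proposed" template are hypothetical stand-ins written for this example, not code taken from the operator, and the proposed change is only one possible fix.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// ConfigMapSource is a hypothetical stand-in for the ConfigMap part of a
// Kubernetes volume; only the field used below is modeled.
type ConfigMapSource struct {
	Name string // name of the ConfigMap object to mount
}

// Volume is a hypothetical stand-in for a pod volume entry.
type Volume struct {
	Name      string // name of the volume inside the pod spec
	ConfigMap *ConfigMapSource
}

// current mirrors the excerpt: configMap.name is filled from the volume name.
const current = `{{- range . }}
- name: {{ .Name }}
  configMap:
    name: {{ .Name }}
{{- end }}
`

// proposed is an assumed fix: take the name from the ConfigMap source instead.
const proposed = `{{- range . }}
- name: {{ .Name }}
  configMap:
    name: {{ .ConfigMap.Name }}
{{- end }}
`

// render executes tmpl against vols and returns the generated text.
func render(tmpl string, vols []Volume) string {
	t := template.Must(template.New("volumes").Parse(tmpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, vols); err != nil {
		panic(err)
	}
	return buf.String()
}

// sampleVolumes builds one volume using the names from this report.
func sampleVolumes() []Volume {
	return []Volume{{
		Name: "licensing-config",
		ConfigMap: &ConfigMapSource{
			Name: "licensing-config-d41d8cd98f00b204e9800998ecf8427e",
		},
	}}
}

func main() {
	vols := sampleVolumes()
	fmt.Print("current template:", render(current, vols))
	fmt.Print("proposed template:", render(proposed, vols))
}
```

With the current template, the rendered configMap.name is licensing-config (the volume name), which matches the FailedMount event above. With the proposed variant it is licensing-config-d41d8cd98f00b204e9800998ecf8427e, the name set in the CR.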
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com
The template excerpt quoted in section 4 comes from gpu-operator/manifests/state-driver/0500_daemonset.yaml, lines 623 to 637 in 30bc55d.