The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
Are i2c_core and ipmi_msghandler loaded on the nodes? yes
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? yes
1. Issue or feature description
To fix the issue of all GPUs being visible in pods that have no GPUs allocated when the env CUDA_VISIBLE_DEVICES=all is set,
we set
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true
in /etc/nvidia-container-runtime/config.toml
and changed the device plugin accordingly.
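For anyone trying to confirm the same setup on a node, the check below is a minimal sketch. It only looks for the two settings quoted above in the config file; `check_runtime_config` is a hypothetical helper name, and the default path is the one mentioned in this report. It does not verify the device plugin side.

```shell
#!/bin/sh
# Sketch: verify that both settings from the report are present in the
# NVIDIA container runtime config. check_runtime_config is a hypothetical name.

check_runtime_config() {
    config_file="${1:-/etc/nvidia-container-runtime/config.toml}"
    status=0
    # Each entry is "key=expected value", taken from the report above.
    for entry in \
        "accept-nvidia-visible-devices-envvar-when-unprivileged=false" \
        "accept-nvidia-visible-devices-as-volume-mounts=true"
    do
        key="${entry%%=*}"
        expected="${entry#*=}"
        # Match an uncommented "key = value" line, tolerating spaces around "=".
        if grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*${expected}[[:space:]]*(#.*)?$" "$config_file"; then
            echo "ok: ${key} = ${expected}"
        else
            echo "missing or different: ${key} (expected ${expected})"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_runtime_config` with no argument reads the default path; passing a path as the first argument checks a copy of the file instead. The function returns a non-zero status if either setting is absent or has a different value.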
Everything seems to work fine except for the CUDA validator, which has
securityContext:
  allowPrivilegeEscalation: false
set for both the init container and the container itself.
In our current setup, the securityContext of containers without GPU resources needs to be set to 'privileged: true' for them to work properly.
The Helm chart doesn't seem to have any option to change this. Is this the proper way to do it?
Currently we set the node label to false by running the following command:
kubectl label nodes giops2 nvidia.com/gpu.deploy.operator-validator=false --overwrite
Is this an acceptable workaround?
Also, is there a way to change the default value of the .deploy node labels, for example in the values.yaml file?
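The workaround can be wrapped so the label is applied and then checked in one place. This is only a sketch of what we run by hand: the function names and the `KUBECTL` override variable are assumptions made for illustration, while the label key and the `kubectl label` command are the ones shown above.

```shell
#!/bin/sh
# Sketch of the labeling workaround. KUBECTL can be overridden
# (for example KUBECTL=echo) to print the commands instead of running them;
# that variable and both function names are assumptions of this sketch.
KUBECTL="${KUBECTL:-kubectl}"
LABEL_KEY="nvidia.com/gpu.deploy.operator-validator"

# Set the validator deploy label to false on one node.
disable_validator() {
    node="$1"
    $KUBECTL label nodes "$node" "${LABEL_KEY}=false" --overwrite
}

# List all nodes with an extra column showing the current label value.
show_validator_label() {
    $KUBECTL get nodes -L "$LABEL_KEY"
}
```

On our node this would be `disable_validator giops2` followed by `show_validator_label`. It does not answer whether the operator keeps the label at false over time, which is part of the question above.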
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods --all-namespaces
kubernetes daemonset status:
kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state
kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state
kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine:
docker run -it alpine echo foo
Docker configuration file:
cat /etc/docker/daemon.json
Docker runtime configuration:
docker info | grep runtime
NVIDIA shared directory:
ls -la /run/nvidia
NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory:
ls -la /run/nvidia/driver
kubelet logs
journalctl -u kubelet > kubelet.logs
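The commands in this section could be gathered in one pass with something like the sketch below. `save_output`, `collect_diagnostics`, and the output file names are hypothetical. The two per-pod commands are left out because they need a real NAMESPACE and POD_NAME, and `-it` is dropped from `docker run` on the assumption that no terminal is attached while output is redirected.

```shell
#!/bin/sh
# save_output NAME COMMAND...: run one command and store stdout and stderr in
# "$OUT_DIR/NAME.txt". A failing command is noted in the file instead of aborting.
save_output() {
    name="$1"
    shift
    "$@" > "$OUT_DIR/$name.txt" 2>&1 || echo "exit status $? from: $*" >> "$OUT_DIR/$name.txt"
}

# collect_diagnostics [DIR]: run the template's commands, one output file each.
collect_diagnostics() {
    OUT_DIR="${1:-gpu-operator-diagnostics}"
    mkdir -p "$OUT_DIR" || return 1

    save_output pods            kubectl get pods --all-namespaces
    save_output daemonsets      kubectl get ds --all-namespaces
    # -it is omitted: output is redirected, so no terminal is available.
    save_output docker-run      docker run alpine echo foo
    save_output docker-daemon   cat /etc/docker/daemon.json
    save_output docker-runtime  sh -c 'docker info | grep runtime'
    save_output run-nvidia      ls -la /run/nvidia
    save_output nvidia-toolkit  ls -la /usr/local/nvidia/toolkit
    save_output nvidia-driver   ls -la /run/nvidia/driver
    save_output kubelet         journalctl -u kubelet
}
```

Running `collect_diagnostics ./diag` on the GPU node would leave one text file per command in `./diag`, ready to attach to the issue.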