cuda-workload-validator CrashLoopBackOff #365

_The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense._

### 1. Quick Debug Checklist
- [ ] Are you running on an Ubuntu 18.04 node? 20.04
- [ ] Are you running Kubernetes v1.13+? 1.21
- [ ] Are you running Docker (>= 18.06) or CRIO (>= 1.13+)? 21.10.14
- [ ] Do you have `i2c_core` and `ipmi_msghandler` loaded on the nodes? yes
- [ ] Did you apply the CRD (`kubectl describe clusterpolicies --all-namespaces`) yes

### 1. Issue or feature description
To stop all GPUs from being visible in pods that have no GPUs allocated but run with the env `NVIDIA_VISIBLE_DEVICES=all`, we set
`accept-nvidia-visible-devices-envvar-when-unprivileged = false` and
`accept-nvidia-visible-devices-as-volume-mounts = true`
in `/etc/nvidia-container-runtime/config.toml`

and changed the device plugin accordingly.
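
For anyone trying to reproduce this setup, here is a minimal sketch of how the two runtime options could be applied. It edits a temporary sample file, not the real `/etc/nvidia-container-runtime/config.toml`. The sample contents and the helper name `set_toml_key` are assumptions made for illustration, and the matching device plugin change is not shown.

```shell
#!/bin/sh
# Sketch only: works on a throwaway sample file. The sample contents below
# are an assumption, not a copy of any particular config.toml.
set -eu

# set_toml_key FILE KEY VALUE (hypothetical helper)
# Rewrites a top-level "KEY = ..." line, commented out or not. If the key is
# missing, it is added at the top so it does not land inside a [table].
set_toml_key() {
    file=$1; key=$2; value=$3
    if grep -Eq "^#?[[:space:]]*${key}[[:space:]]*=" "$file"; then
        sed -i.bak -E "s|^#?[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
    else
        { printf '%s = %s\n' "$key" "$value"; cat "$file"; } > "$file.tmp"
        mv "$file.tmp" "$file"
    fi
}

demo=$(mktemp)
trap 'rm -f "$demo" "$demo.bak"' EXIT

cat > "$demo" <<'EOF'
disable-require = false
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
environment = []
EOF

set_toml_key "$demo" accept-nvidia-visible-devices-envvar-when-unprivileged false
set_toml_key "$demo" accept-nvidia-visible-devices-as-volume-mounts true

# Show the resulting top-level options.
grep '^accept-nvidia-visible-devices' "$demo"
```

On a real node the same helper would be pointed at the actual config file, with a backup taken first.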

Everything seems to work fine except for the CUDA validator, which has `allowPrivilegeEscalation: false` set in the `securityContext` of both the init container and the container itself.

With our current settings, the `securityContext` of containers without GPU resources needs to be set to `privileged: true` for them to work properly.
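
To make the requirement above concrete, here is a rough sketch that generates a pod manifest with no GPU resource request, `NVIDIA_VISIBLE_DEVICES=all` (the variable the runtime options refer to), and `privileged: true`. The pod name, container name, and image are placeholders, not values from our cluster. The manifest also assumes the NVIDIA runtime is the node's default runtime.

```shell
#!/bin/sh
# Sketch only: POD_NAME and IMAGE are hypothetical placeholders.
set -eu

POD_NAME=${POD_NAME:-gpu-visibility-check}
IMAGE=${IMAGE:-example.invalid/cuda-base:tag}   # placeholder, not a real image

manifest=$(mktemp)
trap 'rm -f "$manifest"' EXIT

cat > "$manifest" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${POD_NAME}
spec:
  restartPolicy: Never
  containers:
    - name: check
      image: ${IMAGE}
      command: ["nvidia-smi", "-L"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
      securityContext:
        privileged: true
      # Deliberately no GPU entry under resources.limits.
EOF

# Print the manifest for review; apply it manually once it looks right:
#   kubectl apply -f <file>
cat "$manifest"
```

As far as I know, Kubernetes rejects a container spec that combines `privileged: true` with `allowPrivilegeEscalation: false`, so the validator's current `securityContext` could not simply be extended with `privileged: true`.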

The Helm chart doesn't seem to have any option to change this. Is this the proper way to do it?

Currently we set the node label to false by running the following command:
`kubectl label nodes giops2 nvidia.com/gpu.deploy.operator-validator=false --overwrite`
Is this an acceptable workaround?
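
For reference, the workaround above can be wrapped in a small script so it is easy to repeat per node and to check afterwards. This is only a sketch. The function names are made up for illustration, and `KUBECTL` is an override I added so the commands can be previewed with `KUBECTL=echo`.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical. Nothing runs until a function is called.
set -eu

KUBECTL=${KUBECTL:-kubectl}
LABEL=nvidia.com/gpu.deploy.operator-validator

# set_validator_label NODE STATE
# Same command as in the issue, with the node and value as parameters.
set_validator_label() {
    node=$1; state=$2
    $KUBECTL label nodes "$node" "${LABEL}=${state}" --overwrite
}

# show_validator_label
# Lists all nodes with the label value as an extra column.
show_validator_label() {
    $KUBECTL get nodes -L "$LABEL"
}
```

Usage would be `set_validator_label giops2 false` followed by `show_validator_label`.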

Also, is there a way to change the default value of the `.deploy` node labels, in the `values.yaml` file or similar?


### 2. Steps to reproduce the issue

### 3. Information to [attach](https://help.github.com/articles/file-attachments-on-issues-and-pull-requests/) (optional if deemed irrelevant)

 - [ ] kubernetes pods status: `kubectl get pods --all-namespaces`
 - [ ] kubernetes daemonset status: `kubectl get ds --all-namespaces`
 - [ ] If a pod/ds is in an error state or pending state `kubectl describe pod -n NAMESPACE POD_NAME`
 - [ ] If a pod/ds is in an error state or pending state `kubectl logs -n NAMESPACE POD_NAME`

 - [ ] Output of running a container on the GPU machine: `docker run -it alpine echo foo`
 - [ ] Docker configuration file: `cat /etc/docker/daemon.json`
 - [ ] Docker runtime configuration: `docker info | grep runtime`

 - [ ] NVIDIA shared directory: `ls -la /run/nvidia`
 - [ ] NVIDIA packages directory: `ls -la /usr/local/nvidia/toolkit`
 - [ ] NVIDIA driver directory: `ls -la /run/nvidia/driver`
 - [ ] kubelet logs `journalctl -u kubelet > kubelet.logs`

