The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
1. Issue or feature description
Some of our nodes have specific taints. The DaemonSets can be configured with tolerations, but as far as I can see the validator pods don't support them.
Can I have some help with this please?
2. Steps to reproduce the issue
values.yaml config
daemonsets:
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: dedicated
      operator: Equal
      effect: NoExecute
      value: customValue100
Node taints:
taints:
  - effect: NoExecute
    key: dedicated
    value: customValue100
Node labels:
nvidia.com/gpu.deploy.container-toolkit: "true"
nvidia.com/gpu.deploy.dcgm: "true"
nvidia.com/gpu.deploy.dcgm-exporter: "true"
nvidia.com/gpu.deploy.device-plugin: "true"
nvidia.com/gpu.deploy.driver: "false"
nvidia.com/gpu.deploy.gpu-feature-discovery: "true"
nvidia.com/gpu.deploy.node-status-exporter: "true"
nvidia.com/gpu.deploy.operator-validator: "true"
nvidia.com/gpu.present: "true"
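To make the mismatch concrete, here is a minimal sketch of how I understand Kubernetes matches tolerations against taints, applied to the values above. The `Taint` and `Toleration` classes and the `untolerated` helper are hypothetical names for illustration, not part of any Kubernetes or GPU Operator API, and the empty toleration list for the validator pod is my assumption based on the eviction I am seeing, not something I confirmed from the pod spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes defaults the operator to Equal
    effect: str = ""         # an empty effect matches every effect
    value: str = ""


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this toleration matches this taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    if toleration.operator == "Equal":
        return toleration.value == taint.value
    return False


def untolerated(taints, tolerations):
    """Return the taints that none of the tolerations match."""
    return [t for t in taints if not any(tolerates(tol, t) for tol in tolerations)]


# Taint and tolerations copied from the report above.
node_taints = [Taint(key="dedicated", effect="NoExecute", value="customValue100")]
daemonset_tolerations = [
    Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule"),
    Toleration(key="dedicated", operator="Equal", effect="NoExecute",
               value="customValue100"),
]
# Assumption: the validator pod does not receive the tolerations above.
validator_tolerations = []

print(untolerated(node_taints, daemonset_tolerations))
print(untolerated(node_taints, validator_tolerations))
```

If the assumption holds, the first call returns no untolerated taints while the second still reports the `dedicated` NoExecute taint, which would be consistent with the validator pod being evicted shortly after it is created.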
3. Information to attach (optional if deemed irrelevant)
time="2022-06-23T14:56:38Z" level=info msg="pod nvidia-cuda-validator-lb52p is curently in Pending phase"
time="2022-06-23T14:56:44Z" level=info msg="Error: error validating cuda workload: failed to get pod nvidia-cuda-validator-lb52p, err pods \"nvidia-cuda-validator-lb52p\" not found"
0s Normal TaintManagerEviction pod/nvidia-cuda-validator-28cfx Marking for deletion Pod gpu-operator/nvidia-cuda-validator-lb52p