gpu-operator does not have permissions to create 'GPUDriverUpgrade' events #1101

@ein-stein-chen

Description

The gpu-operator logs the following error:

E1104 12:53:42.412741       1 event.go:359] "Server rejected event (will not retry!)" err="events is forbidden: User \"system:serviceaccount:gpu-operator:gpu-operator\" cannot create resource \"events\" in API group \"\" in the namespace \"default\"" event="&Event{ObjectMeta:{node.1804c50e638a62b9  default    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:node,UID:6df7a803-6a84-4c8b-9461-d57f417348f3,APIVersion:v1,ResourceVersion:941906333,FieldPath:,},Reason:GPUDriverUpgrade,Message:Successfully updated node state label to [upgrade-required]%!(EXTRA <nil>),Source:EventSource{Component:nvidia-gpu-operator,Host:,},FirstTimestamp:2024-11-04 12:53:42.407340729 +0000 UTC m=+30.041827085,LastTimestamp:2024-11-04 12:53:42.407340729 +0000 UTC m=+30.041827085,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:nvidia-gpu-operator,ReportingInstance:,}"

As a result, kubectl get events --sort-by='.lastTimestamp' | grep GPUDriverUpgrade (the command from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html#metrics-and-events) does not return any events.

Workaround:

Adding the following rule to the gpu-operator ClusterRole resolves the error:

- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - '*'

NOTE: The rule has to go into the ClusterRole because the events are created in the default namespace, while the gpu-operator (and its ServiceAccount) is installed in the gpu-operator namespace.
Granting '*' on events cluster-wide is a rather broad approach. A narrower option would probably be a separate gpu-operator Role (+ RoleBinding) in the default namespace that grants only the necessary permissions.
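
The narrower Role + RoleBinding option could look roughly like the sketch below. It is untested against the gpu-operator Helm chart: the name gpu-operator-events is hypothetical, and the verb list (create, patch, update) is an assumption based on what Kubernetes event recorders typically need, not something confirmed for this operator. The ServiceAccount name and namespace are taken from the error message above.

```yaml
# Sketch only: the resource name and verb list are assumptions,
# not taken from the gpu-operator chart.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-operator-events   # hypothetical name
  namespace: default          # namespace named in the rejected event
rules:
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
  - update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-operator-events   # hypothetical name
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gpu-operator-events
subjects:
# ServiceAccount from the log: system:serviceaccount:gpu-operator:gpu-operator
- kind: ServiceAccount
  name: gpu-operator
  namespace: gpu-operator
```

After applying it, kubectl auth can-i create events --as=system:serviceaccount:gpu-operator:gpu-operator -n default should show whether the permission is in place. A manifest like this lives outside the Helm release, so it would have to be maintained separately from the chart.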

Additional information:

  • gpu-operator version: 24.9.0

Labels: bug
