E1104 12:53:42.412741 1 event.go:359] "Server rejected event (will not retry!)" err="events is forbidden: User \"system:serviceaccount:gpu-operator:gpu-operator\" cannot create resource \"events\" in API group \"\" in the namespace \"default\"" event="&Event{ObjectMeta:{node.1804c50e638a62b9 default 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Node,Namespace:,Name:node,UID:6df7a803-6a84-4c8b-9461-d57f417348f3,APIVersion:v1,ResourceVersion:941906333,FieldPath:,},Reason:GPUDriverUpgrade,Message:Successfully updated node state label to [upgrade-required]%!(EXTRA <nil>),Source:EventSource{Component:nvidia-gpu-operator,Host:,},FirstTimestamp:2024-11-04 12:53:42.407340729 +0000 UTC m=+30.041827085,LastTimestamp:2024-11-04 12:53:42.407340729 +0000 UTC m=+30.041827085,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:nvidia-gpu-operator,ReportingInstance:,}"
The `gpu-operator` logs the error shown above, and `kubectl get events --sort-by='.lastTimestamp' | grep GPUDriverUpgrade` (from https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-driver-upgrades.html#metrics-and-events) does not return any events.

Workaround:
Add a rule to the `gpu-operator` `ClusterRole` that grants the missing permission on `events`.

NOTE: This needs to be added to the `ClusterRole`, as the events are created in the `default` namespace, while the `gpu-operator` is installed (and the `ServiceAccount` located) in the `gpu-operator` namespace.

This is a rather broad approach. Another option would probably be to create a separate `gpu-operator` `Role` (+ `RoleBinding`) with the necessary permissions in the `default` namespace.

Additional information:
`gpu-operator` version: 24.9.0
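
For the narrower alternative mentioned in the workaround, a minimal sketch could look like the manifest below. It is an illustration, not the snippet from the original report: the object name `gpu-operator-events` is hypothetical, and the `patch` verb is an assumption (the error message only names `create`). The `ServiceAccount` name and both namespaces are taken from the error message. The same `rules` entry could presumably be appended to the existing `gpu-operator` `ClusterRole` instead, which is the broader variant.

```yaml
# Sketch only: grants the gpu-operator ServiceAccount permission to write
# events in the "default" namespace. Object names are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: gpu-operator-events      # hypothetical name
  namespace: default             # namespace where the rejected event was created
rules:
  - apiGroups: [""]              # core API group, as named in the error
    resources: ["events"]
    verbs: ["create", "patch"]   # "create" is from the error; "patch" is an assumption
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: gpu-operator-events      # hypothetical name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: gpu-operator           # from system:serviceaccount:gpu-operator:gpu-operator
    namespace: gpu-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: gpu-operator-events
```

After applying it, `kubectl auth can-i create events --as=system:serviceaccount:gpu-operator:gpu-operator -n default` should indicate whether the permission is in effect.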