Description of problem:
Running an etcd defrag on OpenShift Container Platform 4 with an etcd database larger than 6 GB causes the GPU-Operator to fail and restart because its lease renewal fails.
Since a disruption of about 30 seconds is expected on the etcd side during the defrag activity, the Operator needs to tolerate this kind of outage and offer better fault tolerance.
In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, and the code also explains why this value was chosen.
In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and therefore tolerates far less disruption.
This is especially important in environments with large etcd databases, because restarting the Operator causes all CSVs to be updated with the Operator status, which produces a spike in the etcd database instead of the reduction in overall size that the defrag is meant to achieve.
Version-Release number of selected component (if applicable):
- OpenShift Container Platform 4.x
How reproducible:
Steps to Reproduce:
- Set up OpenShift Container Platform 4 with the GPU-Operator
- Load etcd until the database size is 6 GB or more
- Run the etcd defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
- Watch the GPU-Operator restart and cause the CSVs to be updated
Actual results:
GPU-Operator is restarting due to lease renewal failure
Expected results:
GPU-Operator to have more fault tolerance and therefore not fail when etcd is unavailable for a short period of time due to defrag activity.
Additional info:
Similar issues were raised with other Operators as well, such as https://bugzilla.redhat.com/show_bug.cgi?id=2058256