LeaseDuration for GPU-Operator seems to be rather small, causing Operator restarts when running etcd defrag #326

### Description of problem:
Running an `etcd` defrag on OpenShift Container Platform 4 with an `etcd` database larger than 6 GB causes the GPU-Operator to fail and restart because it cannot renew its leader-election lease.

A disruption of about 30 seconds is expected on the `etcd` side during the defrag activity, so the Operator needs to tolerate an outage of that length instead of restarting.

In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, and the code comments explain why this value was chosen.

In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and therefore leaves little tolerance for failures.

This is especially important in environments with large `etcd` databases: restarting the Operator causes all CSVs to be updated with the Operator status, which produces a spike in `etcd` database size instead of actually reducing it.

### Version-Release number of selected component (if applicable):

 - OpenShift Container Platform 4.x

### How reproducible:

 - Always

### Steps to Reproduce:
1. Set up OpenShift Container Platform 4 with the GPU-Operator installed
2. Load `etcd` until the database size reaches 6 GB or more
3. Run the `etcd` defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
4. Watch the GPU-Operator restart and cause the CSVs to be updated

### Actual results:
The GPU-Operator restarts due to lease renewal failure.

### Expected results:

The GPU-Operator should be more fault tolerant and should not fail when `etcd` is unavailable for a short period of time due to defrag activity.

### Additional info:
Similar issues were raised with other Operators as well, such as https://bugzilla.redhat.com/show_bug.cgi?id=2058256

