LeaseDuration for GPU-Operator seems to be rather small, causing Operator restarts when running etcd defrag #326

@sreber84

Description of problem:

Running etcd defrag on OpenShift Container Platform 4 with an etcd database larger than 6 GB causes the GPU-Operator to fail and restart because its lease renewal fails.

Since a disruption of about 30 seconds is expected on the etcd side during the defrag activity, the Operator needs to handle this situation and offer better fault tolerance.

In https://github.com/openshift/library-go/blob/4362aa519714a4b62b00ab8318197ba2bba51cb7/pkg/config/leaderelection/leaderelection.go#L104 the value is set to 60 seconds, along with an explanation of why this value was chosen.

In https://github.com/kubernetes-sigs/controller-runtime/blob/v0.11.1/pkg/manager/manager.go#L182-L184 the default value is much smaller and thus leaves little tolerance for failures.

This is especially important in environments with large etcd databases: restarting the Operator causes all CSVs to be updated with the Operator status, which produces a spike in etcd database size instead of actually reducing the overall size.

Version-Release number of selected component (if applicable):

  • OpenShift Container Platform 4.x

How reproducible:

  • Always

Steps to Reproduce:

  1. Set up OpenShift Container Platform 4 with the GPU-Operator
  2. Load etcd with 6 GB or more in database size
  3. Run the etcd defrag activity as per https://docs.openshift.com/container-platform/4.9/post_installation_configuration/cluster-tasks.html#etcd-defrag_post-install-cluster-tasks
  4. Watch the GPU-Operator restart and cause the CSVs to be updated

Actual results:

The GPU-Operator restarts due to lease renewal failure.

Expected results:

The GPU-Operator should have more fault tolerance and therefore not fail when etcd is unavailable for a short period of time due to defrag activity.

Additional info:

Similar issues were raised with other Operators as well, such as https://bugzilla.redhat.com/show_bug.cgi?id=2058256
