Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

@cgwalters

See https://bugzilla.redhat.com/show_bug.cgi?id=1850057

(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )

OS upgrade I/O competes with etcd

Currently the MCD does:

  • drain
  • apply updates
  • reboot

Now "drain" keeps both daemonsets and static pods running, and etcd is a static pod today. Applying OS updates can generate a lot of I/O, which (reportedly) competes with etcd.

We have two options:

  1. kill etcd after draining (or even more strongly, stop kubelet)
  2. "stage" updates gracefully while everything is running

I like option 2 better because we've put a whole lot of work into making the ostree stack support this "stage updates while the system is running" model, and it'd be cool if OpenShift used it 😄 Another way to say this: I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running, the better!
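To make the "time window" argument concrete, here is a rough sketch that models the two options as ordered steps and adds up how long etcd would be down in each. The step names, the ordering chosen for option 2 (stage first, then drain), and all durations are illustrative assumptions, not measurements or the MCD's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    seconds: float      # placeholder duration, NOT a measurement
    etcd_running: bool  # is the etcd static pod still up during this step?


def etcd_missing_window(steps):
    """Total seconds spent in steps where this node's etcd member is down."""
    return sum(s.seconds for s in steps if not s.etcd_running)


# Option 1: drain, kill etcd, then apply updates with no I/O competition.
OPTION_1 = [
    Step("drain", 60, etcd_running=True),
    Step("apply updates", 300, etcd_running=False),
    Step("reboot", 120, etcd_running=False),
]

# Option 2: stage updates while everything is running, then drain and reboot.
# Staging is assumed to be slower here because it is throttled.
OPTION_2 = [
    Step("stage updates", 450, etcd_running=True),
    Step("drain", 60, etcd_running=True),
    Step("reboot", 120, etcd_running=False),
]

if __name__ == "__main__":
    for label, steps in (("option 1", OPTION_1), ("option 2", OPTION_2)):
        print(f"{label}: etcd member missing for {etcd_missing_window(steps)}s")
```

With these made-up numbers, option 2 takes longer overall but only loses the etcd member for the reboot itself, which is the trade-off argued for above.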

Links

Workboard:

Pull request in progress: #1957

Example failing jobs:

Note that both of those jobs jumped from RHEL 8.1 to 8.2.

Prometheus queries

histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) 
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
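
If you want to pull these numbers from a script rather than the console, the queries can be sent to the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`). The following is a minimal sketch; the base URL is a hypothetical placeholder, and an OpenShift cluster would most likely also need a bearer token, which is left out here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The two queries from this issue, keyed by short names chosen for this sketch.
QUERIES = {
    "wal_fsync_p99": (
        "histogram_quantile(0.99, sum(irate("
        "etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))"
    ),
    "backend_commit_p99": (
        "histogram_quantile(0.99, sum(irate("
        "etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))"
    ),
}


def instant_query_url(base_url, expr):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def p99_by_instance(payload):
    """Map instance label -> value (seconds) from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        sample["metric"].get("instance", "<unknown>"): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def fetch(base_url, expr):
    """Run one query against a reachable, unauthenticated Prometheus (assumption)."""
    with urlopen(instant_query_url(base_url, expr)) as resp:
        return p99_by_instance(json.load(resp))


if __name__ == "__main__":
    base = "http://prometheus.example:9090"  # hypothetical address
    for name, expr in QUERIES.items():
        print(name, fetch(base, expr))
```

Comparing the per-instance values before and during an OS update should show whether the update I/O lines up with slower WAL fsyncs or backend commits.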

Labels: lifecycle/rotten (denotes an issue or PR that has aged beyond stale and will be auto-closed)
