Bug 1850057: stage OS updates (nicely) while etcd is still running

See https://bugzilla.redhat.com/show_bug.cgi?id=1850057

(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )

# OS upgrade I/O competes with etcd

Currently the MCD does:

- drain
- apply updates
- reboot

Now "drain" keeps both daemonsets and static pods running.  Of those two, etcd is a static pod today.  When we're applying OS updates, that can be a lot of I/O and (reportedly) compete with etcd.

We have two options:

1. kill etcd after draining (or even more strongly, stop kubelet)
1. "stage" updates gracefully while everything is running

I like option 2) better because we've put a *whole lot* of work into making the ostree stack support this "stage updates while system is running" and it'd be cool if OpenShift used it :smile:   Another way to say this is - I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running the better!

## Links

Workboard: 

Pull request *in progress*: https://github.com/openshift/machine-config-operator/pull/1957

### Example failing jobs:

- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1275367195391561728
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci/1275416807540264960

Note that both of those jobs jumped from RHEL 8.1 to 8.2.

### Prometheus queries

```
histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le)) 
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
```

---


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

OS upgrade I/O competes with etcd

Links

Example failing jobs:

Prometheus queries

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Bug 1850057: stage OS updates (nicely) while etcd is still running #1897

Description

OS upgrade I/O competes with etcd

Links

Example failing jobs:

Prometheus queries

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions