Description
See https://bugzilla.redhat.com/show_bug.cgi?id=1850057
(This content is canonically stored at https://github.com/cgwalters/workboard/tree/master/openshift/bz1850057-etcd-osupdate )
OS upgrade I/O competes with etcd
Currently the MCD (machine-config-daemon) does the following, in order:
- drain
- apply updates
- reboot
Note that "drain" leaves both daemonsets and static pods running, and etcd is a static pod today. Applying OS updates can generate a lot of I/O, which (reportedly) competes with etcd.
We have two options:
- kill etcd after draining (or even more strongly, stop kubelet)
- "stage" updates gracefully while everything is running
I like option 2 better because we've put a whole lot of work into making the ostree stack support staging updates while the system is running, and it'd be cool if OpenShift used it 😄 Put another way: I think we want to minimize the time window in which the etcd cluster is missing a member, so the more work we can do while etcd is still running, the better!
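To make the "time window" argument concrete, here is a rough sketch that models the three orderings as lists of steps and sums the time during which etcd would not be running. The step durations are made-up placeholders, not measurements, and the names `DURATIONS`, `FLOWS` and `etcd_down_seconds` are purely illustrative; none of this is MCD code.

```python
# Hypothetical per-step durations in seconds. Placeholders only, not measured.
DURATIONS = {"drain": 60, "apply_updates": 300, "reboot": 120}

# Each flow is an ordered list of (step, etcd_running_during_step).
FLOWS = {
    # Today: drain, apply updates with etcd still running, then reboot.
    "current": [("drain", True), ("apply_updates", True), ("reboot", False)],
    # Option 1: stop etcd after draining, so updates are applied without it.
    "option1": [("drain", True), ("apply_updates", False), ("reboot", False)],
    # Option 2: stage updates while everything is running, then drain and reboot.
    "option2": [("apply_updates", True), ("drain", True), ("reboot", False)],
}


def etcd_down_seconds(flow):
    """Sum the durations of the steps during which etcd is not running."""
    return sum(DURATIONS[step] for step, etcd_running in flow if not etcd_running)


if __name__ == "__main__":
    for name, flow in FLOWS.items():
        print(name, etcd_down_seconds(flow))
```

Under these assumed numbers, option 1 extends the member-missing window by the full update time, while option 2 keeps it at the reboot time alone. Option 2 only helps with the I/O contention if the staging itself is gentle enough on the disk, which this model does not capture.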
Links
Workboard:
Pull request in progress: #1957
Example failing jobs:
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.4-stable-to-4.5-ci/1275367195391561728
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.4-stable-to-4.5-ci/1275416807540264960
Note that both of those jobs jumped from RHEL 8.1 to 8.2.
Prometheus queries
histogram_quantile(0.99, sum(irate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))
histogram_quantile(0.99, sum(irate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))
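If you'd rather pull these numbers from a script than from the console, the following is a minimal sketch against the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The base URL and the bearer token are assumptions you would have to supply for your cluster, and the helper names (`instant_query_url`, `parse_vector`, `run_query`) are made up for this example.

```python
import json
import urllib.request
from urllib.parse import urlencode

# The two queries from this issue, keyed by a short illustrative label.
QUERIES = {
    "wal_fsync_p99": "histogram_quantile(0.99, sum(irate("
    "etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (instance, le))",
    "backend_commit_p99": "histogram_quantile(0.99, sum(irate("
    "etcd_disk_backend_commit_duration_seconds_bucket[5m])) by (instance, le))",
}


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Map each instance label to its sample value from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


def run_query(base_url, promql, token=None):
    """Send the query; `token` is an optional bearer token (assumed auth scheme)."""
    request = urllib.request.Request(instant_query_url(base_url, promql))
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return parse_vector(json.load(response))
```

Usage would look something like `run_query(base_url, QUERIES["wal_fsync_p99"], token)`, which returns one 99th percentile latency in seconds per etcd instance.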