install: Recreate and delayed default ServiceAccount deletion

wking · wking · commit 7530ded862b3 · 2023-08-29T09:28:27.000-07:00
Single-node updates from 4.13 to 4.14 are having slow leader handoffs during operator Deployment updates [1]. For example [2] in Loki: {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection" gives: machine-config-operator-7987b6bcdd-6zvrt I0817 11:26:14.747630 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config... I0817 11:26:14.756376 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config E0817 11:31:15.153289 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config) E0817 11:37:42.204539 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF E0817 11:53:05.221213 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF machine-config-operator-74d7475455-cgcmg I0817 12:21:41.019562 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}. I0817 12:21:41.036459 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config... machine-config-operator-7987b6bcdd-6zvrt E0817 12:21:41.780276 1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator" machine-config-operator-74d7475455-cgcmg I0817 12:27:27.767746 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config I0817 12:35:29.649939 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}. I0817 12:35:29.685549 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config... I0817 12:35:29.696394 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config That shows two lease handoffs: 1. During the update from 4.13 to 4.14, the cluster-version operator bumps the MCO Deployment to roll out the 4.14 image and other manifest changes. a. The outgoing 4.13 is using the default ServiceAccount, because it predates [3,4]. b. The CVO deletes the default ServiceAccount, because its manifest filename sorts it before the MCO Deployment. c. The CVO bumps the MCO Deployment. d. The MCO Deployment defaults to RollingUpdate [5] with a default maxSurge of 25% rounded up [6]. That maxSurge default is very old, dating back to at least the version of Kubernetes vendored by OpenShift 4.1 [7]. e. The Deployment controller surges the incoming pod, and at 12:21:41.036459 it checks to see if the lease is available, but it isn't, because the outgoing pod is still running. f. The Deployment controller TERMs the outgoing containers, and at 12:21:41.780276 the outgoing pod complains that it is no longer authorized to delete the lease ConfigMap, because the default ServiceAccount has been deleted in 1.b. g. The incoming pod acquires the lease 5m46s later at 12:27:27, when the outgoing container's final lease goes stale. 2. When the single node goes down for a reboot to roll out the 4.14 RHCOS, and the new container coming up at 12:35:29 acquires the lease smoothly. This commit improves the deployment rollout with two changes: * Shifting the default ServiceAccount deletion to a 90 manifest moves it behind 0000_80_machine-config-operator_04_deployment.yaml and 0000_80_machine-config-operator_06_clusteroperator.yaml. Moving behind the ClusterOperator ensures that deletion is not initiated until the incoming operator pod has updated the ClusterOperator to confirm the update has completed. Positioning between the Deployment and the ClusterOperator would have put the deletion after the CVO bumped the Deployment, but might still have raced against the outgoing operator pod actually shutting down. This should allow the outgoing pod to gracefully remove the ConfigMap and Lease resources during graceful shutdowns. * Shifting the Deployment to Recreate [8]. But also as discussed in [8]: Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet. Still, Recreate will avoid the current milliseconds of overlap between the incoming pod's first attempt to acquire the lease and the outgoing pod's attempt to release during Deployment rollouts. Other MCO container restarts and drains-for-multi-control-plane-rolls and such may still have racey overlap, but should be limited to the: Worst graceful lease acquisition is {26s}. latency (at least with the current 4.14 single-node lease configuration). [1]: https://issues.redhat.com/browse/OCPBUGS-16905 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688 [3]: https://issues.redhat.com/browse/OCPBUGS-10924 [4]: ace637f [5]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy [6]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge [7]: openshift/api $ git --no-pager grep -B12 maxSurge origin/release-4.1 -- vendor/k8s.io/api/apps/v1/types.go origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // The maximum number of pods that can be scheduled above the desired number of origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // pods. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // This can not be 0 if MaxUnavailable is 0. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Absolute number is calculated from percentage by rounding up. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Defaults to 25%. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // the rolling update starts, such that the total number of old and new pods do not exceed origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // 130% of desired pods. Once old pods have been killed, origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // new ReplicaSet can be scaled up further, ensuring that total number of pods running origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // at any time during the update is at most 130% of desired pods. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // +optional origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go: MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge" [8]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
diff --git a/install/0000_80_machine-config-operator_03_rbac.yaml b/install/0000_80_machine-config-operator_03_rbac.yaml
@@ -71,23 +71,3 @@ subjects:
 - kind: ServiceAccount
   name: prometheus-k8s
   namespace: openshift-monitoring
-# keep default for upgrading purposes, need to delete the existing role: 
-# https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/object-deletion.md
----
-apiVersion: rbac.authorization.k8s.io/v1
-kind: ClusterRoleBinding
-metadata:
-  name: default-account-openshift-machine-config-operator
-  annotations:
-    include.release.openshift.io/ibm-cloud-managed: "true"
-    include.release.openshift.io/self-managed-high-availability: "true"
-    include.release.openshift.io/single-node-developer: "true"
-    release.openshift.io/delete: "true"
-subjects:
-- kind: ServiceAccount
-  name: default
-  namespace: openshift-machine-config-operator
-roleRef: 
-  kind: ClusterRole
-  name: cluster-admin
-  apiGroup: rbac.authorization.k8s.io
diff --git a/install/0000_80_machine-config-operator_04_deployment.yaml b/install/0000_80_machine-config-operator_04_deployment.yaml
@@ -14,6 +14,8 @@ spec:
   selector:
     matchLabels:
       k8s-app: machine-config-operator
+  strategy:
+    type: Recreate
   template:
     metadata:
       labels:
diff --git a/install/0000_90_machine-config-operator_90_deletion.yaml b/install/0000_90_machine-config-operator_90_deletion.yaml
@@ -0,0 +1,19 @@
+# keep default for upgrading purposes, need to delete the existing role:
+# https://github.com/openshift/enhancements/blob/master/dev-guide/cluster-version-operator/dev/object-deletion.md
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: default-account-openshift-machine-config-operator
+  annotations:
+    include.release.openshift.io/ibm-cloud-managed: "true"
+    include.release.openshift.io/self-managed-high-availability: "true"
+    include.release.openshift.io/single-node-developer: "true"
+    release.openshift.io/delete: "true"
+subjects:
+- kind: ServiceAccount
+  name: default
+  namespace: openshift-machine-config-operator
+roleRef:
+  kind: ClusterRole
+  name: cluster-admin
+  apiGroup: rbac.authorization.k8s.io