[release-4.13] OCPBUGS-19400: install: Recreate and delayed default ServiceAccount deletion#3923
Conversation
Single-node updates from 4.13 to 4.14 are having slow leader handoffs
during operator Deployment updates [1]. For example [2] in Loki:
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection"
gives:
machine-config-operator-7987b6bcdd-6zvrt
I0817 11:26:14.747630 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0817 11:26:14.756376 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config
E0817 11:31:15.153289 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config)
E0817 11:37:42.204539 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
E0817 11:53:05.221213 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
machine-config-operator-74d7475455-cgcmg
I0817 12:21:41.019562 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0817 12:21:41.036459 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
machine-config-operator-7987b6bcdd-6zvrt
E0817 12:21:41.780276 1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator"
machine-config-operator-74d7475455-cgcmg
I0817 12:27:27.767746 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
I0817 12:35:29.649939 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0817 12:35:29.685549 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0817 12:35:29.696394 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
That shows two lease handoffs:
1. During the update from 4.13 to 4.14, the cluster-version operator
bumps the MCO Deployment to roll out the 4.14 image and other manifest
changes.
a. The outgoing 4.13 is using the default ServiceAccount, because
it predates [3,4].
b. The CVO deletes the default ServiceAccount, because its manifest
filename sorts it before the MCO Deployment.
c. The CVO bumps the MCO Deployment.
d. The MCO Deployment defaults to RollingUpdate [5] with a default
maxSurge of 25% rounded up [6]. That maxSurge default is very
old, dating back to at least the version of Kubernetes vendored
by OpenShift 4.1 [7].
e. The Deployment controller surges the incoming pod, and at
12:21:41.036459 it checks to see if the lease is available, but
it isn't, because the outgoing pod is still running.
f. The Deployment controller TERMs the outgoing containers, and at
12:21:41.780276 the outgoing pod complains that it is no longer
authorized to delete the lease ConfigMap, because the default
ServiceAccount has been deleted in 1.b.
g. The incoming pod acquires the lease 5m46s later at 12:27:27,
when the outgoing container's final lease goes stale.
2. When the single node goes down for a reboot to roll out the 4.14
RHCOS, and the new container coming up at 12:35:29 acquires the
lease smoothly.
This commit improves the deployment rollout with two changes:
* Shifting the default ServiceAccount deletion to a 90 manifest moves
it behind 0000_80_machine-config-operator_04_deployment.yaml and
0000_80_machine-config-operator_06_clusteroperator.yaml. Moving
behind the ClusterOperator ensures that deletion is not initiated
until the incoming operator pod has updated the ClusterOperator to
confirm the update has completed. Positioning between the
Deployment and the ClusterOperator would have put the deletion after
the CVO bumped the Deployment, but might still have raced against
the outgoing operator pod actually shutting down. This should allow
the outgoing pod to gracefully remove the ConfigMap and Lease
resources during graceful shutdowns.
* Shifting the Deployment to Recreate [8]. But also as discussed in
[8]:
Note: This will only guarantee Pod termination previous to
creation for upgrades. If you upgrade a Deployment, all Pods of
the old revision will be terminated immediately. Successful
removal is awaited before any Pod of the new revision is
created. If you manually delete a Pod, the lifecycle is controlled
by the ReplicaSet and the replacement will be created immediately
(even if the old Pod is still in a Terminating state). If you need
an "at most" guarantee for your Pods, you should consider using a
StatefulSet.
Still, Recreate will avoid the current milliseconds of overlap
between the incoming pod's first attempt to acquire the lease and
the outgoing pod's attempt to release during Deployment rollouts.
Other MCO container restarts and
drains-for-multi-control-plane-rolls and such may still have racey
overlap, but should be limited to the:
Worst graceful lease acquisition is {26s}.
latency (at least with the current 4.14 single-node lease
configuration).
[1]: https://issues.redhat.com/browse/OCPBUGS-16905
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688
[3]: https://issues.redhat.com/browse/OCPBUGS-10924
[4]: openshift@ace637f
[5]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy
[6]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge
[7]: openshift/api $ git --no-pager grep -B12 maxSurge origin/release-4.1 -- vendor/k8s.io/api/apps/v1/types.go
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // The maximum number of pods that can be scheduled above the desired number of
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // pods.
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%).
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // This can not be 0 if MaxUnavailable is 0.
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Absolute number is calculated from percentage by rounding up.
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Defaults to 25%.
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // the rolling update starts, such that the total number of old and new pods do not exceed
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // 130% of desired pods. Once old pods have been killed,
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // new ReplicaSet can be scaled up further, ensuring that total number of pods running
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // at any time during the update is at most 130% of desired pods.
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // +optional
origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go: MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge"
[8]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment
|
@openshift-cherrypick-robot: Jira Issue OCPBUGS-19357 has been cloned as Jira Issue OCPBUGS-19400. Will retitle bug to link to clone. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-19400, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
yuqi-zhang
left a comment
There was a problem hiding this comment.
/lgtm
/label backport-risk-assessed
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: openshift-cherrypick-robot, yuqi-zhang The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
|
@sinnykumari: This pull request references Jira Issue OCPBUGS-19400, which is valid. The bug has been moved to the POST state. 6 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/label cherry-pick-approved |
|
@openshift-cherrypick-robot: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
@openshift-cherrypick-robot: Jira Issue OCPBUGS-19400: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-19400 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
Fix included in accepted release 4.13.0-0.nightly-2023-09-23-231516 |
|
this only cherry picks 2 out of the 3 changes of https://github.com/openshift/machine-config-operator/pull/3920/files it that what you expected @yuqi-zhang ? I think this causes |
ace637f (OCPBUGS-10924: Switch default SA to machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14 machine-config operator to a non-default ServiceAccount and ClusterRoleBinding. But 4.13 and earlier remain on the default ServiceAccount. 1cdb75f (install: Recreate and delayed default ServiceAccount deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic back to 4.13.14 [1] and later (good), but also brought back a 'delete' manifest for the default ClusterRoleBinding, which leads to the 4.13 cluster-version operator fighting with itself over whether that ClusterRoleBinding should exist (it should exist on 4.13). For example, [2] updates from 4.12.36 to 4.13.14, and has: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' 2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator" By dropping the deletion manifest from 4.13, we avoid contention between two manifests, and leave the default ClusterRoleBinding alone until a later update to 4.14 will remove it. [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
ace637f (OCPBUGS-10924: Switch default SA to machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14 machine-config operator to a non-default ServiceAccount and ClusterRoleBinding. But 4.13 and earlier remain on the default ServiceAccount. 1cdb75f (install: Recreate and delayed default ServiceAccount deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic back to 4.13.14 [1] and later (good), but also brought back a 'delete' manifest for the default ClusterRoleBinding, which leads to the 4.13 cluster-version operator fighting with itself over whether that ClusterRoleBinding should exist (it should exist on 4.13) [2]. For example, [3] updates from 4.12.36 to 4.13.14, and has: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' 2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator" By dropping the deletion manifest from 4.13, we avoid contention between two manifests, and leave the default ClusterRoleBinding alone until a later update to 4.14 will remove it. [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14 [2]: https://issues.redhat.com/browse/OCPBUGS-10924 [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
ace637f (OCPBUGS-10924: Switch default SA to machine-config-operator, 2023-06-23, openshift#3740) moved the 4.14 machine-config operator to a non-default ServiceAccount and ClusterRoleBinding. But 4.13 and earlier remain on the default ServiceAccount. 1cdb75f (install: Recreate and delayed default ServiceAccount deletion, 2023-09-19, openshift#3923, OCPBUGS-19400) brought Recreate logic back to 4.13.14 [1] and later (good), but also brought back a 'delete' manifest for the default ClusterRoleBinding, which leads to the 4.13 cluster-version operator fighting with itself over whether that ClusterRoleBinding should exist (it should exist on 4.13) [2]. For example, [3] updates from 4.12.36 to 4.13.14, and has: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904/artifacts/e2e-aws-upgrade/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Upgradeable") | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' 2023-09-28T17:09:41Z Upgradeable=False ResourceDeletesInProgress: Cluster minor level upgrades are not allowed while resource deletions are in progress; resources=clusterrolebinding "default-account-openshift-machine-config-operator" By dropping the deletion manifest from 4.13, we avoid contention between two manifests, and leave the default ClusterRoleBinding alone until a later update to 4.14 will remove it. [1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.13.14 [2]: https://issues.redhat.com/browse/OCPBUGS-21721 [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1707415968109563904
Good catch! Fixing in #3983. |
This is an automated cherry-pick of #3920
/assign yuqi-zhang