[release-4.14] OCPBUGS-19357: install: Recreate and delayed default ServiceAccount deletion by openshift-cherrypick-robot · Pull Request #3920 · openshift/machine-config-operator

openshift-cherrypick-robot · 2023-09-18T19:14:49Z

This is an automated cherry-pick of #3895

/assign yuqi-zhang

Single-node updates from 4.13 to 4.14 are having slow leader handoffs during operator Deployment updates [1]. For example [2] in Loki: {invoker="openshift-internal-ci/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688"} | unpack | pod="machine-config-operator-74d7475455-cgcmg" |= "leaderelection" gives: machine-config-operator-7987b6bcdd-6zvrt I0817 11:26:14.747630 1 leaderelection.go:248] attempting to acquire leader lease openshift-machine-config-operator/machine-config... I0817 11:26:14.756376 1 leaderelection.go:258] successfully acquired lease openshift-machine-config-operator/machine-config E0817 11:31:15.153289 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps machine-config) E0817 11:37:42.204539 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF E0817 11:53:05.221213 1 leaderelection.go:330] error retrieving resource lock openshift-machine-config-operator/machine-config: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-machine-config-operator/configmaps/machine-config": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF machine-config-operator-74d7475455-cgcmg I0817 12:21:41.019562 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}. I0817 12:21:41.036459 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config... machine-config-operator-7987b6bcdd-6zvrt E0817 12:21:41.780276 1 leaderelection.go:306] Failed to release lock: configmaps "machine-config" is forbidden: User "system:serviceaccount:openshift-machine-config-operator:default" cannot update resource "configmaps" in API group "" in the namespace "openshift-machine-config-operator" machine-config-operator-74d7475455-cgcmg I0817 12:27:27.767746 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config I0817 12:35:29.649939 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}. I0817 12:35:29.685549 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config... I0817 12:35:29.696394 1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config That shows two lease handoffs: 1. During the update from 4.13 to 4.14, the cluster-version operator bumps the MCO Deployment to roll out the 4.14 image and other manifest changes. a. The outgoing 4.13 is using the default ServiceAccount, because it predates [3,4]. b. The CVO deletes the default ServiceAccount, because its manifest filename sorts it before the MCO Deployment. c. The CVO bumps the MCO Deployment. d. The MCO Deployment defaults to RollingUpdate [5] with a default maxSurge of 25% rounded up [6]. That maxSurge default is very old, dating back to at least the version of Kubernetes vendored by OpenShift 4.1 [7]. e. The Deployment controller surges the incoming pod, and at 12:21:41.036459 it checks to see if the lease is available, but it isn't, because the outgoing pod is still running. f. The Deployment controller TERMs the outgoing containers, and at 12:21:41.780276 the outgoing pod complains that it is no longer authorized to delete the lease ConfigMap, because the default ServiceAccount has been deleted in 1.b. g. The incoming pod acquires the lease 5m46s later at 12:27:27, when the outgoing container's final lease goes stale. 2. When the single node goes down for a reboot to roll out the 4.14 RHCOS, and the new container coming up at 12:35:29 acquires the lease smoothly. This commit improves the deployment rollout with two changes: * Shifting the default ServiceAccount deletion to a 90 manifest moves it behind 0000_80_machine-config-operator_04_deployment.yaml and 0000_80_machine-config-operator_06_clusteroperator.yaml. Moving behind the ClusterOperator ensures that deletion is not initiated until the incoming operator pod has updated the ClusterOperator to confirm the update has completed. Positioning between the Deployment and the ClusterOperator would have put the deletion after the CVO bumped the Deployment, but might still have raced against the outgoing operator pod actually shutting down. This should allow the outgoing pod to gracefully remove the ConfigMap and Lease resources during graceful shutdowns. * Shifting the Deployment to Recreate [8]. But also as discussed in [8]: Note: This will only guarantee Pod termination previous to creation for upgrades. If you upgrade a Deployment, all Pods of the old revision will be terminated immediately. Successful removal is awaited before any Pod of the new revision is created. If you manually delete a Pod, the lifecycle is controlled by the ReplicaSet and the replacement will be created immediately (even if the old Pod is still in a Terminating state). If you need an "at most" guarantee for your Pods, you should consider using a StatefulSet. Still, Recreate will avoid the current milliseconds of overlap between the incoming pod's first attempt to acquire the lease and the outgoing pod's attempt to release during Deployment rollouts. Other MCO container restarts and drains-for-multi-control-plane-rolls and such may still have racey overlap, but should be limited to the: Worst graceful lease acquisition is {26s}. latency (at least with the current 4.14 single-node lease configuration). [1]: https://issues.redhat.com/browse/OCPBUGS-16905 [2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node/1692132139799154688 [3]: https://issues.redhat.com/browse/OCPBUGS-10924 [4]: openshift@ace637f [5]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy [6]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#max-surge [7]: openshift/api $ git --no-pager grep -B12 maxSurge origin/release-4.1 -- vendor/k8s.io/api/apps/v1/types.go origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // The maximum number of pods that can be scheduled above the desired number of origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // pods. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%). origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // This can not be 0 if MaxUnavailable is 0. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Absolute number is calculated from percentage by rounding up. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Defaults to 25%. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // Example: when this is set to 30%, the new ReplicaSet can be scaled up immediately when origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // the rolling update starts, such that the total number of old and new pods do not exceed origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // 130% of desired pods. Once old pods have been killed, origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // new ReplicaSet can be scaled up further, ensuring that total number of pods running origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // at any time during the update is at most 130% of desired pods. origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go- // +optional origin/release-4.1:vendor/k8s.io/api/apps/v1/types.go: MaxSurge *intstr.IntOrString `json:"maxSurge,omitempty" protobuf:"bytes,2,opt,name=maxSurge" [8]: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#recreate-deployment

openshift-ci-robot · 2023-09-18T19:14:58Z

@openshift-cherrypick-robot: Jira Issue OCPBUGS-16905 has been cloned as Jira Issue OCPBUGS-19357. Will retitle bug to link to clone.
/retitle [release-4.14] OCPBUGS-19357: install: Recreate and delayed default ServiceAccount deletion

Details

In response to this:

This is an automated cherry-pick of #3895

/assign yuqi-zhang

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot · 2023-09-18T19:15:26Z

@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-19357, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.14.0) matches configured target version for branch (4.14.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
dependent bug Jira Issue OCPBUGS-16905 is in the state MODIFIED, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
dependent Jira Issue OCPBUGS-16905 targets the "4.15.0" version, which is one of the valid target versions: 4.15.0
bug has dependents

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

This is an automated cherry-pick of #3895

/assign yuqi-zhang

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang

/lgtm
/label backport-risk-assessed

This should be a pretty safe change generally speaking, and can help some environments with upgrade time

openshift-ci · 2023-09-18T20:49:15Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: openshift-cherrypick-robot, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2023-09-18T22:27:37Z

@openshift-cherrypick-robot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/okd-scos-e2e-gcp-op	`496abc1`	link	false	`/test okd-scos-e2e-gcp-op`
ci/prow/e2e-gcp-rt	`496abc1`	link	false	`/test e2e-gcp-rt`
ci/prow/e2e-aws-ovn-fips-op	`496abc1`	link	false	`/test e2e-aws-ovn-fips-op`
ci/prow/e2e-gcp-rt-op	`496abc1`	link	false	`/test e2e-gcp-rt-op`
ci/prow/e2e-vsphere-upi-zones	`496abc1`	link	false	`/test e2e-vsphere-upi-zones`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

sergiordlr · 2023-09-19T09:27:11Z

Verified using SNO IPI on GCP

We upgraded from 4.13 to 4.14+current fix:

$ oc get clusterversion -o yaml 
...
    history:
    - acceptedRisks: |-
        Target release version="" image="registry.build03.ci.openshift.org/ci-ln-k4c8qv2/release:latest" cannot be verified, but continuing anyway because the update was forced: release images that are not accessed via digest cannot be verified
        Forced through blocking failures: Multiple precondition checks failed:
        * Precondition "ClusterVersionUpgradeable" failed because of "AdminAckRequired": Kubernetes 1.27 and therefore OpenShift 4.14 remove several APIs which require admin consideration. Please see the knowledge article https://access.redhat.com/articles/6958395 for details and instructions.
        * Precondition "EtcdRecentBackup" failed because of "ControllerStarted": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
        * Precondition "ClusterVersionRecommendedUpdate" failed because of "UnknownUpdate": RetrievedUpdates=False (VersionNotFound), so the recommended status of updating from 4.13.0-0.nightly-2023-09-12-074803 to 4.14.0-0.ci.test-2023-09-19-074741-ci-ln-k4c8qv2-latest is unknown.
      completionTime: "2023-09-19T09:13:40Z"
      image: registry.build03.ci.openshift.org/ci-ln-k4c8qv2/release:latest
      startedTime: "2023-09-19T08:28:54Z"
      state: Completed
      verified: false
      version: 4.14.0-0.ci.test-2023-09-19-074741-ci-ln-k4c8qv2-latest
    - completionTime: "2023-09-19T08:19:27Z"
      image: registry.ci.openshift.org/ocp/release@sha256:e03f07af4872b73f425c6095cabba4600106b063005d5236991f7225fedd9704
      startedTime: "2023-09-19T07:50:26Z"
      state: Completed
      verified: false
      version: 4.13.0-0.nightly-2023-09-12-074803

Before the operator pod was upgraded, we had these messages in the machine-config-operator pod:

W0919 09:00:09.731668       1 warnings.go:70] unknown field "spec.infra.metadata.resourceVersion"
W0919 09:00:09.731670       1 warnings.go:70] unknown field "spec.infra.metadata.uid"
I0919 09:00:28.092785       1 helpers.go:93] Shutting down due to: terminated
I0919 09:00:28.099910       1 helpers.go:96] Context cancelled
I0919 09:00:28.115105       1 operator.go:286] Shutting down MachineConfigOperator
I0919 09:00:28.192052       1 start.go:115] Stopped leading. Terminating.

After the operator pod was updated, we can find these messages in the machine-config-operator pod:

2023-09-19T09:00:43.987657360+00:00 stderr F I0919 09:00:43.987594       1 start.go:49] Version: 4.14.0-0.ci.test-2023-09-19-074741-ci-ln-k4c8qv2-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2312-g2bd89b29-dirty, Hash: 2bd89b29582b689694ee9f3dddb97962dc891230)
2023-09-19T09:00:43.988008877+00:00 stderr F I0919 09:00:43.987965       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
2023-09-19T09:00:43.989377943+00:00 stderr F I0919 09:00:43.988649       1 metrics.go:74] Registering Prometheus metrics
2023-09-19T09:00:43.989377943+00:00 stderr F I0919 09:00:43.988680       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
2023-09-19T09:00:44.064489189+00:00 stderr F I0919 09:00:44.064427       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
2023-09-19T09:00:44.248651741+00:00 stderr F I0919 09:00:44.247322       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

It took 16 for the new operator pod to acquire the lease.

And after the node was rebooted, we can find these messages in the machine-config-operator pod:

Defaulted container "machine-config-operator" out of: machine-config-operator, kube-rbac-proxy
I0919 09:10:08.585284       1 start.go:49] Version: 4.14.0-0.ci.test-2023-09-19-074741-ci-ln-k4c8qv2-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2312-g2bd89b29-dirty, Hash: 2bd89b29582b689694ee9f3dddb97962dc891230)
I0919 09:10:08.592032       1 metrics.go:74] Registering Prometheus metrics
I0919 09:10:08.596207       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0919 09:10:08.597132       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
I0919 09:10:08.618197       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0919 09:10:08.625006       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

After the reboot it could get the lease quickly as well.

We can add the qe-approved label

/label qe-approved

sinnykumari · 2023-09-19T10:26:22Z

Thanks Sergio for verifying, backport PR would need cherry-pick-approved label.

sergiordlr · 2023-09-19T14:21:38Z

/label cherry-pick-approved

openshift-ci-robot · 2023-09-19T14:29:30Z

@openshift-cherrypick-robot: Jira Issue OCPBUGS-19357: All pull requests linked via external trackers have merged:

openshift/machine-config-operator#3920

Jira Issue OCPBUGS-19357 has been moved to the MODIFIED state.

Details

In response to this:

This is an automated cherry-pick of #3895

/assign yuqi-zhang

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

yuqi-zhang · 2023-09-19T14:41:43Z

/cherry-pick release-4.13

Since the original bug was reported against 4.13

openshift-cherrypick-robot · 2023-09-19T14:42:24Z

@yuqi-zhang: new pull request created: #3923

Details

In response to this:

/cherry-pick release-4.13

Since the original bug was reported against 4.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot · 2023-09-20T03:35:17Z

Fix included in accepted release 4.14.0-0.nightly-2023-09-19-201452

openshift-cherrypick-robot mentioned this pull request Sep 18, 2023

OCPBUGS-16905: install: Recreate and delayed default ServiceAccount deletion #3895

Merged

openshift-cherrypick-robot assigned yuqi-zhang Sep 18, 2023

openshift-ci Bot changed the title ~~[release-4.14] OCPBUGS-16905: install: Recreate and delayed default ServiceAccount deletion~~ [release-4.14] OCPBUGS-19357: install: Recreate and delayed default ServiceAccount deletion Sep 18, 2023

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Sep 18, 2023

openshift-ci Bot requested review from cdoern, cheesesashimi and sergiordlr September 18, 2023 19:15

yuqi-zhang approved these changes Sep 18, 2023

View reviewed changes

openshift-ci Bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Sep 18, 2023

openshift-ci Bot assigned anuragthehatter, eurijon, mburman5, mhanss, rbbratta, reihl, rioliu-rh, rlobillo and sergiordlr Sep 18, 2023

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Sep 18, 2023

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2023

openshift-ci Bot added the qe-approved Signifies that QE has signed off on this PR label Sep 19, 2023

openshift-ci Bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 19, 2023

openshift-merge-robot merged commit 600d383 into openshift:release-4.14 Sep 19, 2023

openshift-cherrypick-robot mentioned this pull request Sep 19, 2023

[release-4.13] OCPBUGS-19400: install: Recreate and delayed default ServiceAccount deletion #3923

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[release-4.14] OCPBUGS-19357: install: Recreate and delayed default ServiceAccount deletion#3920

[release-4.14] OCPBUGS-19357: install: Recreate and delayed default ServiceAccount deletion#3920
openshift-merge-robot merged 1 commit into
openshift:release-4.14from
openshift-cherrypick-robot:cherry-pick-3895-to-release-4.14

openshift-cherrypick-robot commented Sep 18, 2023

Uh oh!

openshift-ci-robot commented Sep 18, 2023

Uh oh!

openshift-ci-robot commented Sep 18, 2023

Uh oh!

yuqi-zhang left a comment

Uh oh!

openshift-ci Bot commented Sep 18, 2023

Uh oh!

openshift-ci Bot commented Sep 18, 2023

Uh oh!

sergiordlr commented Sep 19, 2023

Uh oh!

sinnykumari commented Sep 19, 2023

Uh oh!

sergiordlr commented Sep 19, 2023

Uh oh!

openshift-ci-robot commented Sep 19, 2023

Uh oh!

yuqi-zhang commented Sep 19, 2023

Uh oh!

openshift-cherrypick-robot commented Sep 19, 2023

Uh oh!

openshift-merge-robot commented Sep 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

15 participants

Conversation

openshift-cherrypick-robot commented Sep 18, 2023

Uh oh!

openshift-ci-robot commented Sep 18, 2023

Uh oh!

openshift-ci-robot commented Sep 18, 2023

Uh oh!

yuqi-zhang left a comment

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented Sep 18, 2023

Uh oh!

openshift-ci Bot commented Sep 18, 2023

Uh oh!

sergiordlr commented Sep 19, 2023

Uh oh!

sinnykumari commented Sep 19, 2023

Uh oh!

sergiordlr commented Sep 19, 2023

Uh oh!

openshift-ci-robot commented Sep 19, 2023

Uh oh!

yuqi-zhang commented Sep 19, 2023

Uh oh!

openshift-cherrypick-robot commented Sep 19, 2023

Uh oh!

openshift-merge-robot commented Sep 20, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

15 participants