OCPBUGS-16905: Operator: switch upgrade strategy to recreate by yuqi-zhang · Pull Request #3884 · openshift/machine-config-operator

yuqi-zhang · 2023-08-24T13:55:21Z

Test to see if this will help lease acquisition slowness.


openshift-ci-robot · 2023-08-24T13:55:26Z

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905, which is invalid:

expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


openshift-ci · 2023-08-24T13:58:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang


Needs approval from an approver in each of these files:

~~OWNERS~~ [yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2023-08-24T14:54:56Z

Using payload testing to see how this plays in the 4.13-to-4.14 job I'd been poking at in the bug:

/payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

openshift-ci · 2023-08-24T14:55:08Z

@wking: trigger 1 job(s) for the /payload-(job|aggregate) command

periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/311f53f0-428e-11ee-9b82-6e5fe4af1d09-0

openshift-ci · 2023-08-24T16:33:56Z

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/unit | `49a03c6` | link | true | `/test unit` |
| ci/prow/okd-scos-e2e-aws-ovn | `49a03c6` | link | false | `/test okd-scos-e2e-aws-ovn` |


yuqi-zhang · 2023-08-24T16:37:50Z

The unit test failure can be ignored; it is a known flake.

The e2e-upgrade seems fast:

I0824 15:44:12.806636       1 start.go:49] Version: 4.14.0-0.ci.test-2023-08-24-140708-ci-op-vdvk4x1r-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2299-gda307322-dirty, Hash: da3073229a67e8e1dac0ef1fdc901581af4a7560)
I0824 15:44:12.806885       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0824 15:44:12.810125       1 metrics.go:74] Registering Prometheus metrics
I0824 15:44:12.810264       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
I0824 15:44:12.842265       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0824 15:44:12.876952       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

Although maybe that's normal.

The payload job seems to have failed?
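For reference, the figures in the `leaderelection.go:122` line can be reproduced with a little arithmetic. This is only a sketch: the three input durations (137s lease, 107s renew deadline, 26s retry period) are assumptions that do not appear in this thread, chosen because they reproduce every number in the log line, and the formulas are my reading of how such a summary is derived rather than a quote of the library code.

```python
from datetime import timedelta

# Assumed inputs (not stated in this thread); picked because they
# reproduce the numbers printed in the log line.
LEASE_DURATION = timedelta(seconds=137)
RENEW_DEADLINE = timedelta(seconds=107)
RETRY_PERIOD = timedelta(seconds=26)


def lease_summary(lease, renew, retry):
    """Derive the figures reported in the leader-election log line."""
    retries = renew // retry  # whole retries that fit in the renew deadline
    return {
        "retries": retries,
        "clock_skew": lease - renew,
        "apiserver_downtime_tolerance": (retries - 1) * retry,
        # A crashed leader never releases, so a new pod waits out the
        # full lease and may then need one more retry period.
        "worst_non_graceful": lease + retry,
        # A leader that releases on shutdown costs at most one retry.
        "worst_graceful": retry,
    }


summary = lease_summary(LEASE_DURATION, RENEW_DEADLINE, RETRY_PERIOD)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Under those assumptions the 34ms acquisition in the log is the graceful path with no competing holder; the 2m43s figure only applies when the previous holder exits without releasing the lease.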

sinnykumari · 2023-08-25T14:44:33Z

I believe the original issue is about reducing leader election time on SNO clusters, which still looks like it takes ~5 minutes per the sno-gcp-op job:

I0824 16:15:07.514570       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0824 16:20:04.644469       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
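To put a number on the gap between those two lines, here is a small sketch that parses the klog header timestamps and subtracts them. The helper name `klog_time` is made up for illustration, and the year is an assumption because klog headers do not carry one.

```python
from datetime import datetime

ATTEMPT = ("I0824 16:15:07.514570       1 leaderelection.go:245] "
           "attempting to acquire leader lease "
           "openshift-machine-config-operator/machine-config...")
ACQUIRED = ("I0824 16:20:04.644469       1 leaderelection.go:255] "
            "successfully acquired lease "
            "openshift-machine-config-operator/machine-config")


def klog_time(line, year=2023):
    """Parse the 'Lmmdd hh:mm:ss.uuuuuu' header of a klog line.

    The year is supplied by the caller (assumed 2023 here) because the
    klog header omits it.
    """
    mmdd, clock = line[1:].split()[:2]
    return datetime.strptime(f"{year}{mmdd} {clock}", "%Y%m%d %H:%M:%S.%f")


wait = klog_time(ACQUIRED) - klog_time(ATTEMPT)
print(wait)  # 0:04:57.129899
```

That is just under five minutes, noticeably longer than the 2m43s worst non-graceful figure the operator logs at startup.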

wking · 2023-08-28T21:43:42Z

Previous run had AllJobsTriggered: WithErrors: Jobs triggered with errors, but no details (as far as I can tell) about what those errors were. Trying again:

/payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

openshift-ci · 2023-08-28T21:43:47Z

@wking: trigger 1 job(s) for the /payload-(job|aggregate) command

periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f524e110-45eb-11ee-838f-2510ce4f786c-0

wking · 2023-08-28T22:17:15Z

Checking the run Sinny was poking at, I do see container starts around 16:15, but they don't seem to be part of a Deployment roll, so I don't think they exercise the new logic:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/events.json | jq -r '[.items[] | select((.involvedObject.name // "" | startswith("machine-config-operator")))] | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message' | tail -n20
2023-08-24T16:05:04Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:09:56Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:13:58Z Pod machine-config-operator-5977698469-5txwf NetworkNotReady: network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : object "openshift-machine-config-operator"/"machine-config-operator-images" not registered
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : object "openshift-machine-config-operator"/"mco-proxy-tls" not registered
2023-08-24T16:14:15Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : failed to sync secret cache: timed out waiting for the condition
2023-08-24T16:14:16Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : failed to sync configmap cache: timed out waiting for the condition
2023-08-24T16:14:41Z Pod machine-config-operator-5977698469-5txwf AddedInterface: Add eth0 [10.128.0.33/23] from ovn-kubernetes
2023-08-24T16:15:06Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container kube-rbac-proxy
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container kube-rbac-proxy
2023-08-24T16:20:24Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:23:39Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
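One way to check for a Deployment roll more directly is to look for `ScalingReplicaSet` events on the Deployment itself, which the Deployment controller emits when it scales ReplicaSets during a rollout. A minimal sketch, assuming an `events.json`-style dict already loaded in memory; the function name and the trimmed sample are illustrative, and the last sample entry is hypothetical (no such event appears in the excerpt above).

```python
def rollout_events(events, deployment="machine-config-operator"):
    """Return ScalingReplicaSet events recorded against one Deployment."""
    matches = []
    for event in events.get("items", []):
        obj = event.get("involvedObject", {})
        if (obj.get("kind") == "Deployment"
                and obj.get("name") == deployment
                and event.get("reason") == "ScalingReplicaSet"):
            matches.append(event)
    return matches


# Trimmed from the excerpt above: neither entry is a rollout event.
observed = {"items": [
    {"firstTimestamp": "2023-08-24T16:09:56Z",
     "involvedObject": {"kind": "Deployment", "name": "machine-config-operator"},
     "reason": "ConfigMapUpdated"},
    {"firstTimestamp": "2023-08-24T16:15:07Z",
     "involvedObject": {"kind": "Pod",
                        "name": "machine-config-operator-5977698469-5txwf"},
     "reason": "Started"},
]}

# Hypothetical event, shown only to illustrate what a roll would look like.
hypothetical = {"items": observed["items"] + [
    {"firstTimestamp": "2023-08-24T16:15:00Z",
     "involvedObject": {"kind": "Deployment", "name": "machine-config-operator"},
     "reason": "ScalingReplicaSet"},
]}

print(len(rollout_events(observed)))      # 0
print(len(rollout_events(hypothetical)))  # 1
```

Because the jq output is cut with `tail -n20`, an empty result on the excerpt alone would not rule out an earlier rollout; the check is meant to run against the full file.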

And checking the pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.labels["k8s-app"] == "machine-config-operator").status.containerStatuses[]'
{
  "containerID": "cri-o://44bea23a19ccfda451fee848f62ff873d2401bd7647904c53aeed5773281dd59",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 11,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}
{
  "containerID": "cri-o://b550ba232e6ecf683050d98183ef97203dba8424fef1c3ed892d25ec77b19de4",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "lastState": {},
  "name": "machine-config-operator",
  "ready": true,
  "restartCount": 9,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}

So there may be trouble with graceful leader release or something similar, but it's definitely container restarts (with no lastState?), not Deployment updates rolling out pods with an intentional strategy, that are slowing leader management in that CI run.
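The same check can be scripted for other runs. Below is a rough sketch that pulls restart counts and the `lastState` oddity out of a `pods.json`-style dict; the function name is made up, and the sample is trimmed from the jq output above (the pod name is taken from the events listing, and fields not needed for the check are dropped).

```python
def restart_summary(pods):
    """List (container, restartCount, has_last_state, startedAt) tuples."""
    rows = []
    for pod in pods.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            running = status.get("state", {}).get("running", {})
            rows.append((
                status["name"],
                status["restartCount"],
                bool(status.get("lastState")),  # {} means no record kept
                running.get("startedAt"),
            ))
    return sorted(rows)


# Trimmed from the containerStatuses shown above.
sample = {"items": [{
    "metadata": {"name": "machine-config-operator-5977698469-5txwf"},
    "status": {"containerStatuses": [
        {"name": "kube-rbac-proxy", "restartCount": 11, "lastState": {},
         "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}}},
        {"name": "machine-config-operator", "restartCount": 9, "lastState": {},
         "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}}},
    ]},
}]}

for name, restarts, has_last, started in restart_summary(sample):
    print(f"{name}: restarts={restarts} lastState={has_last} started={started}")
```

A nonzero `restartCount` paired with an empty `lastState` is the combination flagged above: the containers restarted in place, and nothing indicates a new pod rolled out by the Deployment.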

wking · 2023-08-29T21:26:54Z

We're going with #3895 to pick up the default ServiceAccount deletion shift.

openshift-ci-robot · 2023-08-29T21:27:01Z

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905. The bug has been updated to no longer refer to the pull request using the external bug tracker.


Operator: switch upgrade strategy to recreate

49a03c6

Test to see if this will help lease acquisition slowness

openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Aug 24, 2023

openshift-ci Bot requested review from cheesesashimi and jkyros August 24, 2023 13:58

openshift-ci Bot added the approved label Aug 24, 2023

wking mentioned this pull request Aug 29, 2023

OCPBUGS-16905: install: Recreate and delayed default ServiceAccount deletion #3895

Merged

wking closed this Aug 29, 2023

yuqi-zhang wants to merge 1 commit into openshift:master from yuqi-zhang:fix-operator-lease
