OCPBUGS-16905: Operator: switch upgrade strategy to recreate#3884

Closed
yuqi-zhang wants to merge 1 commit into openshift:master from yuqi-zhang:fix-operator-lease

@yuqi-zhang

Contributor

Test to see if this will help lease acquisition slowness.

@openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Aug 24, 2023
@openshift-ci-robot

Contributor

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci Bot requested review from cheesesashimi and jkyros August 24, 2023 13:58
@openshift-ci

openshift-ci Bot commented Aug 24, 2023

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuqi-zhang

@openshift-ci Bot added the approved label Aug 24, 2023
@wking

wking commented Aug 24, 2023

Member

Using payload testing to see how this plays in the 4.13-to-4.14 job I'd been poking at in the bug:

/payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

@openshift-ci

openshift-ci Bot commented Aug 24, 2023

Contributor

@wking: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/311f53f0-428e-11ee-9b82-6e5fe4af1d09-0

@openshift-ci

openshift-ci Bot commented Aug 24, 2023

Contributor

@yuqi-zhang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/unit | 49a03c6 | link | true | /test unit |
| ci/prow/okd-scos-e2e-aws-ovn | 49a03c6 | link | false | /test okd-scos-e2e-aws-ovn |

@yuqi-zhang

Contributor Author

The unit test failure can be ignored; it is a known flake.

Lease acquisition in the e2e-upgrade job seems fast:

I0824 15:44:12.806636       1 start.go:49] Version: 4.14.0-0.ci.test-2023-08-24-140708-ci-op-vdvk4x1r-latest (Raw: machine-config-daemon-4.6.0-202006240615.p0-2299-gda307322-dirty, Hash: da3073229a67e8e1dac0ef1fdc901581af4a7560)
I0824 15:44:12.806885       1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0824 15:44:12.810125       1 metrics.go:74] Registering Prometheus metrics
I0824 15:44:12.810264       1 metrics.go:81] Starting metrics listener on 127.0.0.1:8797
I0824 15:44:12.842265       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0824 15:44:12.876952       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config

Although maybe that's normal.

The payload job seems to have failed?
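
For reference, the figures printed in the leaderelection.go:122 line above are mutually consistent with a lease duration of 137s, a renew deadline of 107s, and a retry period of 26s. Those three values are an assumption (they are not stated anywhere in this thread), and the formulas below are one reading of the log message, not necessarily the operator's actual code. A minimal Python sketch of the arithmetic:

```python
from datetime import timedelta

# Assumed settings: not stated in this thread, chosen because they
# reproduce the numbers printed in the log line.
LEASE_DURATION = timedelta(seconds=137)
RENEW_DEADLINE = timedelta(seconds=107)
RETRY_PERIOD = timedelta(seconds=26)


def lease_timing(lease_duration, renew_deadline, retry_period):
    """Derive the figures reported in the leader-election log line."""
    # Whole renew attempts that fit inside one renew deadline.
    retries = renew_deadline // retry_period
    return {
        "retries": retries,
        "clock_skew": lease_duration - renew_deadline,
        "downtime_tolerance": (retries - 1) * retry_period,
        # A crashed leader's lease must expire, then a retry must land.
        "worst_non_graceful": lease_duration + retry_period,
        # A released lease is picked up on the next retry.
        "worst_graceful": retry_period,
    }


timing = lease_timing(LEASE_DURATION, RENEW_DEADLINE, RETRY_PERIOD)
for name, value in timing.items():
    print(f"{name}: {value}")
```

Under these assumptions the worst non-graceful acquisition comes out to 163s (2m43s) and the graceful case to 26s, matching the log line.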

@sinnykumari

Contributor

I believe the original issue is about reducing leader election time on SNO clusters, which still seems to take ~5 minutes according to the sno-gcp-op job:

I0824 16:15:07.514570       1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-config-operator/machine-config...
I0824 16:20:04.644469       1 leaderelection.go:255] successfully acquired lease openshift-machine-config-operator/machine-config
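
The gap between those two log lines can be computed directly from the klog timestamps. A small Python sketch; the helper name `klog_time` is hypothetical, and it assumes both lines were logged on the same day, so only the time of day is parsed:

```python
from datetime import datetime

ATTEMPT = (
    "I0824 16:15:07.514570       1 leaderelection.go:245] attempting to "
    "acquire leader lease openshift-machine-config-operator/machine-config..."
)
ACQUIRED = (
    "I0824 16:20:04.644469       1 leaderelection.go:255] successfully "
    "acquired lease openshift-machine-config-operator/machine-config"
)


def klog_time(line):
    """Parse the hh:mm:ss.uuuuuu field from a klog header.

    The first field (e.g. 'I0824') holds severity, month, and day; it is
    skipped here because both lines are assumed to share the same date.
    """
    time_part = line.split()[1]
    return datetime.strptime(time_part, "%H:%M:%S.%f")


elapsed = klog_time(ACQUIRED) - klog_time(ATTEMPT)
print(f"lease acquisition took {elapsed.total_seconds():.1f}s")
```

That works out to roughly 297 seconds, just under five minutes, in line with the ~5 minute estimate.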

@wking

wking commented Aug 28, 2023

Member

Previous run had AllJobsTriggered: WithErrors: Jobs triggered with errors, but no details (as far as I can tell) about what those errors were. Trying again:

/payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

@openshift-ci

openshift-ci Bot commented Aug 28, 2023

Contributor

@wking: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f524e110-45eb-11ee-838f-2510ce4f786c-0

@wking

wking commented Aug 28, 2023

Member

Checking the run Sinny was poking at, I do see container starts around 16:15, but they don't seem to be part of a Deployment roll, so I don't think they exercise the new logic:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/events.json | jq -r '[.items[] | select((.involvedObject.name // "" | startswith("machine-config-operator")))] | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message' | tail -n20
2023-08-24T16:05:04Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:09:56Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:13:58Z Pod machine-config-operator-5977698469-5txwf NetworkNotReady: network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : object "openshift-machine-config-operator"/"machine-config-operator-images" not registered
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : object "openshift-machine-config-operator"/"mco-proxy-tls" not registered
2023-08-24T16:14:15Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : failed to sync secret cache: timed out waiting for the condition
2023-08-24T16:14:16Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : failed to sync configmap cache: timed out waiting for the condition
2023-08-24T16:14:41Z Pod machine-config-operator-5977698469-5txwf AddedInterface: Add eth0 [10.128.0.33/23] from ovn-kubernetes
2023-08-24T16:15:06Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container kube-rbac-proxy
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container kube-rbac-proxy
2023-08-24T16:20:24Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:23:39Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml

And checking the pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.labels["k8s-app"] == "machine-config-operator").status.containerStatuses[]'
{
  "containerID": "cri-o://44bea23a19ccfda451fee848f62ff873d2401bd7647904c53aeed5773281dd59",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 11,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}
{
  "containerID": "cri-o://b550ba232e6ecf683050d98183ef97203dba8424fef1c3ed892d25ec77b19de4",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "lastState": {},
  "name": "machine-config-operator",
  "ready": true,
  "restartCount": 9,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}

So there may be trouble with graceful leader release or something similar, but it is definitely container restarts (with no lastState?), not Deployment updates rolling out pods with an intentional strategy, that are slowing leader management in that CI run.
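
The same check can be sketched in Python for anyone without jq at hand. This assumes a pods.json shaped like the gather-extra artifact queried above (a Kubernetes PodList); the function name `restart_summary` is hypothetical, and the inline sample is trimmed from the output above instead of being fetched:

```python
import json

# Trimmed sample mirroring the containerStatuses shown above.
SAMPLE = json.loads("""
{"items": [{
  "metadata": {"labels": {"k8s-app": "machine-config-operator"}},
  "status": {"containerStatuses": [
    {"name": "kube-rbac-proxy", "restartCount": 11, "lastState": {},
     "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}}},
    {"name": "machine-config-operator", "restartCount": 9, "lastState": {},
     "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}}}
  ]}
}]}
""")


def restart_summary(pod_list, app="machine-config-operator"):
    """Return (name, restartCount, startedAt, has_last_state) per container."""
    rows = []
    for pod in pod_list.get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        if labels.get("k8s-app") != app:
            continue
        for status in pod.get("status", {}).get("containerStatuses", []):
            started = status.get("state", {}).get("running", {}).get("startedAt")
            rows.append((status["name"], status["restartCount"], started,
                         bool(status.get("lastState"))))
    return rows


for name, restarts, started, has_last in restart_summary(SAMPLE):
    print(f"{name}: {restarts} restarts, running since {started}, "
          f"lastState set: {has_last}")
```

A non-zero restartCount on a pod whose name has not changed suggests in-place container restarts; pods created by a Deployment rollout would start with a restart count of zero.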

@wking

wking commented Aug 29, 2023

Member

We're going with #3895 to pick up the default ServiceAccount deletion shift.

@wking closed this Aug 29, 2023
@openshift-ci-robot

Contributor

@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905. The bug has been updated to no longer refer to the pull request using the external bug tracker.
