OCPBUGS-16905: Operator: switch upgrade strategy to recreate #3884
yuqi-zhang wants to merge 1 commit into
Test to see whether switching the upgrade strategy to Recreate helps with slow lease acquisition.
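For context, here is a minimal sketch of what such a switch amounts to at the manifest level. With the Recreate strategy, Kubernetes terminates the old pod before creating its replacement, and the idea being tested is that the incoming operator then spends less time waiting on a lease the outgoing one still holds. The manifest below is a hypothetical, trimmed-down stand-in rather than the operator's actual Deployment, and the helper name `use_recreate_strategy` is illustrative only.

```python
import copy


def use_recreate_strategy(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest with spec.strategy set to Recreate."""
    patched = copy.deepcopy(deployment)
    # Recreate takes no parameters. A leftover rollingUpdate stanza is rejected
    # by API validation, so the whole strategy block is replaced.
    patched.setdefault("spec", {})["strategy"] = {"type": "Recreate"}
    return patched


# Hypothetical, trimmed-down manifest for illustration only.
example = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "machine-config-operator"},
    "spec": {
        "replicas": 1,
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": "25%", "maxUnavailable": "25%"},
        },
    },
}

print(use_recreate_strategy(example)["spec"]["strategy"])
```

The input manifest is left untouched, so the before and after strategies can be compared side by side.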
|
@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: yuqi-zhang |
|
Using payload testing to see how this plays in the 4.13-to-4.14 job I'd been poking at in the bug:
/payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node
|
@wking: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/311f53f0-428e-11ee-9b82-6e5fe4af1d09-0 |
|
@yuqi-zhang: The following tests failed, say
|
The unit test failure can be ignored; it is a known flake. The e2e-upgrade seems fast, although maybe that's normal. The payload job seems to have failed? |
|
I believe the original issue is about reducing leader-election time on SNO clusters, which still appears to take ~5 minutes according to the sno-gcp-op job. |
|
Previous run had /payload-job periodic-ci-openshift-release-master-nightly-4.14-upgrade-from-stable-4.13-e2e-aws-upgrade-ovn-single-node |
|
@wking: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f524e110-45eb-11ee-838f-2510ce4f786c-0 |
|
Checking the run Sinny was poking at, I do see container starts around 16:15, but they don't seem to be part of a Deployment roll, so I don't think they exercise the new logic:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/events.json | jq -r '[.items[] | select((.involvedObject.name // "" | startswith("machine-config-operator")))] | sort_by(.firstTimestamp)[] | .firstTimestamp + " " + (.involvedObject | .kind + " " + .name) + " " + .reason + ": " + .message' | tail -n20
2023-08-24T16:05:04Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:09:56Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:13:58Z Pod machine-config-operator-5977698469-5txwf NetworkNotReady: network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : object "openshift-machine-config-operator"/"machine-config-operator-images" not registered
2023-08-24T16:13:59Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : object "openshift-machine-config-operator"/"mco-proxy-tls" not registered
2023-08-24T16:14:15Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "proxy-tls" : failed to sync secret cache: timed out waiting for the condition
2023-08-24T16:14:16Z Pod machine-config-operator-5977698469-5txwf FailedMount: MountVolume.SetUp failed for volume "images" : failed to sync configmap cache: timed out waiting for the condition
2023-08-24T16:14:41Z Pod machine-config-operator-5977698469-5txwf AddedInterface: Add eth0 [10.128.0.33/23] from ovn-kubernetes
2023-08-24T16:15:06Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container machine-config-operator
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Pulled: Container image "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7" already present on machine
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Created: Created container kube-rbac-proxy
2023-08-24T16:15:07Z Pod machine-config-operator-5977698469-5txwf Started: Started container kube-rbac-proxy
2023-08-24T16:20:24Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
2023-08-24T16:23:39Z Deployment machine-config-operator ConfigMapUpdated: Updated ConfigMap/kube-rbac-proxy -n openshift-machine-config-operator:
cause by changes in data.config-file.yaml
And checking the pod:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/3884/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node/1694710054273421312/artifacts/e2e-gcp-op-single-node/gather-extra/artifacts/pods.json | jq '.items[] | select(.metadata.labels["k8s-app"] == "machine-config-operator").status.containerStatuses[]'
{
  "containerID": "cri-o://44bea23a19ccfda451fee848f62ff873d2401bd7647904c53aeed5773281dd59",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:b5e574c5b2fdd0a90d899793e2ce97792dc2d3e3fbf934a107139b1b4f2732a7",
  "lastState": {},
  "name": "kube-rbac-proxy",
  "ready": true,
  "restartCount": 11,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}
{
  "containerID": "cri-o://b550ba232e6ecf683050d98183ef97203dba8424fef1c3ed892d25ec77b19de4",
  "image": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "imageID": "registry.build02.ci.openshift.org/ci-op-vdvk4x1r/stable@sha256:670101c1f75e5b2d3f6e40c202c2da502c911e61a43b38fcd23d0742b7d29d8e",
  "lastState": {},
  "name": "machine-config-operator",
  "ready": true,
  "restartCount": 9,
  "started": true,
  "state": {
    "running": {
      "startedAt": "2023-08-24T16:15:07Z"
    }
  }
}
So there's maybe trouble with graceful leader release or something, but it's definitely container restarts (with no |
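The two checks above can also be sketched in Python using only the standard library, which may be handier when comparing several runs. This is a rough equivalent under stated assumptions, not a drop-in replacement for the jq filters: the function names `restart_summary` and `events_for` are hypothetical, and the sample data is trimmed by hand from the output shown above instead of being fetched from the CI artifacts.

```python
def restart_summary(pods: dict, app_label: str = "machine-config-operator") -> list:
    """List (container, restartCount, startedAt) for pods with the given k8s-app label."""
    rows = []
    for pod in pods.get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        if labels.get("k8s-app") != app_label:
            continue
        for status in pod.get("status", {}).get("containerStatuses", []):
            started = status.get("state", {}).get("running", {}).get("startedAt")
            rows.append((status["name"], status["restartCount"], started))
    return rows


def events_for(events: dict, prefix: str) -> list:
    """Events whose involvedObject name starts with prefix, sorted by firstTimestamp."""
    picked = [
        e for e in events.get("items", [])
        if (e.get("involvedObject", {}).get("name") or "").startswith(prefix)
    ]
    return sorted(picked, key=lambda e: e.get("firstTimestamp") or "")


# Sample trimmed from the pods.json output above; all other fields omitted.
sample_pods = {
    "items": [
        {
            "metadata": {
                "name": "machine-config-operator-5977698469-5txwf",
                "labels": {"k8s-app": "machine-config-operator"},
            },
            "status": {
                "containerStatuses": [
                    {
                        "name": "kube-rbac-proxy",
                        "restartCount": 11,
                        "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}},
                    },
                    {
                        "name": "machine-config-operator",
                        "restartCount": 9,
                        "state": {"running": {"startedAt": "2023-08-24T16:15:07Z"}},
                    },
                ]
            },
        }
    ]
}

for name, restarts, started in restart_summary(sample_pods):
    print(f"{name}: restartCount={restarts}, startedAt={started}")
```

A nonzero `restartCount` on a pod whose name has not changed points at in-place container restarts, whereas a Deployment roll would show up as a new pod name with the count starting again from zero.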
|
We're going with #3895 to pick up the default ServiceAccount deletion shift. |
|
@yuqi-zhang: This pull request references Jira Issue OCPBUGS-16905. The bug has been updated to no longer refer to the pull request using the external bug tracker.