Splitting this out from #1946 (comment)
Ugh and wait a second, this failed GCP run seems to have rebooted the masters twice - from this MCD log:
```
I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f to rendered-worker-b88d93e7e9a96f5d961386e6e811875a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
```
Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
/me goes to diff the MCs
> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
The problem there appears to be quite simple: in this test scenario we're changing both the MCO and machine-os-content.
- The new CVO started and installed an updated `configmap/machine-config-osimageurl`
- The old MCO rendered a new MC with that update and started a rollout to masters/workers
- Then the new MCO took over and rolled out another config update, with template changes encapsulated inside it
- So we then upgraded again
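The double update above should be detectable mechanically from the MCD log. A minimal sketch (not the actual e2e code; it only assumes the "Starting update from rendered-... to rendered-..." log format shown earlier) that counts config transitions in a node's MCD log:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// updateRe matches MCD log lines announcing a config transition, e.g.
// "Starting update from rendered-worker-<hash> to rendered-worker-<hash>".
var updateRe = regexp.MustCompile(`Starting update from (rendered-\S+) to (rendered-\S+)`)

// countUpdates returns how many config transitions appear in an MCD log.
// More than one transition during a single upgrade suggests the race above
// (modulo changes that don't actually require a reboot).
func countUpdates(log string) int {
	n := 0
	for _, line := range strings.Split(log, "\n") {
		if updateRe.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical two-line excerpt illustrating the double update.
	log := `I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-aaa to rendered-worker-bbb
I0806 21:10:00.000000 2297 update.go:1455] Starting update from rendered-worker-bbb to rendered-worker-ccc`
	fmt.Println("transitions:", countUpdates(log))
}
```

Running this on the excerpt prints `transitions: 2`, which is exactly the signature we'd want an e2e check to flag.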
I thought we had addressed this...but if so, it is lost in the dim spaces between neuron firings for me.
Update: I spot checked about 10 other e2e-gcp-upgrade jobs on this repo, and am not seeing this repeat. I think it's a real race, but perhaps rare. OTOH when it does occur it's rather bad for upgrade disruption.
If I'm wrong and this race was somehow introduced by my PR, then at the least we need an e2e test verifying that we don't double-reboot during an upgrade job (another test would be that there are at most two rendered machineconfigs per pool).
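The "at most two rendered machineconfigs per pool" check could look something like the following. This is a hedged sketch, not the real e2e test: `countRendered` is a hypothetical helper, and in a real test the names would come from listing MachineConfigs via the API rather than a hardcoded slice.

```go
package main

import (
	"fmt"
	"strings"
)

// countRendered groups rendered MachineConfig names by pool.
// Names look like "rendered-<pool>-<hash>"; everything between the
// "rendered-" prefix and the final hash segment is taken as the pool.
func countRendered(names []string) map[string]int {
	counts := map[string]int{}
	for _, n := range names {
		if !strings.HasPrefix(n, "rendered-") {
			continue // skip non-rendered configs like "00-worker"
		}
		rest := strings.TrimPrefix(n, "rendered-")
		i := strings.LastIndex(rest, "-")
		if i < 0 {
			continue
		}
		counts[rest[:i]]++
	}
	return counts
}

func main() {
	// Hypothetical mid-upgrade listing; a real test would get these
	// from the cluster (e.g. `oc get machineconfigs`).
	names := []string{
		"rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f",
		"rendered-worker-b88d93e7e9a96f5d961386e6e811875a",
		"rendered-master-11111111111111111111111111111111",
		"00-worker",
	}
	for _, pool := range []string{"master", "worker"} {
		c := countRendered(names)[pool]
		status := "ok"
		if c > 2 {
			status = "possible race: more than two rendered configs"
		}
		fmt.Printf("%s: %d rendered config(s) - %s\n", pool, c, status)
	}
}
```

With the listing above this reports `master: 1` and `worker: 2`, both within bounds; a third rendered config for a pool during one upgrade would trip the check.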