Splitting this out from #1946 (comment)
Ugh and wait a second, this failed GCP run seems to have rebooted the masters twice - from this MCD log:
```
I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f to rendered-worker-b88d93e7e9a96f5d961386e6e811875a: &{osUpdate:false kargs:false fips:false passwd:false files:false units:true kernelType:false extensions:false}
```
Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
/me goes to diff the MCs
> Are we potentially racing in the MCC...something like container runtime controller generating a MC after we've already resync'd the core configs?
The problem there appears to be quite simple: in this test scenario we're changing both the MCO and machine-os-content.
- The new CVO started and installed an updated `configmap/machine-config-osimageurl`
- The old MCO rendered a new MC with that update and started a rollout to masters/workers
- Then the new MCO took over and rolled out another config update, with template changes encapsulated inside it
- So we then upgraded again
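The double update above should be detectable mechanically from the MCD log. A minimal sketch (not the actual e2e code; it only assumes the "Starting update from rendered-... to rendered-..." log format shown earlier) that counts config transitions in a node's MCD log:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// updateRe matches MCD log lines announcing a config transition, e.g.
// "Starting update from rendered-worker-<hash> to rendered-worker-<hash>".
var updateRe = regexp.MustCompile(`Starting update from (rendered-\S+) to (rendered-\S+)`)

// countUpdates returns how many config transitions appear in an MCD log.
// More than one transition during a single upgrade suggests the race above
// (modulo changes that don't actually require a reboot).
func countUpdates(log string) int {
	n := 0
	for _, line := range strings.Split(log, "\n") {
		if updateRe.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical two-line excerpt illustrating the double update.
	log := `I0806 20:59:03.361444 2297 update.go:1455] Starting update from rendered-worker-aaa to rendered-worker-bbb
I0806 21:10:00.000000 2297 update.go:1455] Starting update from rendered-worker-bbb to rendered-worker-ccc`
	fmt.Println("transitions:", countUpdates(log))
}
```

Running this on the excerpt prints `transitions: 2`, which is exactly the signature we'd want an e2e check to flag.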
I thought we had addressed this...but if so, it is lost in the dim spaces between neuron firings for me.
Update: I spot checked about 10 other e2e-gcp-upgrade jobs on this repo, and am not seeing this repeat. I think it's a real race, but perhaps rare. OTOH when it does occur it's rather bad for upgrade disruption.
If I'm wrong and this race was somehow introduced by my PR, then at the least we need an e2e test verifying that we don't double-reboot during an upgrade job (another test would be that there are at most two rendered machineconfigs per pool).
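The "at most two rendered machineconfigs per pool" check could look something like the following. This is a hedged sketch, not the real e2e test: `countRendered` is a hypothetical helper, and in a real test the names would come from listing MachineConfigs via the API rather than a hardcoded slice.

```go
package main

import (
	"fmt"
	"strings"
)

// countRendered groups rendered MachineConfig names by pool.
// Names look like "rendered-<pool>-<hash>"; everything between the
// "rendered-" prefix and the final hash segment is taken as the pool.
func countRendered(names []string) map[string]int {
	counts := map[string]int{}
	for _, n := range names {
		if !strings.HasPrefix(n, "rendered-") {
			continue // skip non-rendered configs like "00-worker"
		}
		rest := strings.TrimPrefix(n, "rendered-")
		i := strings.LastIndex(rest, "-")
		if i < 0 {
			continue
		}
		counts[rest[:i]]++
	}
	return counts
}

func main() {
	// Hypothetical mid-upgrade listing; a real test would get these
	// from the cluster (e.g. `oc get machineconfigs`).
	names := []string{
		"rendered-worker-a3629c84fa68ef33ff7fa7b5c501041f",
		"rendered-worker-b88d93e7e9a96f5d961386e6e811875a",
		"rendered-master-11111111111111111111111111111111",
		"00-worker",
	}
	for _, pool := range []string{"master", "worker"} {
		c := countRendered(names)[pool]
		status := "ok"
		if c > 2 {
			status = "possible race: more than two rendered configs"
		}
		fmt.Printf("%s: %d rendered config(s) - %s\n", pool, c, status)
	}
}
```

With the listing above this reports `master: 1` and `worker: 2`, both within bounds; a third rendered config for a pool during one upgrade would trip the check.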