kvserver: have autoupgrade process look at decommission status, not availability #53515
Open
Labels
- A-kv-decom-rolling-restart: Decommission and Rolling Restarts
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- T-kv: KV Team
Description
Reported privately in https://github.com/cockroachlabs/support/issues/584 we saw an instance of the following:
- A node (n14) was added, and allocated a node ID.
- The node was immediately removed (fast enough, I think, to not get to persist a liveness record of its own).
- A month later they tried upgrading their cluster, relying on the autoupgrade process.
- The autoupgrade process found the added node to be unavailable and kept spinning:
59512:I200824 08:03:41.404150 1038 server/server_update.go:50 [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59516:I200824 08:04:09.303473 1038 server/server_update.go:50 [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59520:I200824 08:04:35.472315 1038 server/server_update.go:50 [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
59529:I200824 08:05:05.895085 1038 server/server_update.go:50 [n1] failed attempt to upgrade cluster version, error: node 14 not running (UNKNOWN), cannot determine version
- We couldn't decommission n14 because no corresponding liveness record existed for it.
- We were able to manually bump the cluster version because this process scanned all the available liveness records to ensure that there were no dead nodes that weren't decommissioned.
We should have the autoupgrade process behave similarly by looking at the fully decommissioned bit we added in #50329, instead of just looking at availability. (Separately, it'd be nice to always have a liveness record created when adding a node to the cluster, something that I think will be made easier by #52526).
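The proposed check could be sketched roughly as follows. This is a minimal illustration, not CockroachDB's actual code: `LivenessRecord`, `MembershipStatus`, `upgradeBlockers`, and `canAutoUpgrade` are hypothetical names, and the sketch assumes the check iterates over existing liveness records (as the manual version bump did) rather than over allocated node IDs.

```go
package main

import "fmt"

// NodeID, MembershipStatus, and LivenessRecord are hypothetical stand-ins
// for illustration; they are not CockroachDB's real types.
type NodeID int

type MembershipStatus int

const (
	Active MembershipStatus = iota
	Decommissioning
	Decommissioned
)

type LivenessRecord struct {
	NodeID     NodeID
	Membership MembershipStatus
	Live       bool
}

// upgradeBlockers returns the nodes that should block an automatic upgrade:
// nodes that have a liveness record, are not live, and are not fully
// decommissioned. A node that never persisted a liveness record (like n14
// above) does not appear in the input and therefore cannot block.
func upgradeBlockers(records []LivenessRecord) []NodeID {
	var blockers []NodeID
	for _, r := range records {
		if r.Membership == Decommissioned {
			// Fully decommissioned nodes are ignored regardless of availability.
			continue
		}
		if !r.Live {
			blockers = append(blockers, r.NodeID)
		}
	}
	return blockers
}

// canAutoUpgrade returns an error naming the blocking nodes, or nil if the
// upgrade may proceed.
func canAutoUpgrade(records []LivenessRecord) error {
	if b := upgradeBlockers(records); len(b) > 0 {
		return fmt.Errorf("nodes %v are not live and not decommissioned", b)
	}
	return nil
}
```

Under these assumptions, a dead node that is still an active or decommissioning member continues to block the upgrade, while a fully decommissioned node, or one with no record at all, does not.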
Jira issue: CRDB-3857