Method to ask cockroachdb if it is "safe" to decommission a node #70486
Description
Is your feature request related to a problem? Please describe.
Our operators have automated the provisioning of CockroachDB clusters on-premises. We would like to be able to ask CockroachDB whether it is safe to remove a node.
The main concern is data redundancy, i.e., how do we know whether enough replicas will remain in each zone or region?
We don't want to inspect zone constraints ourselves; we simply want to ask CockroachDB whether a node can be removed without causing an outage. We want to guarantee that the correct replication factor (RF) is maintained for all databases on the clusters.
For more context: end users have access to a web UI portal where they can remove nodes. At scale, we can't manually verify every node removal across hundreds of clusters.
For example:
If we have 9 nodes across 3 regions, can we safely remove 4 nodes and maintain quorum for the databases with an RF of 5?
If we have 6 nodes in 1 region, can we safely remove 1 node?
Do we have under-replicated ranges that are about to be up-replicated to node X?
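To illustrate the arithmetic behind the first two questions, here is a minimal Python sketch. It assumes a majority quorum and at most one replica of a range per node, ignores zone and region constraints entirely, and assumes an RF of 3 for the 6-node example because no RF is stated there. All function names are hypothetical; this is not how CockroachDB itself decides.

```python
def quorum(rf: int) -> int:
    """Majority of replicas needed for a range to stay available."""
    return rf // 2 + 1


def can_place_all_replicas(total_nodes: int, nodes_removed: int, rf: int) -> bool:
    """True if the remaining nodes can each hold one of the rf replicas.

    Assumption: one replica per node, no locality constraints.
    """
    return total_nodes - nodes_removed >= rf


def keeps_quorum_if_lost_at_once(rf: int, replicas_on_removed_nodes: int) -> bool:
    """True if a range still has a majority when that many of its
    replicas disappear simultaneously (no time to up-replicate)."""
    return rf - replicas_on_removed_nodes >= quorum(rf)


# 9 nodes, remove 4, RF 5: capacity is sufficient (5 nodes remain).
print(can_place_all_replicas(9, 4, 5))
# 6 nodes, remove 1, assumed RF 3: capacity is sufficient.
print(can_place_all_replicas(6, 1, 3))
# If a range had 4 of its 5 replicas on the removed nodes and they
# vanished at once, it would lose quorum.
print(keeps_quorum_if_lost_at_once(5, 4))
```

Capacity alone does not answer the question: the sketch says nothing about where replicas are allowed to live, which is exactly the part we would like CockroachDB to answer for us.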
Describe the solution you'd like
A way to ask this question from the SQL layer would be easy for operators to use.
Alternatively:
cockroach node decommission --dry_run
Describe alternatives you've considered
SQL statements retrieving the replication factor for all zones and then comparing it to node counts.
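A minimal sketch of this alternative in Python, assuming the per-zone replication factors and the live node count have already been fetched with SQL. The zone names and counts below are made-up inputs, and zones_at_risk is a hypothetical helper. It only compares counts and does not account for zone constraints or localities, which is why we consider it insufficient.

```python
def zones_at_risk(zone_rf: dict[str, int], live_nodes: int, nodes_to_remove: int) -> list[str]:
    """Return the zones whose replication factor exceeds the number of
    nodes that would remain after the removal.

    zone_rf maps a zone name to its configured number of replicas.
    """
    remaining = live_nodes - nodes_to_remove
    return sorted(name for name, rf in zone_rf.items() if rf > remaining)


# Hypothetical inputs: two zones on a 6-node cluster, removing 2 nodes.
example_zones = {"default": 3, "db_a": 5}
print(zones_at_risk(example_zones, live_nodes=6, nodes_to_remove=2))
```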
Additional context
We are aware of "cockroach node decommission". However, it does not appear to finish gracefully in the situations described above.
gz#9825
gz#10113
gz#10216
Jira issue: CRDB-10098
Epic CRDB-20924