server: create a mechanism to gracefully and quickly shut down an entire cluster #58417
Description
Requested by @awoods187 and @jseldess
TL;DR: we want a way to reliably shut down an entire cluster.
Today, during multi-user demos and tests, users run afoul of production-level rules when trying to shut down an entire cluster: the shutdown process is designed and optimized to ensure that the cluster remains available when a single node is shut down.
This means, in particular, that a node does not let itself shut down if it is unable to find a replacement live node to transfer range leases to. This logic is needed to preserve cluster availability through rolling restarts and other production operations on live clusters.
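The rule described above can be illustrated with a toy model. This is only a sketch: `Node`, `try_drain`, and the round-robin placement are hypothetical names and choices made for illustration, not CockroachDB code or APIs. It shows why the last node of a cluster can never drain: it has no live peer left to take its leases.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not a CockroachDB type)."""
    node_id: int
    live: bool = True
    leases: set = field(default_factory=set)  # IDs of ranges whose lease this node holds


def try_drain(node, cluster):
    """Toy single-node drain: hand every lease to a live peer, or refuse to stop.

    Returns True if the node shut down, False if it refused.
    """
    targets = [n for n in cluster if n.live and n.node_id != node.node_id]
    if node.leases and not targets:
        # No live replacement: the node declines to shut down,
        # which is the behavior users hit when stopping a whole cluster.
        return False
    # Round-robin placement is an assumption made purely for illustration.
    for i, range_id in enumerate(sorted(node.leases)):
        targets[i % len(targets)].leases.add(range_id)
    node.leases.clear()
    node.live = False
    return True
```

In this model, draining the nodes of a three-node cluster one by one succeeds twice and then fails on the last node, which by then holds every lease.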
However, this logic is incompatible with interactive use, where a human user following a tutorial tries to stop an entire cluster at the end of a tutorial, guide, or lesson. Their experience is that "nodes refuse to shut down", and they have to resort to ungraceful shutdowns.
We would generally prefer not to encourage or teach ungraceful shutdowns: they are more likely to be detrimental to cluster health and so should not be used for routine operations.
So we really want a tool / method / operation to "shut down an entire cluster gracefully", separate from the incremental node shutdown/restart which preserves cluster health.
For several reasons (not detailed here), it is unreasonable to expect that the same mechanism can be used for both availability-preserving node shutdowns/restarts, and availability-destroying whole-cluster shutdowns.
Additionally, there is at least one reason to want different mechanisms: a whole-cluster shutdown should preserve the location of range leases, so that the patterns of data locality and traffic do not change significantly if and when the cluster is restarted. Graceful individual node shutdowns, by definition, are designed to change this traffic by redirecting it to other live nodes.
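The lease-preservation property can be stated as a small invariant. The sketch below is a toy model under loud assumptions: `Node`, `lease_map`, `shutdown_cluster`, and `restart_cluster` are hypothetical names, and the model covers only the bookkeeping of which node holds which lease, not lease expiration or any real CockroachDB mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not a CockroachDB type)."""
    node_id: int
    live: bool = True
    leases: set = field(default_factory=set)  # IDs of ranges whose lease this node holds


def lease_map(cluster):
    """Map each range ID to the ID of the node holding its lease."""
    return {range_id: n.node_id for n in cluster for range_id in n.leases}


def shutdown_cluster(cluster):
    """Toy whole-cluster stop: every node stops and no lease is transferred."""
    placement = lease_map(cluster)
    for n in cluster:
        n.live = False
    return placement


def restart_cluster(cluster):
    """Toy restart: nodes come back holding the leases they had at shutdown."""
    for n in cluster:
        n.live = True
    return lease_map(cluster)
```

The desired invariant is that the placement returned by `shutdown_cluster` equals the one returned by `restart_cluster`. An availability-preserving per-node drain deliberately breaks this invariant, because it moves leases to other live nodes.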
Jira issue: CRDB-3395