storage: Cluster can't recover if all nodes are decommissioned and restarted #27444
Description
A user on our forum reported that they accidentally decommissioned and restarted all the nodes in their cluster: https://forum.cockroachlabs.com/t/restarting-from-decommissioned-nodes/1826
It turns out that if you do this in a cluster of any size (including a local 3 node cluster), the cluster never recovers after the restart because all the nodes still consider themselves to be in a draining state, which causes all RequestLease requests to get dropped on the floor:
cockroach/pkg/storage/replica_range_lease.go, lines 527 to 531 in 218a954:

```go
if r.store.IsDraining() {
	// We've retired from active duty.
	return r.mu.pendingLeaseRequest.newResolvedHandle(roachpb.NewError(
		newNotLeaseHolderError(nil, r.store.StoreID(), r.mu.state.Desc)))
}
```
This affects all lease requests, including those for r1, so no range ever gets a leaseholder. This keeps anything from happening in the cluster and even breaks a surprising fraction of our debug pages. It also, of course, prevents the recommission command from working.
I was able to "fix" such a cluster by removing the code highlighted above and rebuilding the binary, but that's not a great solution given that in most cases we really don't want to request new leases on decommissioned nodes.
It may be preferable to reject decommissioning attempts that would prevent a cluster from continuing to function. That'd be tough to guarantee, though, if we want to protect all ranges from losing quorum. So perhaps there's a middle ground where we allow draining nodes to take leases for ranges that don't have a known active leaseholder.
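To make the middle ground concrete, here is a minimal sketch of what a relaxed check might look like. The `store`, `rangeState`, and `shouldRejectLeaseRequest` names are hypothetical stand-ins, not the actual CockroachDB types: the idea is just that a draining store refuses a lease request only when the range is already known to have an active leaseholder elsewhere.

```go
package main

import "fmt"

// store is a hypothetical, stripped-down stand-in for *storage.Store.
type store struct {
	draining bool
}

// rangeState is a hypothetical stand-in for per-range lease knowledge.
type rangeState struct {
	// hasActiveLeaseholder is true when some node is known to hold a
	// valid, unexpired lease for this range.
	hasActiveLeaseholder bool
}

// shouldRejectLeaseRequest sketches a relaxed version of the IsDraining
// check quoted above: a draining store still refuses leases during a
// normal drain, but may acquire one when no leaseholder exists, so a
// fully drained cluster can still elect leaseholders and recover.
func shouldRejectLeaseRequest(s store, rs rangeState) bool {
	return s.draining && rs.hasActiveLeaseholder
}

func main() {
	// The stuck scenario from this issue: every node draining, no
	// leaseholder anywhere. The request is now allowed.
	fmt.Println(shouldRejectLeaseRequest(
		store{draining: true}, rangeState{hasActiveLeaseholder: false}))
	// A normal drain: another node holds the lease, so we still refuse.
	fmt.Println(shouldRejectLeaseRequest(
		store{draining: true}, rangeState{hasActiveLeaseholder: true}))
}
```

The hard part, of course, is reliably computing `hasActiveLeaseholder` on a node that may itself be partitioned or stale; this sketch just illustrates the shape of the condition.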
I'm curious to hear @nvanbenschoten's and @asubiotto's thoughts.