storage: Cluster can't recover if all nodes are decommissioned and restarted

A user on our forum reported that they accidentally decommissioned and restarted all the nodes in their cluster: https://forum.cockroachlabs.com/t/restarting-from-decommissioned-nodes/1826

It turns out that if you do this in a cluster of any size (including a local 3-node cluster), the cluster never recovers after the restart because all the nodes still consider themselves to be in a draining state, which causes all `RequestLease` requests to get dropped on the floor:

https://github.com/cockroachdb/cockroach/blob/218a95432fbd3dd9bdde153c7f280eae9a702dd6/pkg/storage/replica_range_lease.go#L527-L531

This affects **all** lease requests, including those for r1, so no range ever gets a leaseholder. This keeps anything from happening in the cluster and even breaks a surprising fraction of our debug pages. It also, of course, prevents the `recommission` command from working.
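To make the failure mode concrete, here is a minimal toy model of it. This is only a sketch, not CockroachDB code: `Store`, `RequestLease`, `AcquireLease`, and `errNotLeaseHolder` are hypothetical stand-ins, and the only behavior modeled is the draining check from the linked snippet.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeaseHolder stands in for the NotLeaseHolderError returned by the
// real code path (hypothetical simplification).
var errNotLeaseHolder = errors.New("not lease holder")

// Store is a hypothetical stand-in for a store; only the draining flag
// matters for this illustration.
type Store struct {
	ID       int
	Draining bool
}

// RequestLease mimics the draining check: a draining store refuses every
// lease request, regardless of which range it is for.
func (s Store) RequestLease(rangeID int) error {
	if s.Draining {
		return errNotLeaseHolder
	}
	return nil
}

// AcquireLease asks each store in turn and returns the ID of the first one
// that accepts the lease, or false if every store refuses.
func AcquireLease(stores []Store, rangeID int) (int, bool) {
	for _, s := range stores {
		if err := s.RequestLease(rangeID); err == nil {
			return s.ID, true
		}
	}
	return 0, false
}

func main() {
	// All three nodes restarted while still marked as draining.
	stores := []Store{{ID: 1, Draining: true}, {ID: 2, Draining: true}, {ID: 3, Draining: true}}
	_, ok := AcquireLease(stores, 1)
	fmt.Println("r1 has leaseholder:", ok)
}
```

With every store draining, no candidate ever accepts the lease for r1, so nothing that depends on r1 can make progress.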

I was able to "fix" such a cluster by removing the code highlighted above and rebuilding the binary, but that's not a great solution given that in most cases we really don't want to request new leases on decommissioned nodes.

It may be preferable to reject decommissioning attempts that would prevent a cluster from continuing to function. That'd be tough to guarantee, though, if we want to protect all ranges from losing quorum. So perhaps there's a middle ground where we allow draining nodes to take leases for ranges that don't have a known active leaseholder.
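As a rough sketch of what that middle ground could look like, the draining check might consult what the store knows about the current lease before rejecting. Everything here is hypothetical: `LeaseState` and `ShouldRejectLeaseRequest` are illustrative names, and how a store would actually decide that a leaseholder is "known" and "active" is an assumption left open by this issue.

```go
package main

import "fmt"

// LeaseState is a hypothetical summary of what a store knows about a
// range's current lease.
type LeaseState struct {
	HasLease   bool // a lease record exists for the range
	HolderLive bool // the recorded holder is believed to be live
}

// ShouldRejectLeaseRequest sketches the relaxed policy: a draining store
// still refuses to acquire a lease when a known active leaseholder exists,
// but is allowed to step in when there is none.
func ShouldRejectLeaseRequest(draining bool, ls LeaseState) bool {
	if !draining {
		// Non-draining stores are unaffected by this check.
		return false
	}
	return ls.HasLease && ls.HolderLive
}

func main() {
	// Draining store, range has a live leaseholder elsewhere: keep rejecting.
	fmt.Println(ShouldRejectLeaseRequest(true, LeaseState{HasLease: true, HolderLive: true}))
	// Draining store, no known active leaseholder: allow the request.
	fmt.Println(ShouldRejectLeaseRequest(true, LeaseState{HasLease: false}))
}
```

Under this policy the all-nodes-draining case described above would no longer deadlock, because no range has a live leaseholder after the restart, while ordinary decommissioning would still avoid pulling leases onto draining nodes.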

I'm curious for @nvanbenschoten and @asubiotto's thoughts.

	if r.store.IsDraining() {
		// We've retired from active duty.
		return r.mu.pendingLeaseRequest.newResolvedHandle(roachpb.NewError(
			newNotLeaseHolderError(nil, r.store.StoreID(), r.mu.state.Desc)))
	}

storage: Cluster can't recover if all nodes are decommissioned and restarted #27444
