
storage: Cluster can't recover if all nodes are decommissioned and restarted #27444

@a-robinson

A user on our forum reported that they accidentally decommissioned and restarted all the nodes in their cluster: https://forum.cockroachlabs.com/t/restarting-from-decommissioned-nodes/1826

It turns out that if you do this in a cluster of any size (including a local 3-node cluster), the cluster never recovers after the restart because all the nodes still consider themselves to be in a draining state, which causes all RequestLease requests to get dropped on the floor:

    if r.store.IsDraining() {
        // We've retired from active duty.
        return r.mu.pendingLeaseRequest.newResolvedHandle(roachpb.NewError(
            newNotLeaseHolderError(nil, r.store.StoreID(), r.mu.state.Desc)))
    }

This affects all lease requests, including those for r1, so no range ever gets a leaseholder. This keeps anything from happening in the cluster and even breaks a surprising fraction of our debug pages. It also, of course, prevents the recommission command from working.

I was able to "fix" such a cluster by removing the code highlighted above and rebuilding the binary, but that's not a great solution given that in most cases we really don't want to request new leases on decommissioned nodes.

It may be preferable to reject decommissioning attempts that would prevent a cluster from continuing to function. That'd be tough to guarantee, though, if we want to protect all ranges from losing quorum. So perhaps there's a middle ground where we allow draining nodes to take leases for ranges that don't have a known active leaseholder.
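
To make the middle ground concrete, here is a minimal sketch of the decision it implies. It is not CockroachDB code: `leaseState`, `hasActiveLeaseholder`, and `shouldRejectLeaseRequest` are hypothetical names invented for illustration, and how a store would actually learn that a leaseholder is "known active" is left as an assumption.

```go
package main

import "fmt"

// leaseState is a hypothetical, simplified view of a range's lease.
// It does not correspond to any real CockroachDB type.
type leaseState struct {
	holderStoreID int  // 0 means no leaseholder is known
	holderLive    bool // assumption: some liveness signal tells us this
}

// hasActiveLeaseholder reports whether a live store is known to hold the lease.
func (l leaseState) hasActiveLeaseholder() bool {
	return l.holderStoreID != 0 && l.holderLive
}

// shouldRejectLeaseRequest sketches the proposed middle ground: a draining
// store still refuses lease requests when another active leaseholder is
// known, but accepts them when the range would otherwise have none.
func shouldRejectLeaseRequest(storeDraining bool, l leaseState) bool {
	if !storeDraining {
		return false // non-draining stores behave as before
	}
	return l.hasActiveLeaseholder()
}

func main() {
	cases := []struct {
		name     string
		draining bool
		lease    leaseState
	}{
		{"healthy store", false, leaseState{}},
		{"draining, live leaseholder elsewhere", true, leaseState{holderStoreID: 2, holderLive: true}},
		{"draining, leaseholder is dead", true, leaseState{holderStoreID: 2, holderLive: false}},
		{"draining, no leaseholder at all", true, leaseState{}},
	}
	for _, c := range cases {
		fmt.Printf("%s: reject=%t\n", c.name, shouldRejectLeaseRequest(c.draining, c.lease))
	}
}
```

Under this rule, the all-nodes-decommissioned case described above falls into the last two rows: no range has an active leaseholder, so a draining store would be allowed to acquire the lease and the recommission command could make progress.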

I'm curious for @nvanbenschoten and @asubiotto's thoughts.


Labels

- A-kv-replication: Relating to Raft, consensus, and coordination.
- C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- O-community: Originated from the community
- S-0-corruption-or-data-loss: Unrecoverable corruption, data loss, or other catastrophic issues that can’t be fixed by upgrading.
