stability: Rebalances must be atomic #12768

@bdarnell

Description

Consider a deployment with three datacenters (and a replication configuration that places one replica of each range in each DC). During a rebalance, a range that normally requires a quorum of 2/3 will briefly change to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing instead down-replicated before up-replicating, the intermediate state would have a 2/2 quorum, and a failure of either datacenter would cause unavailability and much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for the duration.

The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (it is described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol which has the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.

We might also be able to mitigate the problem by using ideas from Flexible Paxos. Flexible Paxos shows that committing entries can be done with a bare majority of nodes, so as long as the leader is not in the failed datacenter, the removal of the fourth node can be completed while the DC is down and the range is restored to its 2/3 state. However, if the leader were in the failed DC, the two surviving datacenters would be unable to make progress, since they would have to assume that the former leader is still making progress on the other side of its partition. I'm not sure if there's a full solution here or not.

Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so when the range goes from 3/5 to 4/6, those six replicas are guaranteed to be two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids this problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.

Metadata

Labels

- A-kv-replication: Relating to Raft, consensus, and coordination.
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- S-1-blocking-adoption
