stability: Rebalances must be atomic #12768
Consider a deployment with three datacenters and a replication configuration that places one replica of each range in each DC. During a rebalance, a range that normally requires a quorum of 2/3 briefly moves to a configuration with a 3/4 quorum and two replicas in one DC. If that DC (or the network) fails before the fourth replica is removed, the range becomes unavailable. (If rebalancing down-replicated before up-replicating, the intermediate state would instead have a quorum of 2/2, and a failure of either datacenter would lead to unavailability and much harder recovery.) While the window is fairly small, there is a risk that an ill-timed DC failure could make one or more ranges unavailable for its duration.
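To make the failure window concrete, here is a small sketch of the quorum math above (the function names are illustrative, not CockroachDB code):

```python
def quorum(n):
    """Majority quorum size for n voting replicas."""
    return n // 2 + 1

def survives_dc_failure(replicas_per_dc):
    """True if losing any single DC still leaves a quorum of replicas."""
    total = sum(replicas_per_dc)
    return all(total - in_dc >= quorum(total) for in_dc in replicas_per_dc)

# Steady state: one replica per DC, quorum 2/3; any single DC can fail.
assert survives_dc_failure([1, 1, 1])

# Intermediate rebalance state: a fourth replica added, two in one DC,
# quorum 3/4. Losing the doubled DC leaves only 2 of 4 replicas.
assert not survives_dc_failure([2, 1, 1])

# Down-replicate-first alternative: quorum 2/2, so losing either
# remaining DC loses quorum.
assert not survives_dc_failure([1, 1])
```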
The most general solution to this problem is to implement "joint consensus", the original raft membership change proposal (it is described in the original raft paper and in section 4.3 of Diego's dissertation). Etcd implements a much simpler membership change protocol which has the limitation that only one change can be made at a time (leading to the brief exposure of the intermediate 3/4 state). Joint consensus would let us make these rebalancing changes atomically.
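A sketch of the joint-consensus commit rule, per section 4.3 of the dissertation: during the transition, an entry commits only when it is acknowledged by a majority of the old configuration and a majority of the new one. The example configurations below are illustrative.

```python
def majority(config, acks):
    """True if acks cover a majority of the given configuration."""
    return len(config & acks) > len(config) // 2

def committed_joint(old_config, new_config, acks):
    """Joint-consensus commit rule: need majorities of BOTH configs."""
    return majority(old_config, acks) and majority(new_config, acks)

old = {"a", "b", "c"}  # one replica per DC
new = {"a", "b", "d"}  # atomically swap c for d

# Acks from a and b form a majority of both configs, so the swap can
# commit without ever passing through a 3/4 intermediate state.
assert committed_joint(old, new, {"a", "b"})

# Acks from c and d alone form a majority of neither config.
assert not committed_joint(old, new, {"c", "d"})
```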
We might also be able to mitigate the problem using ideas from Flexible Paxos. Flexible Paxos shows that entries can be committed with a bare majority of nodes, so as long as the leader is not in the failed datacenter, the removal of the fourth replica can be completed while the DC is down and the range restored to its 2/3 state. However, if the leader were in the failed DC, the two surviving DCs would be unable to make progress, since they would have to assume that the former leader is still making progress on the other side of its partition. I'm not sure whether there's a full solution here.
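The Flexible Paxos condition is that leader-election quorums (Q1) and replication quorums (Q2) need only intersect, i.e. q1 + q2 > n. A sketch of how that plays out in the 3/4 intermediate state (illustrative only):

```python
def valid_fpaxos(n, q1, q2):
    """Flexible Paxos safety condition: every election quorum must
    intersect every replication quorum."""
    return q1 + q2 > n

n = 4  # the intermediate 3/4 state

# With an election quorum of 3, entries can replicate on a bare
# majority of 2 nodes, so a surviving leader can finish the removal.
assert valid_fpaxos(n, q1=3, q2=2)

# But if the leader's DC fails, the 2 surviving nodes cannot safely
# elect a new leader: an election quorum of 2 would not intersect all
# replication quorums, matching the limitation described above.
assert not valid_fpaxos(n, q1=2, q2=2)
```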
Running with a higher replication factor (5 replicas in 3 DCs) could also mitigate the problem if the rebalancer were aware of it (so when the range goes from 3/5 to 4/6, those six replicas are guaranteed to be two per DC). This might be a quick fix to prevent this problem from striking a critical system range. Running with more DCs than replicas (3 replicas in 4 DCs, so you never need to place two replicas in the same DC) also avoids this problem without increasing storage costs, but it has the significant downside that all rebalancing must happen over the more expensive inter-DC links.
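The replication-factor mitigation can be checked with the same quorum arithmetic (again an illustrative sketch, not CockroachDB code):

```python
def quorum(n):
    """Majority quorum size for n voting replicas."""
    return n // 2 + 1

def survives_dc_failure(replicas_per_dc):
    """True if losing any single DC still leaves a quorum of replicas."""
    total = sum(replicas_per_dc)
    return all(total - in_dc >= quorum(total) for in_dc in replicas_per_dc)

# Steady state: 5 replicas across 3 DCs, quorum 3/5.
assert survives_dc_failure([2, 2, 1])

# DC-aware intermediate state: 6 replicas, two per DC, quorum 4/6;
# losing any one DC still leaves 4 replicas.
assert survives_dc_failure([2, 2, 2])

# A DC-unaware rebalancer could place three replicas in one DC,
# losing quorum if that DC fails.
assert not survives_dc_failure([3, 2, 1])
```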