
storage: replicate queue has tendency to remove just-added replicas #17879

@a-robinson

Description

  1. After the replicate queue adds a new replica as part of a rebalance, it immediately re-queues the replica for processing so that it can choose to remove a replica.
  2. As part of choosing which replica to remove, we filter out of consideration any replicas that are necessary for the range's quorum. This filtering compares each replica's raft commit index to the quorum commit index, treating any replica even a single commit behind as not part of the quorum.
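The filtering in step 2 can be sketched roughly as follows. This is a minimal illustration, not the actual `filterUnremovableReplicas` implementation; the `replica` type, the `filterUnremovable` name, and the sample indexes are all made up for the example:

```go
package main

import "fmt"

// replica is a toy stand-in for the per-replica raft progress the
// leaseholder tracks.
type replica struct {
	id    int
	index uint64 // raft commit index reported for this replica
}

// filterUnremovable returns the replicas that may be removed: exactly
// those strictly behind the quorum commit index. Replicas at or above
// it are treated as necessary for the quorum and filtered out, even
// when a replica is only a single commit behind.
func filterUnremovable(replicas []replica, quorumCommit uint64) []replica {
	var removable []replica
	for _, r := range replicas {
		if r.index < quorumCommit {
			removable = append(removable, r)
		}
	}
	return removable
}

func main() {
	// Four replicas right after a rebalance: the freshly added replica
	// (id 4) is still applying its snapshot, and replica 3 lags by one
	// commit. Only those two are considered removable.
	replicas := []replica{{1, 100}, {2, 100}, {3, 99}, {4, 0}}
	for _, r := range filterUnremovable(replicas, 100) {
		fmt.Println(r.id) // prints 3, then 4
	}
}
```

Note how the one-commit-behind replica and the brand-new replica end up as the only removal candidates, which is the interaction described below.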

Both of these actions are reasonable, but they interact poorly in high-latency clusters, or in any cluster where one of the previously existing replicas was lagging behind. The problem is that in many cases the newly added replica won't have finished receiving/applying its snapshot and catching up (or at least the leaseholder isn't yet aware that it has done so). If any of the other replicas is also behind (as can easily happen on indigo, where the nodes are different distances apart), then those two lagging replicas are the only ones we'll consider removing. The other two replicas are considered necessary for the quorum since they're the only two that are up to date.

This unfortunate behavior can easily lead to thrashing if the replica that the allocator wanted to rebalance away from is one of the two that can't be removed. This affects both stats-based rebalancing and the range-count form of rebalancing, although its effect is more severe for stats-based rebalancing. It happens very reliably whenever running indigo with no data other than the timeseries ranges.

We could fix this in a few different ways, but we might not want to do so this close to 1.1. If we don't fix it, we'll definitely have to disable stats-based rebalancing by default (#17645).

The first approaches to fixing it that come to mind:

  1. Wait before re-queueing after doing a rebalance. We could just wait for a set amount of time on the order of seconds, or for a dynamic time estimated based on how long the snapshot should take, or we could poll the range state until the new replica gets caught up. These approaches would all be susceptible to the Scanner re-adding the replica to the replicate queue, circumventing the wait.
  2. Be less harsh in filterUnremovableReplicas. If a replica is less than N commits behind, don't rule it out. If we did this, then even if an existing replica is behind by a commit or two there will still be 3 valid replicas, meaning any replica can be removed even if the new replica hasn't caught up.
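Approach 2 could look something like the sketch below. The `maxLagCommits` constant, its value, and the `removalCandidates` name are hypothetical, not anything in the codebase; the idea is just that replicas within a small cushion of the quorum commit index still count as up to date, and once enough replicas count as up to date, any replica becomes a legal removal target:

```go
package main

import "fmt"

// maxLagCommits is a hypothetical cushion: the N from the text.
const maxLagCommits = 5

type replica struct {
	id    int
	index uint64 // raft commit index reported for this replica
}

// removalCandidates treats replicas within maxLagCommits of the quorum
// commit index as up to date. If removing one up-to-date replica would
// still leave a quorum of up-to-date replicas, any replica may be
// removed; otherwise only the genuinely behind replicas are candidates.
func removalCandidates(replicas []replica, quorumCommit uint64) []replica {
	var upToDate, behind []replica
	for _, r := range replicas {
		if r.index+maxLagCommits >= quorumCommit {
			upToDate = append(upToDate, r)
		} else {
			behind = append(behind, r)
		}
	}
	// Quorum size of the range after one replica is removed.
	newQuorum := (len(replicas)-1)/2 + 1
	if len(upToDate) > newQuorum {
		return replicas
	}
	return behind
}

func main() {
	// Same situation as in the issue: one existing replica a single
	// commit behind, plus a new replica still catching up. With the
	// cushion, three replicas count as up to date, so all four are
	// removal candidates and the allocator isn't forced to thrash.
	replicas := []replica{{1, 100}, {2, 100}, {3, 99}, {4, 0}}
	fmt.Println(len(removalCandidates(replicas, 100))) // prints 4
}
```

With the strict filtering (no cushion), the same input would leave only the two lagging replicas as candidates, reproducing the problem described above.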

I like 2, but I may be forgetting a reason why we can't loosen that up. I thought we used to allow a cushion here, but we clearly don't right now. @petermattis
