storage: Delayed update of per-store write stats can cause rebalance thrashing #17970
Description
After a rebalance, the LogicalBytes and WritesPerSecond stats logically change for each store that the replica was added to or removed from. If a store's LogicalBytes stat changes by a large amount, the store re-gossips its StoreCapacity ahead of schedule. This isn't guaranteed by any means, but in practice it does a mostly decent job of spreading updated information through the cluster quickly enough that a bunch more rebalance operations aren't based on outdated information.
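To make the asymmetry concrete, here is a minimal sketch of an early re-gossip check that only looks at LogicalBytes. The `storeCapacity` type, the `shouldRegossip` function, and the fractional `threshold` parameter are all hypothetical names for illustration and are not the actual implementation:

```go
package main

// storeCapacity is a simplified, hypothetical stand-in for the gossiped
// StoreCapacity. Only the two stats discussed in this issue are modeled.
type storeCapacity struct {
	LogicalBytes    int64
	WritesPerSecond float64
}

// shouldRegossip reports whether the current capacity differs enough from the
// last gossiped one to justify gossiping ahead of schedule. In this sketch
// only LogicalBytes is compared; threshold is an assumed fractional change
// (for example 0.05 for 5%).
func shouldRegossip(lastGossiped, cur storeCapacity, threshold float64) bool {
	delta := float64(cur.LogicalBytes - lastGossiped.LogicalBytes)
	if delta < 0 {
		delta = -delta
	}
	base := float64(lastGossiped.LogicalBytes)
	if base == 0 {
		// Any change from an empty store counts as a large change.
		return cur.LogicalBytes != 0
	}
	return delta/base > threshold
}
```

Under this sketch, a store whose WritesPerSecond jumps while LogicalBytes stays flat never triggers an early re-gossip, which is the gap described in this issue.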
The same is not true for WritesPerSecond, though. We don't start counting WritesPerSecond stats until a replica has been on a new store for 5 seconds, so many more rebalancing decisions can be made without considering the additional writes on the node. In some circumstances (as seen on indigo), this can cause rebalance thrashing, where we add a replica to a node and then decide that the replica isn't a good fit for it. If the store's WritesPerSecond stat had been updated, we wouldn't move the replica.
As seen on indigo, this was closely intertwined with #17971.
@BramGruneir has previously suggested passing along WritesPerSecond stats as part of a rebalance to combat this. I'm still not sure about that, but the leaseholder that made the rebalance decisions should be capable of updating its own local copy of the other store's descriptor so that it has a more accurate view when deciding which replica to remove as part of a rebalance.
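A minimal sketch of that last idea, assuming the leaseholder keeps a map of store descriptors that it can adjust in place. `storePool`, `storeDescriptor`, and `applyRebalance` here are simplified, hypothetical names and signatures for illustration, not the existing API:

```go
package main

type storeID int

// storeDescriptor is a hypothetical local copy of another store's gossiped
// stats, as held by the leaseholder making rebalance decisions.
type storeDescriptor struct {
	ID              storeID
	LogicalBytes    int64
	WritesPerSecond float64
}

// storePool is a simplified local cache of store descriptors.
type storePool struct {
	stores map[storeID]*storeDescriptor
}

// applyRebalance adjusts the local copies right after a rebalance so that
// follow-up decisions (such as which replica to remove) see the expected
// effect of the move instead of waiting for the next gossip update.
func (sp *storePool) applyRebalance(to, from storeID, replicaBytes int64, replicaWPS float64) {
	if d, ok := sp.stores[to]; ok {
		d.LogicalBytes += replicaBytes
		d.WritesPerSecond += replicaWPS
	}
	if d, ok := sp.stores[from]; ok {
		d.LogicalBytes -= replicaBytes
		d.WritesPerSecond -= replicaWPS
		// Clamp at zero in case the local estimates were already stale.
		if d.LogicalBytes < 0 {
			d.LogicalBytes = 0
		}
		if d.WritesPerSecond < 0 {
			d.WritesPerSecond = 0
		}
	}
}
```

The adjusted values are only local estimates. In this sketch the next gossiped StoreCapacity for the store would presumably overwrite them, so the estimate only has to be good enough to cover the window before real stats arrive.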