kvserver: allocation algorithm should transfer replicas and leases away from an overloaded store #82611
Description
Several recent incidents have shown that a store can become overloaded (CPU, IO) even with admission control enabled. This is partly because load on follower stores is not subject to admission control at the leaseholder store.
In this case the store rebalancer and replicate queue are unaware of the store's degraded state: they consider only QPS and range count, respectively, in addition to existing constraint signals such as disk fullness and L0 sublevels [#78608].
The proposed solution is to also take normalized CPU usage into account, in much the same way as L0 sublevels.
This solution should be implemented in two phases:
- Add a constraint on store CPU that prevents transferring replicas or leases towards an overloaded store. This check already exists for L0 sublevels.
- Enforce the constraint when considering replicas or leases for transfer away from an overloaded store, i.e. when a store is deemed overloaded by this check, stores holding leases for ranges that have a replica on the overloaded store will begin transferring those replicas elsewhere. The overloaded store itself will likewise seek to transfer its own leases.
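A minimal sketch of the two phases, for illustration only. The `Store` type, its `Overloaded` field, and both function names are hypothetical and do not correspond to the actual kvserver allocator code; the overload decision itself is assumed to be computed elsewhere.

```go
package main

// Store is a hypothetical, simplified view of a store for this sketch.
type Store struct {
	ID         int
	Overloaded bool // assumed to be set by some overload check
}

// filterTargets sketches phase 1: overloaded stores are excluded from
// the candidate targets for a replica or lease transfer.
func filterTargets(candidates []Store) []Store {
	var targets []Store
	for _, s := range candidates {
		if !s.Overloaded {
			targets = append(targets, s)
		}
	}
	return targets
}

// storesToShedFrom sketches phase 2: given the stores holding a range's
// replicas, it returns the IDs of those that are overloaded and whose
// replica (or lease) should therefore be moved elsewhere.
func storesToShedFrom(replicaStores []Store) []int {
	var ids []int
	for _, s := range replicaStores {
		if s.Overloaded {
			ids = append(ids, s.ID)
		}
	}
	return ids
}
```

Phase 1 only changes where replicas and leases may go; phase 2 additionally triggers movement, which is presumably why it is proposed as a separate, later step.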
The overload check must be both relative to the cluster average and subject to some absolute threshold.
e.g.
sys_usr_cpu_normalized > threshold && sys_usr_cpu_normalized > mean(cluster...) * 1.1
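The expression above could be sketched in Go roughly as follows. The function names and parameters are hypothetical, and `factor` stands in for the `1.1` multiplier; this is an illustration of the check's shape, not the allocator's actual implementation.

```go
package main

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// isCPUOverloaded reports whether a store's normalized CPU usage exceeds
// both an absolute threshold and the cluster mean scaled by factor.
// All names here are hypothetical.
func isCPUOverloaded(storeCPU float64, clusterCPU []float64, threshold, factor float64) bool {
	return storeCPU > threshold && storeCPU > mean(clusterCPU)*factor
}
```

Requiring both conditions means a uniformly busy cluster does not trigger transfers (no store is meaningfully above the mean), and a store that is above the mean but still lightly loaded is left alone.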
Jira issue: CRDB-16533