kvserver: allocation algorithm should transfer replicas and leases away from an overloaded store #82611

@kvoli

Description

In several recent incidents, a store became overloaded (CPU, IO) even though admission control was enabled. This is partly because load on follower stores is not subject to the admission control applied on the leaseholder store.

In this case, the store rebalancer and the replicate queue are unaware of the store's degraded state: they consider only QPS and range count, respectively, in addition to existing constraint signals such as disk fullness and L0 sublevels [#78608].

The proposed solution is to also consider normalized CPU usage, in much the same way as L0 sublevels.

This solution should be implemented in two phases:

  1. Add a constraint on store CPU that blocks transferring replicas or leases towards an overloaded store. This check already exists for L0 sublevels.
  2. Enforce the constraint when considering replicas or leases for transfer away from an overloaded store. That is, when a store fails the overload check, stores holding leases for ranges with a replica on the overloaded store will begin transferring those replicas elsewhere. The overloaded store itself will likewise seek to transfer its own leases.
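
The two phases could be sketched roughly as below. This is only an illustration: `storeID`, `filterTargets`, and `replicasToShed` are hypothetical names, not existing kvserver identifiers, and the overload state is assumed to be precomputed per store.

```go
package main

// storeID is a hypothetical stand-in for a store identifier.
type storeID int

// filterTargets sketches phase 1: overloaded stores are dropped from the
// set of candidate targets for a replica or lease transfer.
func filterTargets(candidates []storeID, overloaded map[storeID]bool) []storeID {
	var out []storeID
	for _, s := range candidates {
		if !overloaded[s] {
			out = append(out, s)
		}
	}
	return out
}

// replicasToShed sketches phase 2: given the stores holding a range's
// replicas, it returns the overloaded ones, which the leaseholder would
// try to move elsewhere.
func replicasToShed(replicaStores []storeID, overloaded map[storeID]bool) []storeID {
	var out []storeID
	for _, s := range replicaStores {
		if overloaded[s] {
			out = append(out, s)
		}
	}
	return out
}
```

With stores 1, 2, 3 and only store 2 overloaded, `filterTargets` keeps stores 1 and 3 as valid targets, while `replicasToShed` reports store 2 as the one to move off.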

The overload check must be both relative to the cluster average and bounded by an absolute threshold.

e.g.

sys_usr_cpu_normalized > threshold && sys_usr_cpu_normalized > mean(cluster...) * 1.1
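
A minimal Go sketch of this predicate follows. The `overloadParams` struct and the function names are assumptions made for illustration; the 1.1 factor comes from the example above, and the absolute threshold value is left to the caller because the issue does not specify one.

```go
package main

// overloadParams holds hypothetical tuning knobs for the check.
type overloadParams struct {
	threshold  float64 // absolute floor on normalized CPU (value is an assumption)
	meanFactor float64 // multiple of the cluster mean; 1.1 in the example above
}

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// cpuOverloaded reports whether a store's normalized CPU exceeds both the
// absolute threshold and the cluster mean scaled by meanFactor.
func cpuOverloaded(storeCPU float64, clusterCPU []float64, p overloadParams) bool {
	return storeCPU > p.threshold && storeCPU > mean(clusterCPU)*p.meanFactor
}
```

Requiring both conditions means a uniformly busy cluster does not flag every store (no store is far above the mean), and a lightly loaded cluster does not flag a store that is only relatively hotter than its peers.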

Jira issue: CRDB-16533

Metadata

Assignees

No one assigned

Labels

A-kv: Anything in KV that doesn't belong in a more specific category.
A-kv-distribution: Relating to rebalancing and leasing.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
O-sre: For issues SRE opened or otherwise cares about tracking.
T-kv: KV Team
