
allocator,admission: consider resource utilization + throttling signals directly #83490

@irfansharif

Description


Is your feature request related to a problem? Please describe.

Allocation in CRDB is expressed in terms of an abstract "# of batch requests" unit, which can be fairly divorced from actual hardware consumption. It is difficult to tune (it cannot be normalized to capacity) and to reason about, and it lends itself to awkward calibration in practice (#76252).

Describe the solution you'd like

Model allocation directly in terms of resource utilization, without collapsing the different resource dimensions (disk bandwidth, IOPS, CPU) into a single unit. Allocation should also be thought of as operating on a layer "above" admission control (AC). AC introduces artificial delays to prevent node overload, so a throttled replica's observed resource use understates its demand. Ignoring this throttling would prevent us from distinguishing between two replicas with identical rates of resource use where one of them could sustain a much higher rate if it were placed elsewhere with headroom. At a high level, we should develop and use measures for:

  • what utilization each node observes for each resource dimension (easy for CPU; more difficult for bandwidth and IOPS unless the provisioned amount is provided or measured directly at process start, TODO). This is necessary to understand what headroom is available or could be made available through rebalancing;
  • what resource dimension(s) are observing saturation and on what nodes (can be surfaced through admission control);
  • some attribution of resource use to individual ranges/tenants per store (easy for disk bandwidth and IOPS if we ignore the page cache and subtract the effects of the block cache; possible for CPU with fine-grained CPU attribution, see rfcs: fine-grained cpu attribution #82356);
  • the rate of throttling experienced per replica/tenant because of saturation (useful to forecast the effect of a lease transfer or replica movement).
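To make the throttling point concrete, here is a minimal Go sketch, assuming (hypothetically) that attribution yields a per-replica vector of consumed resources and that admission control can report the resources a replica's delayed work would have consumed. The names `Vector`, `ReplicaLoad`, `Demand`, and `Headroom`, and the units, are illustrative only; they are not CRDB APIs. Demand is estimated as observed use plus throttled use, kept per dimension rather than summed into one number.

```go
package main

import "fmt"

// Dimension is one resource axis; CPU, disk bandwidth, and IOPS come from the issue.
type Dimension int

const (
	CPU Dimension = iota
	DiskBandwidth
	IOPS
	numDimensions
)

// Vector holds one value per resource dimension. Units are hypothetical:
// CPU in cores, disk bandwidth in MB/s, IOPS in ops/s.
type Vector [numDimensions]float64

// ReplicaLoad is a hypothetical per-replica record. Observed is the resource use
// attributed to the replica; Throttled is the use its AC-delayed work would have
// added (assumed to be measurable, which the issue lists as future work).
type ReplicaLoad struct {
	Observed  Vector
	Throttled Vector
}

// Demand estimates what the replica would consume on a node with headroom:
// observed use plus the use that throttling suppressed, per dimension.
func (r ReplicaLoad) Demand() Vector {
	var d Vector
	for i := range d {
		d[i] = r.Observed[i] + r.Throttled[i]
	}
	return d
}

// Headroom returns capacity minus utilization for each dimension, floored at zero.
func Headroom(capacity, used Vector) Vector {
	var h Vector
	for i := range h {
		if rem := capacity[i] - used[i]; rem > 0 {
			h[i] = rem
		}
	}
	return h
}

func main() {
	// Two replicas with identical observed use; only b is being throttled.
	a := ReplicaLoad{Observed: Vector{1.0, 40, 500}}
	b := ReplicaLoad{Observed: Vector{1.0, 40, 500}, Throttled: Vector{0.5, 20, 250}}
	fmt.Println(a.Demand()) // [1 40 500]
	fmt.Println(b.Demand()) // [1.5 60 750]
}
```

Looking only at `Observed`, the two replicas are indistinguishable; the throttled component is what reveals that `b` would grow if moved to a store with headroom.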

It's worth exploring what allocation could look like when it has to juggle several separate resource dimensions.
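One possible shape for this, sketched in Go under stated assumptions: a candidate store is acceptable only if adding the replica's demand keeps every dimension at or below a utilization threshold (a per-dimension "vector fit" check), and among acceptable stores the one with the most remaining headroom on its tightest dimension wins (a max-min tiebreak). This is an illustrative technique, not how CRDB's allocator works; `Store`, `fits`, `slack`, `pickTarget`, the 0.8 threshold, and all numbers are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Vector holds CPU, disk bandwidth, and IOPS values (hypothetical units).
type Vector [3]float64

// Store is a hypothetical candidate with per-dimension capacity and utilization.
type Store struct {
	ID       int
	Capacity Vector
	Used     Vector
}

// fits reports whether adding demand keeps every dimension at or below
// threshold*capacity. Dimensions are checked independently, never summed.
func fits(s Store, demand Vector, threshold float64) bool {
	for i := range demand {
		if s.Used[i]+demand[i] > threshold*s.Capacity[i] {
			return false
		}
	}
	return true
}

// slack is the smallest remaining headroom fraction across dimensions after
// placing demand, i.e. how much room the store's bottleneck dimension has left.
func slack(s Store, demand Vector) float64 {
	minFrac := math.Inf(1)
	for i := range demand {
		frac := (s.Capacity[i] - s.Used[i] - demand[i]) / s.Capacity[i]
		minFrac = math.Min(minFrac, frac)
	}
	return minFrac
}

// pickTarget returns the ID of the fitting store with the most bottleneck slack,
// or -1 if no store can absorb the demand.
func pickTarget(stores []Store, demand Vector, threshold float64) int {
	best, bestSlack := -1, math.Inf(-1)
	for _, s := range stores {
		if !fits(s, demand, threshold) {
			continue
		}
		if sl := slack(s, demand); sl > bestSlack {
			best, bestSlack = s.ID, sl
		}
	}
	return best
}

func main() {
	demand := Vector{1.5, 60, 750}
	stores := []Store{
		{ID: 1, Capacity: Vector{8, 500, 10000}, Used: Vector{2, 420, 3000}}, // disk bandwidth nearly saturated
		{ID: 2, Capacity: Vector{8, 500, 10000}, Used: Vector{4, 100, 2000}}, // moderate CPU, idle disk
		{ID: 3, Capacity: Vector{8, 500, 10000}, Used: Vector{3, 300, 1000}}, // disk bandwidth fairly busy
	}
	fmt.Println(pickTarget(stores, demand, 0.8)) // 2
}
```

Store 1 is rejected because disk bandwidth would exceed the threshold even though its CPU is mostly idle, which is exactly the case a single collapsed unit can hide. Stores 2 and 3 both fit; store 2 wins because its tightest dimension after the move (CPU) keeps more relative headroom than store 3's (disk bandwidth).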

Additional context

This issue is a resuscitation of #34590 with more words.

Jira issue: CRDB-17098

Metadata


Assignees

No one assigned

    Labels

    A-admission-control
    A-kv-distribution: Relating to rebalancing and leasing.
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-kv: KV Team
