Skip to content

kv: BatchRequest count used as QPS metric for rebalancing, not Request count #50620

@nvb

Description

@nvb

In a recent customer report, we found that load-based rebalancing was failing to properly balance leaseholders, resulting in imbalanced load. After some deep investigation, @yuzefovich and @tbg found that the problem could be traced back to lookup joins. Specifically, one of the nodes was being tasked with running all of the lookup joins in the workload. The question is, why wasn't this picked up by either the hotranges report or load-based balancing?

After an in-person discussion, we believe this is because both of these sources rely on the leaseholderStats on each replica. These statistics only track the number of BatchRequests evaluated on a leaseholder, and not the number of individual requests:

if r.leaseholderStats != nil && ba.Header.GatewayNodeID != 0 {
r.leaseholderStats.record(ba.Header.GatewayNodeID)
}

The hypothesis is that if this line was changed to r.leaseholderStats.recordCount(len(ba.Requests), ba.Header.GatewayNodeID), load-based lease rebalancing would have avoided this issue. This is because (in v19.2) lookup joins issue batches of 100 scans. So these batches would only be counted once towards a range's qps for load balancing purposes, but would place 100x the load on the leaseholder evaluating them.

We should test that hypothesis.

We should also determine whether we actually want to make a change here. It's problematic that we don't consider the size of batch requests in these heuristics. However, changing that now could lead to surprising effects in other areas. For instance, it would also weigh follow-the-workload rebalancing in favor of gateways that issue multi-request batches. Do we want that?

Jira issue: CRDB-4109

Metadata

Metadata

Assignees

Labels

A-kv-distributionRelating to rebalancing and leasing.C-performancePerf of queries or internals. Solution not expected to change functional behavior.T-kvKV TeamX-staleno-issue-activity

Type

No type

Projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions