kv: paused followers interacts poorly with leader-not-leaseholder state

In a cluster that was overloaded due to an index backfill, I saw a large number of ranges serving foreground traffic begin hitting circuit breaker errors. Digging deeper, it became clear that every range in this state had its Raft leader and leaseholder on different nodes, and that the leaseholder was on an overloaded node. Earlier in the test, I had set `admission.kv.pause_replication_io_threshold` to 0.8 to avoid replicating to overloaded followers.

The combination of paused replicas and the leader-not-leaseholder split effectively caused unavailability. Because the leaseholder was a follower on the overloaded node, the leader paused replication to it. The leaseholder could still propose writes, but it would never hear about their result, so it would never acknowledge those writes to clients.
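To make the failure mode concrete, here is a toy model of the condition described above. It is only a sketch: the `replica` type, the `ioScore` field, and both functions are hypothetical names invented for illustration, not CockroachDB code, and the strict `>` comparison against the threshold is an assumption.

```go
package main

import "fmt"

// replica is a hypothetical stand-in for one replica of a range.
type replica struct {
	node       string
	ioScore    float64 // assumed IO overload score of the replica's store
	isLeader   bool
	holdsLease bool
}

// pausedFollowers returns the nodes of the followers that a leader would stop
// replicating to because their overload score exceeds the threshold.
// Assumption: the leader never pauses itself.
func pausedFollowers(replicas []replica, threshold float64) map[string]bool {
	paused := make(map[string]bool)
	for _, r := range replicas {
		if !r.isLeader && r.ioScore > threshold {
			paused[r.node] = true
		}
	}
	return paused
}

// leaseholderStalled reports whether the leaseholder is one of the paused
// followers. In that state it can propose writes but never learns their
// outcome, so it cannot acknowledge them to clients.
func leaseholderStalled(replicas []replica, threshold float64) bool {
	paused := pausedFollowers(replicas, threshold)
	for _, r := range replicas {
		if r.holdsLease {
			return paused[r.node]
		}
	}
	return false
}

func main() {
	const threshold = 0.8

	// Leader on n1, lease on the overloaded n3: the state seen in this issue.
	split := []replica{
		{node: "n1", ioScore: 0.2, isLeader: true},
		{node: "n2", ioScore: 0.3},
		{node: "n3", ioScore: 1.5, holdsLease: true},
	}
	// Leader and lease colocated on n1: pausing n3 does not block writes.
	colocated := []replica{
		{node: "n1", ioScore: 0.2, isLeader: true, holdsLease: true},
		{node: "n2", ioScore: 0.3},
		{node: "n3", ioScore: 1.5},
	}

	fmt.Println("split stalled:", leaseholderStalled(split, threshold))
	fmt.Println("colocated stalled:", leaseholderStalled(colocated, threshold))
}
```

In this model, only the split configuration stalls, which matches the observation that every affected range had its leaseholder on the overloaded node and its leader elsewhere.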

Should we allow the leaseholder to be paused?

----

On a related note, I noticed that while all nodes had a non-zero value for the `admission.raft.paused_replicas` metric, the overloaded node itself (purple) had a few spikes where it reported thousands of paused replicas. My understanding is that this metric is reported from the leader side, not the follower side, so I don't understand this. Is there any reason why an overloaded node would start pausing other replicas?

<img width="960" alt="Screen Shot 2022-07-21 at 5 51 46 PM" src="https://user-images.githubusercontent.com/5438456/180322522-af01d1ca-9a6e-4a05-ada8-0b32eb73fa7c.png">

Also, the description of this metric says: `The count is emitted by the leaseholder of each range.`

Should this instead say: `The count is emitted by the leader of each range.`?

Jira issue: CRDB-17924

Epic CRDB-15069

kv: paused followers interacts poorly with leader-not-leaseholder state #84884
