A node that is offline for a short period can cause a performance issue when it rejoins the cluster

**Describe the problem**

In several customer cases, we have seen performance degrade when a node rejoins a cluster after being offline for a short period of time. While the node is offline, the replicas on it begin up-replicating to other nodes; however, in many cases the node still holds a number of replicas when it comes back. Once the node is back, these replicas all attempt to catch up on their Raft logs, which can cause a severe IO spike on the node. In many cases this causes LSM inversion, and admission control throttles writes to this store. In addition, leases begin to be transferred onto this node, which compounds the problem.
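To get a feel for how much data the returning node may need to catch up on, here is a back-of-envelope sketch. It is not taken from CockroachDB; the function name, the even-spread assumption, and every input value (write rate, replication factor, fraction of replicas still on the node) are hypothetical placeholders. It also ignores Raft overhead, snapshots, and LSM write amplification, so treat the result as a rough lower bound on the write volume.

```python
def catchup_backlog_bytes(write_ops_per_sec, block_bytes, downtime_sec,
                          nodes, replication_factor, retained_fraction):
    """Rough estimate of Raft log bytes a returning node must apply.

    Assumes replicas are spread evenly, so a single node holds a replica
    for roughly replication_factor / nodes of all ranges.
    retained_fraction is the share of those replicas that were NOT moved
    elsewhere while the node was down (hypothetical input).
    """
    # Share of cluster-wide writes that touch a range with a replica on this node.
    node_share = replication_factor / nodes
    # Total bytes written across the cluster while the node was offline.
    missed_bytes = write_ops_per_sec * block_bytes * downtime_sec
    return missed_bytes * node_share * retained_fraction


# Hypothetical inputs: 10,000 writes/s of 8 KiB blocks, 10 minutes offline,
# 12 nodes, replication factor 3, half of the replicas still on the node.
backlog = catchup_backlog_bytes(
    write_ops_per_sec=10_000,
    block_bytes=8192,
    downtime_sec=600,
    nodes=12,
    replication_factor=3,
    retained_fraction=0.5,
)
print(f"approx. backlog: {backlog / 2**30:.2f} GiB")
```

With these made-up inputs the backlog is on the order of several GiB, all of which would be applied to a single store in a short window if nothing paces the catch-up.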

**To Reproduce**

Steps to reproduce the behavior:

1. Set up a 12-node cluster running a 50% KV workload with 20K splits and 8K block writes.
2. Take one node down, wait 10 minutes, and then bring the node back online.
3. Approximately 1-2 minutes after the node is restarted, all IO to the node stops. This continues for about 3-5 minutes.
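The steps above could be scripted roughly as follows. This is a sketch under several assumptions that are not stated in the report: that "50% KV workload" means `--read-percent=50` on the `kv` workload, that "8K block writes" means fixed 8192-byte blocks, and that the cluster is managed with `roachprod`. The cluster name, node number, and connection URL are hypothetical placeholders. The script only builds and prints the commands; it does not run them.

```python
import shlex


def build_repro_steps(cluster, node, pgurl, downtime_sec=600):
    """Return the reproduction commands as argument lists (nothing is executed)."""
    target = f"{cluster}:{node}"
    return [
        # Create the kv table, pre-split into 20K ranges.
        ["cockroach", "workload", "init", "kv", "--splits=20000", pgurl],
        # Run in a separate terminal: 50% reads (assumed), 8 KiB blocks.
        ["cockroach", "workload", "run", "kv",
         "--read-percent=50",
         "--min-block-bytes=8192",
         "--max-block-bytes=8192",
         pgurl],
        # Take one node down, wait, then bring it back.
        ["roachprod", "stop", target],
        ["sleep", str(downtime_sec)],
        ["roachprod", "start", target],
    ]


# Hypothetical cluster name, node number, and connection URL.
steps = build_repro_steps(
    cluster="repro-cluster",
    node=12,
    pgurl="postgres://root@localhost:26257?sslmode=disable",
)
for step in steps:
    print(shlex.join(step))
```

The workload command is meant to keep running in the background while the stop, sleep, and start commands are issued, so that the IO stall on the restarted node is visible in the workload's throughput.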

**Expected behavior**
Stopping and restarting a node should have no noticeable impact on the cluster.

**Additional data / screenshots**
![image](https://user-images.githubusercontent.com/46975918/212140448-47dc0dd3-20dc-4c32-a3c3-f70e508682f7.png)

**Environment:**
 - CockroachDB version - master


Jira issue: CRDB-23378

A node that is offline for a short period can cause a performance issue when it rejoins the cluster #95159
