A node that is offline for a short period can cause a performance issue when it rejoins the cluster #95159
Description
Describe the problem
In several customer cases, we have seen a node rejoin a cluster after being offline for a short period of time. While the node is offline, the replicas on it begin upreplicating to other nodes. In many cases, however, the node still holds a number of replicas when it comes back. Once the node is back, these replicas all attempt to catch up on their raft logs at the same time, which can cause a severe IO spike on the node. In many cases, this causes LSM inversion, and admission control throttles writes to this store. In addition, leases begin to be transferred onto this node, which compounds the problem.
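To give a feel for the size of the catch-up burst, here is a rough back-of-envelope sketch. It is not taken from CockroachDB code and none of the inputs were measured. It assumes writes are spread evenly across nodes, a replication factor of 3, and that `retained_fraction` of the node's replicas are still on it when it returns. The function name and the `writes_per_sec` value are hypothetical. The model ignores snapshots, write amplification, and rebalancing, so treat the result as an order-of-magnitude guess only.

```python
def estimate_catchup_bytes(writes_per_sec, block_bytes, downtime_sec,
                           num_nodes, replication_factor=3,
                           retained_fraction=1.0):
    """Rough estimate of raft log bytes a returning node must replay.

    Assumptions (not from the issue): uniform write distribution,
    no snapshots, no write amplification.
    """
    # Each logical write lands on `replication_factor` of `num_nodes` nodes,
    # so one node sees this share of the cluster's write traffic.
    node_share = replication_factor / num_nodes
    # Bytes the node would have received had it stayed online.
    missed = writes_per_sec * block_bytes * downtime_sec * node_share
    # Only replicas that were not moved away still need to catch up.
    return int(missed * retained_fraction)


if __name__ == "__main__":
    # 12 nodes, 8K blocks, and 10 minutes of downtime come from the repro
    # steps; 1000 writes/sec is a made-up placeholder.
    backlog = estimate_catchup_bytes(
        writes_per_sec=1000,
        block_bytes=8 * 1024,
        downtime_sec=10 * 60,
        num_nodes=12,
    )
    print(f"approximate catch-up backlog: {backlog / 2**30:.2f} GiB")
```

Even under these simplified assumptions, the backlog arrives over a short window after restart, which is consistent with the IO spike described above.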
To Reproduce
What did you do? Describe in your own words.
If possible, provide steps to reproduce the behavior:
- Set up a cluster (12 nodes) with a 50% KV workload with 20K splits and 8K block writes.
- Take one node down, wait 10 minutes, and then bring the node back online.
- Approximately 1-2 minutes after the node is restarted, all IO to the node stops. This continues for about 3-5 minutes.
Expected behavior
There should be no noticeable impact of stopping and restarting a node.
Environment:
- CockroachDB version - master
Jira issue: CRDB-23378
