A node that is offline for a short period can cause a performance issue when it rejoins the cluster #95159

@andrewbaptist

Describe the problem

In several customer cases, we have seen a node rejoin a cluster after being offline for a short period. While the node is offline, the replicas on it begin up-replicating to other nodes; however, in many cases the node still holds a number of replicas when it comes back. Once the node is back, these replicas all try to catch up on their Raft logs at the same time, which can cause a severe IO spike on the node. In many cases this causes LSM inversion, and admission control throttles writes to the store. In addition, leases begin to be transferred onto the node, which compounds the problem.
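As a rough illustration of why the catch-up can produce an IO spike, the sketch below estimates how much Raft log the returning node might have to replay. It is a back-of-the-envelope model, not a description of CockroachDB internals: the function names and all input values are hypothetical, and it ignores log truncation, snapshots, and write amplification in the LSM.

```python
def catchup_backlog_bytes(write_bytes_per_sec, downtime_sec, fraction_still_on_node):
    """Estimate the log bytes a rejoining node must replay.

    write_bytes_per_sec:    write rate that would have reached this node's
                            replicas had it stayed up (hypothetical input).
    downtime_sec:           how long the node was offline.
    fraction_still_on_node: share of its replicas that were not moved to
                            other nodes before it came back (0.0 to 1.0).
    """
    return write_bytes_per_sec * downtime_sec * fraction_still_on_node


def drain_seconds(backlog_bytes, catchup_bytes_per_sec):
    """Time to absorb the backlog at a given sustained catch-up rate."""
    return backlog_bytes / catchup_bytes_per_sec


if __name__ == "__main__":
    MIB = 2**20
    # Hypothetical numbers chosen only to show the arithmetic.
    backlog = catchup_backlog_bytes(10 * MIB, 600, 0.5)
    print(f"backlog: {backlog / MIB:.0f} MiB")
    print(f"drain time at 50 MiB/s: {drain_seconds(backlog, 50 * MIB):.0f} s")
```

The point of the model is only that the backlog grows linearly with downtime, so even a short outage can leave the node with a burst of writes that arrive on top of its normal foreground traffic.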

To Reproduce

If possible, provide steps to reproduce the behavior:

  1. Set up a cluster (12 nodes) with a 50% KV workload with 20K splits and 8K block writes.
  2. Take one node down, wait 10 minutes, and then bring it back online.
  3. Approximately 1-2 minutes after the node is restarted, all IO to the node stops. This continues for about 3-5 minutes.
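The workload in step 1 could be driven with something like the following sketch, which only builds the `cockroach workload` command lines. It rests on several assumptions that are not stated in this report: that "50% KV workload" means 50% reads, that "8K block writes" means 8192-byte blocks, and that the `kv` workload flags `--splits`, `--read-percent`, `--min-block-bytes`, and `--max-block-bytes` behave as named (check `cockroach workload run kv --help` for your version). The helper name and the connection URL are hypothetical.

```python
import shlex


def kv_workload_commands(pgurl, splits=20000, block_bytes=8192, read_percent=50):
    """Return (init, run) argument lists for the kv workload in step 1.

    The defaults are an interpretation of the reproduction steps,
    not values confirmed by the report.
    """
    init = ["cockroach", "workload", "init", "kv", f"--splits={splits}", pgurl]
    run = [
        "cockroach", "workload", "run", "kv",
        f"--read-percent={read_percent}",
        f"--min-block-bytes={block_bytes}",
        f"--max-block-bytes={block_bytes}",
        pgurl,
    ]
    return init, run


if __name__ == "__main__":
    # Hypothetical connection URL for a local insecure cluster.
    init, run = kv_workload_commands(
        "postgresql://root@localhost:26257?sslmode=disable"
    )
    print(shlex.join(init))
    print(shlex.join(run))
```

Stopping and restarting the node in step 2 depends on how the cluster is deployed, so it is left out of the sketch.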

Expected behavior
There should be no noticeable impact of stopping and restarting a node.

Environment:

  • CockroachDB version - master

Jira issue: CRDB-23378

Metadata

Labels

  • A-admission-control
  • A-kv: Anything in KV that doesn't belong in a more specific category.
  • A-kv-distribution: Relating to rebalancing and leasing.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
