
kvserver: lease transferred to follower waiting for split #79385

@tbg

Describe the problem

In an experiment, a ten-node cluster was run with IO overload on n3 and a 2TB bank import. This caused many of the splits involving a follower on n3 to get into a state where the n3 replica was "uninitialized" (because the split trigger would be wildly delayed).

We were seeing evidence that sometimes the lease would get transferred to the n3 replica, creating an outage. SSTs would then get stuck in NotLeaseHolderError loops and bounce around for hours.

The full internal thread is here

To Reproduce

Set up a 10-node AWS cluster via roachprod according to the steps in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1649015732041819 and deploy the following unit to n3:

# Throughput killer
roachprod ssh tobias-import:3 -- sudo systemd-run --unit fiotp --working-directory=/mnt/data1/ -- fio --rw=readwrite --name=test --size=50M --direct=1 --bs=1024k --ioengine=libaio --iodepth=16 --directory=/mnt/data1/ --time_based --timeout 2400h

Ran this on
71e32a6

Expected behavior

We don't transfer leases to replicas that have never been initialized.
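
A minimal sketch of the kind of guard this implies, written against made-up types rather than the real kvserver code. `ReplicaInfo`, its `Initialized` field, and `validLeaseTargets` are all hypothetical names used for illustration only. The idea is that the current leaseholder filters out any candidate that has not yet applied its split trigger (or a snapshot) before picking a transfer target:

```go
package main

import "fmt"

// ReplicaInfo is a hypothetical, simplified view of a replica as seen by the
// current leaseholder. It is NOT a kvserver type.
type ReplicaInfo struct {
	NodeID int
	// Initialized is assumed to be false until the replica has applied the
	// split trigger or received a snapshot.
	Initialized bool
}

// validLeaseTargets returns only those candidates that have been initialized.
// An uninitialized replica cannot serve requests, so handing it the lease
// would stall the range.
func validLeaseTargets(candidates []ReplicaInfo) []ReplicaInfo {
	var out []ReplicaInfo
	for _, c := range candidates {
		if c.Initialized {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// n3 stands in for the IO-overloaded follower still waiting for the split.
	candidates := []ReplicaInfo{
		{NodeID: 1, Initialized: true},
		{NodeID: 3, Initialized: false},
		{NodeID: 5, Initialized: true},
	}
	for _, t := range validLeaseTargets(candidates) {
		fmt.Printf("n%d is an eligible lease transfer target\n", t.NodeID)
	}
}
```

How the leaseholder would actually learn that a follower is uninitialized is left open here; the sketch only shows where the check would sit.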

Environment:
master on roachprod AWS

Jira issue: CRDB-14780

Epic CRDB-16160

Metadata

Assignees

No one assigned

    Labels

    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    T-kv: KV Team
