kvserver: lease transferred to follower waiting for split #79385
Description
Describe the problem
In an experiment, a ten-node cluster was run with IO overload on n3 and a 2TB bank import. This caused many of the splits involving a follower on n3 to get into a state where the n3 replica was "uninitialized" (because the split trigger would be wildly delayed).
We were seeing evidence that the lease would sometimes get transferred to the n3 replica, creating an outage. SSTs would then get stuck in NotLeaseHolderError loops and bounce around for hours.
The full internal thread is here
To Reproduce
Set up a 10-node AWS cluster via roachprod according to the steps in https://cockroachlabs.slack.com/archives/C0KB9Q03D/p1649015732041819 and deploy the following unit to n3:
# Throughput killer
roachprod ssh tobias-import:3 -- sudo systemd-run --unit fiotp --working-directory=/mnt/data1/ -- fio --rw=readwrite --name=test --size=50M --direct=1 --bs=1024k --ioengine=libaio --iodepth=16 --directory=/mnt/data1/ --time_based --timeout 2400h
Ran this on 71e32a6.
Expected behavior
We don't transfer leases to replicas that have never been initialized.
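As a rough illustration of the expected behavior, the check could look something like the sketch below. The `candidate` type, its `Initialized` field, and `eligibleLeaseTargets` are hypothetical names made up for this example; they are not the actual kvserver types or allocator code, and the real fix may live somewhere else entirely.

```go
package main

// candidate is a hypothetical stand-in for a lease transfer target.
// It is NOT a real CockroachDB type.
type candidate struct {
	NodeID int
	// Initialized is false while the replica is still waiting for its
	// split trigger (or a snapshot) and has never applied any state.
	Initialized bool
}

// eligibleLeaseTargets (hypothetical helper) drops candidates whose
// replica has never been initialized, so the lease is never handed
// to a follower that cannot serve requests.
func eligibleLeaseTargets(cands []candidate) []candidate {
	var out []candidate
	for _, c := range cands {
		if c.Initialized {
			out = append(out, c)
		}
	}
	return out
}
```

With candidates on n1 (initialized) and n3 (uninitialized), only n1 would remain as a possible target.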
Environment:
master on roachprod AWS
Jira issue: CRDB-14780
Epic CRDB-16160