kvserver: disk stall prevents lease transfer #81100

@erikgrinaker

Description

A node experiencing a persistent disk stall will fail to maintain liveness, since we do a sync disk write before each heartbeat:

// We synchronously write to all disks before updating liveness because we
// don't want any excessively slow disks to prevent leases from being
// shifted to other nodes. A slow/stalled disk would block here and cause
// the node to lose its leases.
if err := storage.WriteSyncNoop(ctx, eng); err != nil {
return Record{}, errors.Wrapf(err, "couldn't update node liveness because disk write failed")
}

This will in turn cause it to lose all of its leases. However, it also prevents any other node from acquiring those leases, causing all affected ranges to remain stuck until the disk recovers or the node is shut down (e.g. by the Pebble disk stall detector, which panics the process).

Notes from elsewhere below.


Reproduction Details

I'm intentionally trying to reduce the failure modes to the simplest case here -- we can look into more complicated ones when we've picked the low-hanging fruit.

  • 6 nodes: n6 only used for workload, n5 is the faulty node.
  • All system ranges are pinned to n1-4, to avoid interactions with them.
  • Logs are stored on a separate disk, so logging is not stalled.
  • All client traffic is directed at n1-4, to avoid interactions with client processing.

Instructions:

  • Create a 6-node cluster and stage it with necessary binaries:

    roachprod create -n 6 grinaker-diskstall
    roachprod stage grinaker-diskstall release v21.2.9
    roachprod put grinaker-diskstall bin.docker_amd64/workload
    
  • Wrap the SSDs in a delayable device, with 0 delay initially:

    roachprod run grinaker-diskstall 'sudo umount /mnt/data1'
    roachprod run grinaker-diskstall 'echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup create delayed'
    roachprod run grinaker-diskstall 'sudo mount /dev/mapper/delayed /mnt/data1'
    
  • Start the cluster and configure it:

    roachprod start --racks 5 grinaker-diskstall:1-5
    
    ALTER RANGE meta CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE liveness CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE timeseries CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER DATABASE system CONFIGURE ZONE USING num_replicas = 3, constraints = '[-rack=4]';
    ALTER RANGE default CONFIGURE ZONE USING gc.ttlseconds = 600;
    
  • Start a KV workload from n6 against n1-4:

    ./workload run kv --init --splits 100 --read-percent 95 --batch 10 --min-block-bytes 4000 --max-block-bytes 4000 --tolerate-errors <pgurls>
    
  • Introduce an IO delay of 100s per operation on n5 (takes effect immediately, but the command hangs for a while):

    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 100000" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    
    # Inspect delay
    sudo dmsetup table delayed
    
    # Disable delay
    echo "0 `sudo blockdev --getsz /dev/nvme0n1` delay /dev/nvme0n1 0 0" | sudo dmsetup reload delayed && sudo dmsetup resume delayed
    

The workload throughput will now have dropped to ~0.

When n5 stalls, all other nodes have it cached as the leaseholder, and will send RPCs to it first. These nodes rely on n5 to return a NotLeaseHolderError to inform them that the lease is invalid, at which point they will try a different node. When this other node receives the RPC, it will detect the invalid lease, acquire a new lease, then return a successful response to the caller who updates its lease cache.

What happens here is that n5 never returns a NotLeaseHolderError, which prevents the rest from happening -- including another node claiming the lease. The lease thus remains invalid, and the workload stalls.
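For intuition, here is a minimal sketch of that retry flow (illustrative Go; errNotLeaseHolder, sendToRange, and the function-valued replicas are hypothetical simplifications of the real DistSender machinery): the loop only advances past the cached leaseholder if that replica actually responds with an error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeaseHolder stands in for the real NotLeaseHolderError type.
var errNotLeaseHolder = errors.New("NotLeaseHolderError")

// sendToRange tries the cached leaseholder first and, on
// NotLeaseHolderError, falls through to the other replicas, updating
// the cache when one of them responds successfully.
func sendToRange(replicas []func() error, cache *int) error {
	order := append([]int{*cache}, otherIndexes(len(replicas), *cache)...)
	for _, i := range order {
		err := replicas[i]()
		if err == nil {
			*cache = i // successful responder is the new cached leaseholder
			return nil
		}
		if errors.Is(err, errNotLeaseHolder) {
			continue // invalid lease: try the next replica
		}
		return err
	}
	return errors.New("all replicas failed")
}

func otherIndexes(n, skip int) []int {
	var out []int
	for i := 0; i < n; i++ {
		if i != skip {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	cache := 0
	replicas := []func() error{
		func() error { return errNotLeaseHolder }, // old leaseholder rejects
		func() error { return nil },               // acquires the lease, succeeds
	}
	fmt.Println(sendToRange(replicas, &cache), cache) // <nil> 1
}
```

The failure mode in this issue is exactly the case the sketch cannot reach: the cached leaseholder (n5) never returns at all, so the loop never advances to a replica that could acquire the lease.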

So why doesn't n5 return a NotLeaseHolderError? It notices that its lease is invalid and tries to reacquire it for itself. While doing so, it first sends a liveness heartbeat to make sure it's live before acquiring the lease. Recall that sending a heartbeat performs a sync disk write (the very write that made it lose the lease in the first place), and because the disk is stalled, this operation stalls too. Forever.

// If this replica is previous & next lease holder, manually heartbeat to become live.
if status.OwnedBy(nextLeaseHolder.StoreID) &&
    p.repl.store.StoreID() == nextLeaseHolder.StoreID {
    if err = p.repl.store.cfg.NodeLiveness.Heartbeat(ctx, status.Liveness); err != nil {
        log.Errorf(ctx, "failed to heartbeat own liveness record: %s", err)
    }
}
If I add a simple 5-second timeout for lease acquisitions, this operation fails and returns a NotLeaseHolderError, allowing a different node to claim the lease. You may wonder: how does context cancellation work when we're blocked on a sync disk write? Because we're not actually blocked on the sync write itself -- we're blocked on a semaphore waiting for a previous heartbeat to complete, and it's that heartbeat that is blocked on the sync write:

// Allow only one heartbeat at a time.
nodeID := nl.gossip.NodeID.Get()
sem := nl.sem(nodeID)
select {
case sem <- struct{}{}:
case <-ctx.Done():
    return ctx.Err()
}

With this timeout in place, things are much better:

  • A read-only workload recovers within 15 seconds of the disk stall.
  • A write-only workload hangs forever, because all clients are blocked on Raft application. However, new reads and writes succeed after 15 seconds.
  • A write-only workload with 10-second statement timeouts recovers within 30 seconds.
  • A read/write workload with 10-second statement timeouts recovers within 15 seconds.

To completely fix this, such that clients don't get stuck on Raft application (or waiting on their latches), we'll need circuit breaker support (#74712). However, even without that, this is a much better state of affairs where clients with timeouts will recover within a reasonable time.

I'm not sure a timeout is necessarily the best approach here, so I'm going to look into a few options, but this is definitely tractable in the short term and likely backportable in some form.

Jira issue: CRDB-15145

gz#13737

Epic CRDB-19227

Labels: A-kv (Anything in KV that doesn't belong in a more specific category), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-kv (KV Team)
