
Unavailability due to lease transfer onto replica waiting for Raft snapshot #68539

@erikgrinaker

Description

We've seen two customers hit issues with leases being transferred to nodes that are still waiting for Raft snapshots. If there are many snapshots queued up (as when a new node joins a cluster with lots of ranges), this can leave the range unavailable for a significant period.

@nvanbenschoten writes in an internal ticket:

The high-level overview of what went wrong from the customer's perspective was that during a period of concurrent upreplication to an incoming node and decommissioning of an outgoing node, they saw two periods of high SQL latency, each lasting for a few minutes. We found that during each of these outages, one range was effectively unavailable.

We poked around in the graphs and found that during these periods, we had a high frequency of NotLeaseHolderErrors. Our hunch was that we had a lease transfer to a replica that was behind on its Raft log, so we entered a redirection loop (not a hot one, since we back off exponentially) where the outgoing leaseholder redirected to the incoming leaseholder, and vice versa. Since the new leaseholder needs to apply the Raft log up to the lease transfer before it can take over as the leaseholder, we essentially had no leaseholder for a few minutes. We confirmed that this was the case and were able to grab logs from the problem range in each instance.
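The redirection loop described above can be sketched roughly as follows. This is a hypothetical illustration, not CockroachDB's actual DistSender code; `errNotLeaseHolder`, `sendWithBackoff`, and the backoff constants are invented for the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotLeaseHolder stands in for kvpb.NotLeaseHolderError: the replica we
// contacted does not hold the lease and redirects us to another replica.
var errNotLeaseHolder = errors.New("not lease holder")

// sendWithBackoff is a hypothetical sketch of the retry loop described
// above: on a not-leaseholder redirect it retries, backing off
// exponentially so the loop never becomes hot. It returns the number of
// attempts made and the final error.
func sendWithBackoff(send func() error, maxAttempts int) (int, error) {
	backoff := 1 * time.Millisecond // illustrative base backoff
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := send(); !errors.Is(err, errNotLeaseHolder) {
			return attempt, err // success, or a non-redirect error
		}
		time.Sleep(backoff)
		backoff *= 2 // exponential backoff between redirects
	}
	return maxAttempts, errNotLeaseHolder
}

func main() {
	// Simulate an incoming leaseholder that only catches up (applies the
	// lease transfer) after a few redirects.
	remaining := 3
	attempts, err := sendWithBackoff(func() error {
		if remaining > 0 {
			remaining--
			return errNotLeaseHolder
		}
		return nil
	}, 10)
	fmt.Println(attempts, err) // 4 <nil>
}
```

In the incident above, the loop itself behaved correctly; the problem was that the "catch up" step on the new leaseholder took minutes because it depended on a queued snapshot.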

The logs show that in each case, the lease was transferred to a replica shortly after it was added to the range and promoted from a LEARNER to a VOTER. The incoming leaseholder then waited a few minutes to apply a VIA_SNAPSHOT_QUEUE snapshot before it was able to start serving traffic. So presumably, the snapshot contained the lease transfer.

So what does this mean? It means that the outgoing leaseholder transferred its lease and then the leader (same replica? unclear) truncated the Raft log shortly afterward. This prevented the incoming leaseholder from applying the lease transfer through standard log application; instead, the leader needed to send it a Raft snapshot to catch it up. And this snapshot was delayed for a few minutes because there were so many other snapshots in the system at the time. We see queueing on the receiver (queued: 46.68s and queued: 47.36s), and there was presumably also queueing on the sender in the raftsnapshot queue.
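The safeguard that failed here can be pictured as a simple invariant: the leader should not truncate its Raft log past the lowest index any follower has durably matched, since a follower behind the truncation point can only be caught up via a snapshot. A minimal sketch of that guard, with invented names (this is not CockroachDB's actual raftLogQueue logic):

```go
package main

import "fmt"

// truncationIndex clamps a proposed Raft log truncation index so it never
// passes any follower's matched index. A follower whose match index is
// below the truncation point can no longer catch up through ordinary log
// replication and must receive a full Raft snapshot instead.
func truncationIndex(proposed uint64, followerMatch []uint64) uint64 {
	idx := proposed
	for _, m := range followerMatch {
		if m < idx {
			idx = m // hold truncation back for the slowest follower
		}
	}
	return idx
}

func main() {
	// The leader wants to truncate up to index 100, but an incoming
	// leaseholder has only matched index 40: truncation must stop at 40.
	fmt.Println(truncationIndex(100, []uint64{150, 120, 40})) // 40
}
```

In the incident, something let truncation proceed past the incoming leaseholder's position anyway, which is what point 1 below asks us to track down.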

Obviously, it's not a good idea to truncate a Raft log ahead of a leaseholder. It's especially bad to do so ahead of an incoming leaseholder that hasn't yet found out that it has the lease. How can we improve this? I think there are two ways to approach this and we should explore both:

  1. don't truncate our Raft log in front of an incoming leaseholder. We do our best not to truncate the Raft log in front of any follower, so something must have gone wrong here. We should page this back in and determine whether there's a missing safeguard. Is the INITIAL snapshot sent during upreplication subject to the same protection (see snapshotLogTruncationConstraints) against immediate log truncation that the raft snapshot queue is? It appears to be here.
  2. prioritize Raft snapshots sent to leaseholder replicas at the various levels that a snapshot can queue (i.e. at sender and receiver). We don't seem to do this anywhere, but it seems like a good idea. If a leaseholder needs a snapshot, it needs it ASAP because the range will be unavailable (to writes in the general case, and to reads+writes in cases like this) until it applies the snapshot.
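Approach 2 could be as simple as a stable reordering of a snapshot queue that floats leaseholder-bound snapshots to the front while preserving FIFO order among peers. A hypothetical sketch (the type and field names are invented for illustration, not CockroachDB's actual queue):

```go
package main

import (
	"fmt"
	"sort"
)

// snapshotRequest is a hypothetical queued snapshot; isLeaseholder marks a
// recipient that holds (or is about to hold) the range lease.
type snapshotRequest struct {
	rangeID       int
	isLeaseholder bool
}

// prioritize reorders the queue so snapshots destined for leaseholder
// replicas are sent first, since the range is unavailable until such a
// snapshot applies. The stable sort keeps FIFO order within each class.
func prioritize(queue []snapshotRequest) []snapshotRequest {
	out := append([]snapshotRequest(nil), queue...)
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].isLeaseholder && !out[j].isLeaseholder
	})
	return out
}

func main() {
	q := []snapshotRequest{{1, false}, {2, false}, {3, true}, {4, false}}
	var ids []int
	for _, r := range prioritize(q) {
		ids = append(ids, r.rangeID)
	}
	fmt.Println(ids) // [3 1 2 4]
}
```

A real implementation would need this ordering at every queueing point the text mentions (the sender's raftsnapshot queue and the receiver's admission queue), not just one.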

/cc @cockroachdb/kv

gz#9425

Metadata


Assignees

No one assigned

    Labels

    A-kv-replication: Relating to Raft, consensus, and coordination.
    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    S-2-temp-unavailability: Temp crashes or other availability problems. Can be worked around or resolved by restarting.
    T-kv: KV Team
