
kv: always transfer expiration-based leases during lease transfers #81764

@nvb


Action item from #81561.

Described in #81561 (comment).

In a private Slack conversation, @tbg had an additional proposal to minimize the blast radius of an ill-advised lease transfer. It's quite clever and serves as an actionable recovery mechanism. The idea stems from this observation:

> However, even though the leaseholder does not recognize itself as such, it continues to heartbeat its liveness record, indirectly extending its lease so that it does not expire.

This is a key part of the hazard here. With expiration-based leases, a leaseholder needs to periodically (every 4.5 seconds) extend the lease by 9 seconds or it will expire and be up for grabs. This requires the leaseholder to recognize itself as the leaseholder within 9 seconds after a lease transfer. However, with epoch-based leases, this lease extension is indirectly performed through the node's liveness record. This means that a newly appointed leaseholder can hold on to the lease for an unbounded amount of time, even if it never recognizes itself as the leaseholder.
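The difference can be illustrated with a toy model. This is only a sketch, not CockroachDB code: `Node`, `time_until_lapse`, and the tick-based clock are hypothetical names and simplifications. It assumes the new leaseholder never learns about the transfer, so it never extends the range lease itself, while its node keeps heartbeating liveness every 4.5 seconds.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_DURATION = 9.0     # seconds; the lease length quoted in the text
RENEWAL_INTERVAL = 4.5   # seconds; the extension / heartbeat period quoted in the text


@dataclass
class Node:
    """Hypothetical stand-in for a node and its liveness record."""
    liveness_expiration: float = LEASE_DURATION

    def heartbeat(self, now: float) -> None:
        # Node-level heartbeat: happens whether or not any replica on the
        # node knows that it holds a range lease.
        self.liveness_expiration = now + LEASE_DURATION


def time_until_lapse(kind: str, horizon: float) -> Optional[float]:
    """Return the time at which an unacknowledged lease stops being valid,
    or None if it is still valid at every tick up to `horizon`."""
    node = Node()
    # Expiration set by the transfer at t=0; never extended afterwards,
    # because the holder does not know it is the leaseholder.
    range_lease_expiration = LEASE_DURATION
    now = 0.0
    while now <= horizon:
        node.heartbeat(now)
        if kind == "expiration":
            valid = now < range_lease_expiration
        elif kind == "epoch":
            # Validity is tied to the liveness record, which keeps moving.
            valid = now < node.liveness_expiration
        else:
            raise ValueError(f"unknown lease kind: {kind}")
        if not valid:
            return now
        now += RENEWAL_INTERVAL
    return None
```

In this model the expiration-based lease lapses at the 9-second mark, while the epoch-based lease remains valid for the whole simulated horizon.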

Tobi's proposal is that even on the portion of the keyspace that can use epoch-based leases, lease transfers could always install an expiration-based lease. The new leaseholder would then need to learn about this lease within 9 seconds or the lease would expire. When performing its first lease extension, it would promote the lease back to an epoch-based lease. This limits the damage of a bad lease transfer to a 9-second outage. There are likely some bugs lurking here because we don't regularly switch between epoch-based and expiration-based leases, but doing so is meant to be possible.
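The proposed flow can be sketched as a small state machine. This is an illustrative model under stated assumptions, not the actual implementation: `Lease`, `transfer_lease`, and `first_extension` are hypothetical names, and `epoch_eligible` stands in for "this part of the keyspace can use epoch-based leases".

```python
from dataclasses import dataclass
from typing import Optional

LEASE_DURATION = 9.0  # seconds; the lease length quoted in the text


@dataclass(frozen=True)
class Lease:
    holder: str
    kind: str                           # "expiration" or "epoch"
    expiration: Optional[float] = None  # only set for expiration-based leases


def transfer_lease(target: str, now: float) -> Lease:
    """A transfer always installs an expiration-based lease on the target."""
    return Lease(holder=target, kind="expiration", expiration=now + LEASE_DURATION)


def first_extension(lease: Lease, replica: str, now: float,
                    epoch_eligible: bool = True) -> Optional[Lease]:
    """Model the new holder's first lease extension.

    Returns the lease after the extension, or None if the transferred lease
    expired before the holder acted on it (the lease is then up for grabs).
    """
    if lease.holder != replica:
        raise ValueError("only the lease holder can extend the lease")
    if lease.kind == "expiration" and now >= lease.expiration:
        return None
    if epoch_eligible:
        # Promote back to an epoch-based lease on the first extension.
        return Lease(holder=replica, kind="epoch")
    return Lease(holder=replica, kind="expiration", expiration=now + LEASE_DURATION)
```

For example, a lease transferred at `now=100.0` expires at `109.0`; a first extension at `103.0` yields an epoch-based lease, while one attempted at `110.0` finds the lease already expired.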

One potential hazard is that this limits the potential lease candidates to those replicas which are less than 9s behind on their log. If a lease target is persistently more than 9s behind, the lease could thrash back and forth. This could partially re-introduce #38065, or an even more disruptive variant of that issue (due to the thrashing). I'm not sure whether that's a real concern, as 9s of replication lag is severe and a leaseholder should not attempt to transfer a lease to a replica that is so far behind on its log. However, it's difficult to reason through whether the quota pool provides any real protection here. We'll need to keep this hazard in mind.
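The candidate restriction described above amounts to a simple filter. The sketch below assumes, purely for illustration, that each replica's lag can be expressed in seconds; `viable_targets` and the `lag_seconds` mapping are hypothetical and do not reflect how replication progress is actually tracked.

```python
from typing import Dict, List

LEASE_DURATION = 9.0  # seconds; the lease length quoted in the text


def viable_targets(lag_seconds: Dict[str, float]) -> List[str]:
    """Return replicas that could plausibly learn of a transferred
    expiration-based lease before it expires, in sorted order.

    A replica that is LEASE_DURATION or more behind would see the lease
    expire before applying the transfer, which is the thrashing risk.
    """
    return sorted(r for r, lag in lag_seconds.items() if lag < LEASE_DURATION)
```

With lags of 0.3s, 12.0s, and 8.9s for `n2`, `n3`, and `n4`, only `n2` and `n4` remain as candidates.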

I think we should explore this recovery mechanism in addition to the proposed protection mechanism presented in this issue. With those two changes, we will be in a much better place.

Jira issue: CRDB-16064

Epic CRDB-16160

Labels: A-kv, C-enhancement, N-followup, O-postmortem, T-kv
