
kv: use tscache summary to eliminate txn retries due to lease transfers and range merges #61986

@nvb

Background / Motivation

In #60521, we began shipping a summary of an outgoing leaseholder's timestamp cache during lease transfers and range merges. The incoming / post-merge leaseholder then applies this summary after assuming control of the keyspace to ensure that no future writes are allowed to invalidate prior reads.

The structure, represented by a new ReadSummary proto, includes information like: reads on key "a" were served up to time 10. Because the timestamp cache is also used to prevent transaction replays (see CanCreateTxnRecord), the summary also includes information like: txn 1234 already committed.
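To make the shape of this information concrete, here is a minimal Go sketch. The types `ReadSummary`, `Segment`, and `ReadSpan`, the `int64` timestamps, the `txn-1234` key, and the `MaxReadTime` helper are all simplifications assumed for illustration. They are not the actual proto definitions.

```go
package main

import "fmt"

// ReadSpan is a hypothetical record: reads on keys in [Key, EndKey) were
// served up to Timestamp. An empty EndKey means the single key Key.
type ReadSpan struct {
	Key, EndKey string
	Timestamp   int64
}

// Segment summarizes one part of the keyspace. LowWater is a floor that
// applies to every key in the segment.
type Segment struct {
	LowWater int64
	Spans    []ReadSpan
}

// ReadSummary splits the summary into a local segment (modeled here as
// holding transaction records) and a global segment (user keys).
type ReadSummary struct {
	Local, Global Segment
}

func (sp ReadSpan) contains(key string) bool {
	if sp.EndKey == "" {
		return key == sp.Key
	}
	return key >= sp.Key && key < sp.EndKey
}

// MaxReadTime returns the highest timestamp at which key may have been
// read, according to this segment.
func (s Segment) MaxReadTime(key string) int64 {
	ts := s.LowWater
	for _, sp := range s.Spans {
		if sp.contains(key) && sp.Timestamp > ts {
			ts = sp.Timestamp
		}
	}
	return ts
}

func main() {
	sum := ReadSummary{
		Local:  Segment{LowWater: 5, Spans: []ReadSpan{{Key: "txn-1234", Timestamp: 12}}},
		Global: Segment{LowWater: 5, Spans: []ReadSpan{{Key: "a", Timestamp: 10}}},
	}
	fmt.Println(sum.Global.MaxReadTime("a"))       // 10
	fmt.Println(sum.Global.MaxReadTime("b"))       // 5
	fmt.Println(sum.Local.MaxReadTime("txn-1234")) // 12
}
```

Keys without a matching span fall back to the segment's low-water mark, which is what makes a summary with no spans at all equivalent to a single coarse bump.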

This replaced an existing mechanism where the incoming leaseholder would conservatively bump its timestamp cache all the way up to the lease start time / merge freeze time across its entire keyspace. The higher-fidelity summary allows the new leaseholder to avoid bumping its timestamp cache as aggressively. This can limit the impact of lease transfers on foreground traffic in two ways: it reduces false read-write contention, which causes transaction retries (e.g. see TestStoreLeaseTransferTimestampCacheRead), and it reduces false positives in transaction replay detection, which cause transaction aborts (e.g. see TestStoreLeaseTransferTimestampCacheTxnRecord).
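The difference between the two mechanisms can be illustrated with a toy model. The sketch below assumes a hypothetical `tsCache` with per-key entries and `int64` timestamps, and models a pushed write as landing one tick above the conflicting read. None of these names correspond to the real timestamp cache API.

```go
package main

import "fmt"

// tsCache is a toy timestamp cache: a low-water mark for all keys plus
// per-key read timestamps. (Hypothetical; the real cache tracks key spans.)
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

func newTSCache() *tsCache { return &tsCache{reads: map[string]int64{}} }

// maxRead returns the latest time at which key may have been read.
func (c *tsCache) maxRead(key string) int64 {
	if ts := c.reads[key]; ts > c.lowWater {
		return ts
	}
	return c.lowWater
}

// writeTimestamp returns the timestamp a write to key at ts must use so
// that it does not invalidate a prior read.
func (c *tsCache) writeTimestamp(key string, ts int64) int64 {
	if r := c.maxRead(key); r >= ts {
		return r + 1 // pushed above the read
	}
	return ts
}

// applyConservative models the old mechanism: bump every key to leaseStart.
func (c *tsCache) applyConservative(leaseStart int64) {
	if leaseStart > c.lowWater {
		c.lowWater = leaseStart
	}
}

// applySummary models the new mechanism: ratchet forward only what the
// outgoing leaseholder actually reported.
func (c *tsCache) applySummary(lowWater int64, reads map[string]int64) {
	if lowWater > c.lowWater {
		c.lowWater = lowWater
	}
	for k, ts := range reads {
		if ts > c.reads[k] {
			c.reads[k] = ts
		}
	}
}

func main() {
	old := newTSCache()
	old.applyConservative(20) // lease start time 20

	cur := newTSCache()
	cur.applySummary(5, map[string]int64{"a": 10})

	fmt.Println(old.writeTimestamp("b", 15)) // 21: pushed by the blanket bump
	fmt.Println(cur.writeTimestamp("b", 15)) // 15: no conflicting read was reported
	fmt.Println(cur.writeTimestamp("a", 8))  // 11: still protected by the read at 10
}
```

In this model, the write to "b" is pushed only under the conservative scheme, which is the kind of false contention the summary is meant to avoid, while the real read on "a" remains protected either way.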

However, currently, the ReadSummary structure still only maintains a low-resolution snapshot of the outgoing leaseholder's timestamp cache. So while #60521 introduced the mechanism to ship a timestamp cache summary (and fixed two bugs in the process), it didn't actually begin taking full advantage of this new mechanism.

Proposed Change

Now that the ReadSummary mechanism is in place, we can begin using it to limit transaction retries due to lease transfers and range merges.

First, we'll want to clean up the compatibility code added in #60521, now that we can be sure that all nodes in any cluster that touch our changes here will be aware of the ReadSummary and will be using the per-range closed timestamp system. This will allow us to remove code like this and this.

Once we do that, we'll want to augment the ReadSummary structure with the ability to carry high-resolution information, subject to some memory limits. We'll update Replica.GetCurrentReadSummary to collect this information from the leaseholder's timestamp cache. The details of this structure are TBD, as are the details of the memory limits. We'll likely want to prioritize the local keyspace segment over the global keyspace segment, as txn aborts are more disruptive than txn pushes: unlike pushes, aborts cannot be refreshed away.
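Since the structure and the limits are still TBD, the following is only one possible shape, sketched under assumptions: the budget is counted in spans rather than bytes, the local segment is filled before the global one, and any span that does not fit is folded into its segment's low-water mark so the summary stays conservative. The names `collect` and `truncate` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ReadSpan and Segment are simplified, hypothetical types.
type ReadSpan struct {
	Key       string
	Timestamp int64
}

type Segment struct {
	LowWater int64
	Spans    []ReadSpan
}

// truncate keeps the budget spans with the highest timestamps and folds
// every dropped span into the low-water mark, so no read is forgotten.
func truncate(lowWater int64, spans []ReadSpan, budget int) Segment {
	sorted := append([]ReadSpan(nil), spans...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Timestamp > sorted[j].Timestamp
	})
	seg := Segment{LowWater: lowWater}
	for i, sp := range sorted {
		if i < budget {
			seg.Spans = append(seg.Spans, sp)
		} else if sp.Timestamp > seg.LowWater {
			seg.LowWater = sp.Timestamp
		}
	}
	return seg
}

// collect spends the span budget on the local segment first and gives
// whatever remains to the global segment.
func collect(local, global []ReadSpan, lowWater int64, budget int) (Segment, Segment) {
	localBudget := budget
	if len(local) < localBudget {
		localBudget = len(local)
	}
	return truncate(lowWater, local, localBudget),
		truncate(lowWater, global, budget-localBudget)
}

func main() {
	local := []ReadSpan{{"txn-1", 12}, {"txn-2", 14}}
	global := []ReadSpan{{"a", 10}, {"b", 7}, {"c", 9}}

	l, g := collect(local, global, 5, 3)
	fmt.Println(len(l.Spans), l.LowWater) // 2 5
	fmt.Println(len(g.Spans), g.LowWater) // 1 9
}
```

With a budget of three spans, both transaction entries survive at full resolution, while the global segment keeps only its newest span and raises its low-water mark to cover the two it dropped.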

Once we can represent and capture higher-resolution information in a ReadSummary, we'll want to introduce some form of compression of these summaries. This will allow us to achieve the desired scheme of shipping a high-res, high-mem summary on log entries but only persisting a low-res, low-mem summary indefinitely in a range's keyspace:

// When a ReadSummary is set in a ReplicatedEvalResult, there is always also a
// write to the RangePriorReadSummaryKey in the RaftCommand.WriteBatch. The
// persisted summary may be identical to the summary in this field, but it
// does not have to be. Notably, we intended for the summary included in the
// ReplicatedEvalResult to eventually be a much higher-resolution version of
// the ReadSummary than the version persisted. This scheme of persisting a
// compressed ReadSummary indefinitely and including a higher-resolution
// ReadSummary on the RaftCommand allows us to optimize for the common case
// where the lease transfer is applied on the new leaseholder through Raft log
// application while ensuring correctness in the case where the lease transfer
// is applied on the new leaseholder through a Raft snapshot.
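One simple form of compression, shown here purely as an assumption about what it could look like, is to collapse every span into the segment's low-water mark. The hypothetical `Compress` method below does this on simplified types. The property that matters is that the compressed summary never reports a lower read time than the high-resolution one, so applying it from a Raft snapshot is still correct, just more conservative.

```go
package main

import "fmt"

// ReadSpan and Segment are simplified, hypothetical types.
type ReadSpan struct {
	Key       string
	Timestamp int64
}

type Segment struct {
	LowWater int64
	Spans    []ReadSpan
}

// MaxReadTime returns the highest timestamp at which key may have been read.
func (s Segment) MaxReadTime(key string) int64 {
	ts := s.LowWater
	for _, sp := range s.Spans {
		if sp.Key == key && sp.Timestamp > ts {
			ts = sp.Timestamp
		}
	}
	return ts
}

// Compress returns a low-resolution segment with no spans, whose low-water
// mark covers every span of the original.
func (s Segment) Compress() Segment {
	out := Segment{LowWater: s.LowWater}
	for _, sp := range s.Spans {
		if sp.Timestamp > out.LowWater {
			out.LowWater = sp.Timestamp
		}
	}
	return out
}

func main() {
	// High-resolution version, as it might travel on the Raft command.
	highRes := Segment{LowWater: 5, Spans: []ReadSpan{{"a", 10}, {"b", 7}}}
	// Low-resolution version, as it might be persisted in the range.
	persisted := highRes.Compress()

	for _, k := range []string{"a", "b", "c"} {
		fmt.Println(k, highRes.MaxReadTime(k), persisted.MaxReadTime(k))
	}
	// a 10 10
	// b 7 10
	// c 5 10
}
```

A leaseholder that applies the lease transfer through the log sees the precise values in the middle column, while one that catches up through a snapshot sees the coarser values in the last column.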

With these changes in place, we should see a stark drop in the disruption of lease transfers and range merges to foreground traffic.

Jira issue: CRDB-2701

gz#15057

gz#19312

Epic CRDB-34172

Labels: A-kv-distribution, A-kv-transactions, A-read-committed, C-enhancement, O-support, P-3, T-kv
