kv: use tscache summary to eliminate txn retries due to lease transfers and range merges #61986
Description
Background / Motivation
In #60521, we began shipping a summary of an outgoing leaseholder's timestamp cache during lease transfers and range merges. The incoming / post-merge leaseholder then applies this summary after assuming control of the keyspace to ensure that no future writes are allowed to invalidate prior reads.
The structure, represented by a new ReadSummary proto, includes information such as: reads on key "a" were served up to time 10. Because the timestamp cache is also used to prevent transaction replays (see CanCreateTxnRecord), the summary also includes information such as: txn 1234 already committed.
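To make the shape of such a summary concrete, here is a minimal sketch in Go. It is not the actual proto definition: the types below are simplified stand-ins, `Timestamp` is a plain integer rather than an HLC timestamp, and the split into a local and a global segment is assumed from the keyspace segments mentioned later in this issue.

```go
package main

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// Segment summarizes reads over one portion of a range's keyspace.
// LowWater means: any key in the segment may have been read up to this time.
type Segment struct {
	LowWater Timestamp
}

// ReadSummary is an illustrative two-segment summary. The local segment
// is assumed to cover range-local keys (such as transaction records) and
// the global segment is assumed to cover user keys.
type ReadSummary struct {
	Local  Segment
	Global Segment
}

// Merge folds another summary into s by keeping the later low-water mark
// in each segment. Taking the maximum is the conservative choice when two
// keyspaces are combined, for example during a range merge.
func (s *ReadSummary) Merge(o ReadSummary) {
	if o.Local.LowWater > s.Local.LowWater {
		s.Local.LowWater = o.Local.LowWater
	}
	if o.Global.LowWater > s.Global.LowWater {
		s.Global.LowWater = o.Global.LowWater
	}
}
```

With only a low-water mark per segment, "reads on key a were served up to time 10" is recorded as "reads on any key in the segment may have been served up to time 10", which is safe but coarse.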
This replaced an existing mechanism where the incoming leaseholder would conservatively bump its timestamp cache all the way up to the lease start time / merge freeze time across its entire keyspace. The higher fidelity summary allows the new leaseholder to avoid bumping its timestamp cache as aggressively. This can limit the impact of lease transfers on foreground traffic - reducing false read-write contention which causes transaction retries (e.g. see TestStoreLeaseTransferTimestampCacheRead) and reducing false positives for transaction replay detection which causes transaction aborts (e.g. see TestStoreLeaseTransferTimestampCacheTxnRecord).
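The difference between the two approaches can be illustrated with a toy timestamp cache. This is a sketch under simplifying assumptions, not the real cache: `tsCache`, `bumpAll`, and `writeTimestamp` are hypothetical names, timestamps are plain integers, and the cache only tracks a single floor plus per-key read times.

```go
package main

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// tsCache is a toy timestamp cache: a floor that applies to every key,
// plus the latest read time recorded for individual keys.
type tsCache struct {
	lowWater Timestamp
	reads    map[string]Timestamp
}

func newTSCache() *tsCache {
	return &tsCache{reads: map[string]Timestamp{}}
}

// bumpAll ratchets the floor for the entire keyspace up to ts.
func (c *tsCache) bumpAll(ts Timestamp) {
	if ts > c.lowWater {
		c.lowWater = ts
	}
}

// recordRead notes that key was read at ts.
func (c *tsCache) recordRead(key string, ts Timestamp) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// maxRead returns the latest time at which key may have been read.
func (c *tsCache) maxRead(key string) Timestamp {
	if ts := c.reads[key]; ts > c.lowWater {
		return ts
	}
	return c.lowWater
}

// writeTimestamp returns the timestamp a write to key must use. A write
// at or below a prior read is pushed just above that read.
func (c *tsCache) writeTimestamp(key string, ts Timestamp) Timestamp {
	if r := c.maxRead(key); ts <= r {
		return r + 1
	}
	return ts
}
```

Suppose the outgoing leaseholder served no reads above time 10 and the new lease starts at time 20. Under the old scheme the new leaseholder calls `bumpAll(20)`, so a write at time 15 is pushed to 21. When it instead applies a summary whose low-water mark is 10, the same write proceeds at 15.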
However, currently, the ReadSummary structure still only maintains a low-resolution snapshot of the outgoing leaseholder's timestamp cache. So while #60521 introduced the mechanism to ship a timestamp cache summary (and fixed two bugs in the process), it didn't actually begin taking full advantage of this new mechanism.
Proposed Change
Now that the ReadSummary mechanism is in place, we can begin using it to limit transaction retries due to lease transfers and range merges.
First, we'll want to clean up the compatibility code added in #60521, now that we can be sure that all nodes in any cluster that touch our changes here will be aware of the ReadSummary and will be using the per-range closed timestamp system. This will allow us to remove code like this and this.
Once we do that, we'll want to augment the ReadSummary structure with the ability to carry high-resolution information, subject to some memory limits. We'll update Replica.GetCurrentReadSummary to collect this information from the leaseholder's timestamp cache. The details of this structure are TBD, as are the details of the memory limits. We'll likely want to prioritize the local keyspace segment over the global keyspace segment: txn aborts are more disruptive than txn pushes because, unlike pushes, they cannot be refreshed away.
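Since the structure and limits are TBD, the following is only one possible sketch of budgeted collection. Everything here is hypothetical: the `ReadSpans` field, the `collectSegment` and `collectSummary` helpers, and the use of an entry count as the memory budget. The idea shown is that entries which do not fit are folded into the low-water mark, so no read is forgotten, and that the local segment is served from the budget before the global one.

```go
package main

import "sort"

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// ReadSpan is a hypothetical high-resolution entry: key was read up to Timestamp.
type ReadSpan struct {
	Key       string
	Timestamp Timestamp
}

// Segment carries a floor plus optional high-resolution entries above it.
type Segment struct {
	LowWater  Timestamp
	ReadSpans []ReadSpan // hypothetical field, not part of the current proto
}

// ReadSummary is an illustrative two-segment summary.
type ReadSummary struct {
	Local, Global Segment
}

// collectSegment keeps at most budget per-key entries. Entries that do
// not fit are folded into LowWater, which keeps the summary correct.
func collectSegment(reads map[string]Timestamp, lowWater Timestamp, budget int) Segment {
	spans := make([]ReadSpan, 0, len(reads))
	for k, ts := range reads {
		if ts > lowWater {
			spans = append(spans, ReadSpan{Key: k, Timestamp: ts})
		}
	}
	// Newest first: folding the oldest entries raises the floor the least.
	sort.Slice(spans, func(i, j int) bool {
		if spans[i].Timestamp != spans[j].Timestamp {
			return spans[i].Timestamp > spans[j].Timestamp
		}
		return spans[i].Key < spans[j].Key
	})
	seg := Segment{LowWater: lowWater}
	if budget < 0 {
		budget = 0
	}
	if len(spans) > budget {
		// The newest dropped entry becomes the new floor.
		seg.LowWater = spans[budget].Timestamp
		spans = spans[:budget]
	}
	for _, sp := range spans {
		if sp.Timestamp > seg.LowWater { // skip entries the floor already covers
			seg.ReadSpans = append(seg.ReadSpans, sp)
		}
	}
	return seg
}

// collectSummary gives the local segment first claim on the budget and
// leaves whatever remains to the global segment.
func collectSummary(local, global map[string]Timestamp, budget int) ReadSummary {
	l := collectSegment(local, 0, budget)
	g := collectSegment(global, 0, budget-len(l.ReadSpans))
	return ReadSummary{Local: l, Global: g}
}
```

For example, with a budget of three entries, two local entries are kept in full and the global segment keeps only its newest entry, with the rest absorbed into its low-water mark.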
Once we can represent and capture higher-resolution information in a ReadSummary, we'll want to introduce some form of compression of these summaries. This will allow us to achieve the desired scheme of shipping a high-res, high-mem summary on log entries but only persisting a low-res, low-mem summary indefinitely in a range's keyspace:
cockroach/pkg/kv/kvserver/kvserverpb/proposer_kv.proto
Lines 177 to 187 in e09b93f
// When a ReadSummary is set in a ReplicatedEvalResult, there is always also a
// write to the RangePriorReadSummaryKey in the RaftCommand.WriteBatch. The
// persisted summary may be identical to the summary in this field, but it
// does not have to be. Notably, we intended for the summary included in the
// ReplicatedEvalResult to eventually be a much higher-resolution version of
// the ReadSummmary than the version persisted. This scheme of persisting a
// compressed ReadSummary indefinitely and including a higher-resolution
// ReadSummary on the RaftCommand allows us to optimize for the common case
// where the lease transfer is applied on the new leaseholder through Raft log
// application while ensuring correctness in the case where the lease transfer
// is applied on the new leaseholder through a Raft snapshot.
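The relationship between the shipped and persisted summaries can be sketched as follows. This is an illustration only: the `ReadSpans` field and the `compress` and `maxRead` helpers are hypothetical, and real compression might retain some entries rather than collapsing everything. The property to preserve is that the compressed summary never reports an earlier read time for any key than the high-resolution one.

```go
package main

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// ReadSpan is a hypothetical high-resolution entry: key was read up to Timestamp.
type ReadSpan struct {
	Key       string
	Timestamp Timestamp
}

// Segment carries a floor plus optional high-resolution entries above it.
type Segment struct {
	LowWater  Timestamp
	ReadSpans []ReadSpan // hypothetical field
}

// maxRead returns the latest time at which key may have been read
// according to the segment.
func (s Segment) maxRead(key string) Timestamp {
	latest := s.LowWater
	for _, sp := range s.ReadSpans {
		if sp.Key == key && sp.Timestamp > latest {
			latest = sp.Timestamp
		}
	}
	return latest
}

// compress collapses a high-resolution segment into a single low-water
// mark by folding every entry into the floor. The result is smaller and
// strictly more conservative than the input.
func compress(s Segment) Segment {
	out := Segment{LowWater: s.LowWater}
	for _, sp := range s.ReadSpans {
		if sp.Timestamp > out.LowWater {
			out.LowWater = sp.Timestamp
		}
	}
	return out
}
```

A leaseholder that applies the transfer from the Raft log would see the high-resolution segment, so a key with no recorded read is only protected up to the original floor. One that catches up through a snapshot would see the compressed segment, where every key is protected up to the highest folded timestamp: still correct, but more likely to push writes.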
With these changes in place, we should see a stark drop in the disruption of lease transfers and range merges to foreground traffic.
Jira issue: CRDB-2701
gz#15057
gz#19312
Epic CRDB-34172