kvserver/closedts: auto-tune closed ts#142721
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
arulajmani
left a comment
Great work here! 🚀
Reminder that we could extend policy_refresher_test.go as well with a mocked out latency map. But let's do that in a followup PR.
Reviewed 4 of 11 files at r1, 15 of 15 files at r2, 11 of 11 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits line 16 at r3:
nit: "by the time that all follower replicas have received this closed timestamp, it's greater than the present time, so they can serve reads without redirecting to the leaseholder".
-- commits line 20 at r3:
"we fail to bump closed timestamps" begs a few questions -- why would this happen? maybe we could just drop this line.
-- commits line 36 at r3:
nit: a more succinct way to put this would be "higher lead time leads to high commit wait times, which in turn impacts write latency"
-- commits line 39 at r3:
"followers not fully replicating the changes" sounds off. Should this instead say "a closed timestamp that lags present time, causing follower replicas to redirect..."
-- commits line 48 at r3:
Is this stale? I would imagine this bit would be subsumed by the PolicyRefresher bit in the design.
-- commits line 63 at r3:
"the policy refresher"
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
// network RTT. Returns default max RTT for no latency info, or midpoint RTT for
// known latency buckets. Panics on unknown policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
Should we add a test for this?
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
// FindBucketBasedOnNetworkRTT maps a network RTT to a closed timestamp policy bucket.
// It divides RTT by policy interval, adds 1 for zero-based indexing, and offsets by
nit: the details about the computation don't need to live on the function level comment; instead, they're better placed closer to the logic.
pkg/kv/kvserver/replica.go line 1342 at r3 (raw file):
// RefreshPolicy updates the replica's cached closed timestamp policy based on
// zone configurations and provided node latencies.
"span configurations" at this level.
pkg/kv/kvserver/replica.go line 1346 at r3 (raw file):
// For ranges serving global reads,
// 1. If the provided map is nil, the policy will be
// LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO. The latency will be hardcoded to
nit: at this layer, because we're just refreshing the policy, I think this commentary will be clearer if we didn't add words about what latency is inferred from the policy.
If you want to add it, maybe move that bit of the commentary inside the function where we're assigning policies instead
pkg/kv/kvserver/replica.go line 1376 at r3 (raw file):
	}
}

// FindBucketBasedOnNetworkRTT returns LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO
This commentary is misplaced -- should we move it to FindBucketBasedOnNetworkRTT instead?
pkg/kv/kvserver/closedts/policy.go line 17 at r2 (raw file):
const (
	defaultMaxNetworkRTT                 = 150 * time.Millisecond
	closedTimestampPolicyLatencyInterval = 20 * time.Millisecond
How do you feel about renaming this to "closedTimestampPolicyBucketWidth"?
pkg/kv/kvserver/replica_closedts.go line 25 at r3 (raw file):
// timestamp. Falls back to LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO if the
// range's policy is not found in targetByPolicy. This fallback handles race
// conditions where the policy refresher observes cluster version upgrades
Could you say more about this cluster version upgrades race?
pkg/kv/kvserver/replica_closedts.go line 35 at r3 (raw file):
	return policy, target
}
if policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
It may be worth a comment explaining how this happens.
pkg/kv/kvserver/closedts/policy_calculation_test.go line 20 at r3 (raw file):
// TestNetworkRTTAndPolicyCalculations tests the conversion between network
// RTT and closed timestamp policy.
func TestNetworkRTTAndPolicyCalculations(t *testing.T) {
Nice test!
pkg/kv/kvserver/closedts/policy_calculation_test.go line 61 at r3 (raw file):
},
{
	name: "low-end bucket boundary",
Related to my comment on r1, let's add test cases for mid/low/high for each of the buckets. Mostly to serve as regression tests.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 310 at r3 (raw file):
}

// If the cluster is not fully upgraded, only look at closed timestamps policies
nit: consider moving this commentary closer to the logic below.
For here, maybe something like:
// maxClosedTimestampPolicy returns the distinct number of closed timestamp
// policies for which the side transport should send updates.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 317 at r3 (raw file):
// kv.closed_timestamp.policy_refresh_interval is 0.
func (s *Sender) maxClosedTimestampPolicy() int {
	// Even if no auto-tuning is disabled, still send them. No ranges would use
nit: if you take my comment above, this could be just deleted, as it's subsumed by the commentary above.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 441 at r3 (raw file):
	// LAI.
	needExplicit = true
} else if lastMsg.policy != closeRes.Policy {
Around version upgrades, is it possible that the policy has been refreshed by the range but we're not sending an update for latency buckets (presumably because maxClosedTimestampPolicy above was called when the cluster version had not been bumped)?
At the least, it would be worth trying to add a test that reproduces this race to see how things behave.
pkg/kv/kvserver/closedts/ctpb/service.proto line 42 at r2 (raw file):
// hardcoded network latency 150ms by default.
LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO = 1;
// Lead policy for ranges with global reads policy. Following policies are
nit: "The following policies"
pkg/kv/kvserver/closedts/sidetransport/sender_test.go line 392 at r3 (raw file):
// TestSenderWithLatencyTracker verifies that the sender correctly updates
// closed timestamp policies based on network latency between nodes.
func TestSenderWithLatencyTracker(t *testing.T) {
Nice test! 💯
pkg/ccl/multiregionccl/multiregion_test.go line 164 at r3 (raw file):
// This test verifies closed timestamp policy behavior with and without the
// policy refresher. With the refresher enabled, we expect latency-based
Do you mean autotuning? The policy refresher should be running regardless, right?
wenyihu6
left a comment
Sounds good, I added a new item in #143888 to track this. Thanks for the nice rewording as always!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
nit: "by the time that all follower replicas have received this closed timestamp, it's greater than the present time, so they can serve reads without redirecting to the leaseholder".
Done.
Previously, arulajmani (Arul Ajmani) wrote…
"we fail to bump closed timestamps" begs a few questions -- why would this happen? maybe we could just drop this line.
Removed for clarity. This may happen due to any of the reasons here - https://github.com/cockroachdb/cockroach/blob/872849938a5ab0c96d39bdd9274ea4897b2947e2/pkg/kv/kvserver/closedts/sidetransport/sender.go#L188-L198. In those cases, those ranges would be removed from the side transport ranges map.
Previously, arulajmani (Arul Ajmani) wrote…
nit: a more succinct way to put this would be "higher lead time leads to high commit wait times, which in turn impacts write latency"
Done.
Previously, arulajmani (Arul Ajmani) wrote…
"followers not fully replicating the changes" sounds off. Should this instead say "a closed timestamp that lags present time, causing follower replicas to redirect..."
Done.
Previously, arulajmani (Arul Ajmani) wrote…
Is this stale? I would imagine this bit would be subsumed by the
`PolicyRefresher` bit in the design.
Yes, corrected now.
Previously, arulajmani (Arul Ajmani) wrote…
"the policy refresher"
Done.
pkg/kv/kvserver/replica.go line 1342 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
"span configurations" at this level.
Done.
pkg/kv/kvserver/replica.go line 1346 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: at this layer, because we're just refreshing the policy, I think this commentary will be clearer if we didn't add words about what latency is inferred from the policy.
If you want to add it, maybe move that bit of the commentary inside the function where we're assigning policies instead
Moved the comment closer to the logic below.
pkg/kv/kvserver/replica.go line 1376 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This commentary is misplaced -- should we move it to
`FindBucketBasedOnNetworkRTT` instead?
Moved to FindBucketBasedOnNetworkRTT instead.
pkg/kv/kvserver/replica_closedts.go line 25 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Could you say more about this cluster version upgrades race?
I added more details to the race condition I was worried about. I think it is exactly what you described here - https://reviewable.io/reviews/cockroachdb/cockroach/142721#-ONKsU7WCKoy8IrrX8hq.
pkg/kv/kvserver/replica_closedts.go line 35 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
It may be worth a comment explaining how this happens.
Done.
pkg/kv/kvserver/closedts/policy.go line 17 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
How do you feel about renaming this to "closedTimestampPolicyBucketWidth"?
I like it, done.
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we add a test for this?
I initially added tests for them in the second commit. I've now moved the computeNetworkRTTBasedOnPolicy test to the first commit and rebased.
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: the details about the computation don't need to live on the function level comment; instead, they're better placed closer to the logic.
Done.
pkg/kv/kvserver/closedts/ctpb/service.proto line 42 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "The following policies"
Done.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 310 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: consider moving this commentary closer to the logic below.
For here, maybe something like:
// maxClosedTimestampPolicy returns the distinct number of closed timestamp
// policies for which the side transport should send updates.
Done.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 317 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: if you take my comment above, this could be just deleted, as it's subsumed by the commentary above.
Removed.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 441 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Around version upgrades, is it possible that the policy has been refreshed by the range but we're not sending an update for latency buckets (presumably because
`maxClosedTimestampPolicy` above was called when the cluster version had not been bumped)? At the least, it would be worth trying to add a test that reproduces this race to see how things behave.
Sounds good, added another item for testing in #143888. I think it is possible, and this is what I was trying to guard against in
cockroach/pkg/kv/kvserver/replica_closedts.go
Lines 32 to 49 in e7f411e
pkg/ccl/multiregionccl/multiregion_test.go line 164 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do you mean autotuning? The policy refresher should be running regardless, right?
Done.
pkg/kv/kvserver/closedts/policy_calculation_test.go line 61 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Related to my comment on r1, let's add test cases for mid/low/high for each of the buckets. Mostly to serve as regression tests.
Done.
pkg/kv/kvserver/closedts/sidetransport/sender_test.go line 392 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Nice test! 💯
Thanks 😃
arulajmani
left a comment
modulo some nits and one comment about how we want to handle the case where latency information isn't available for a peer. We should add a test for that case in this patch before merging, but this is very close.
Reviewed 2 of 13 files at r5, 1 of 11 files at r6, 10 of 10 files at r9, 10 of 10 files at r10, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
I initially added tests for them in the second commit. I've now moved the
`computeNetworkRTTBasedOnPolicy` test to the first commit and rebased.
Great, thanks!
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Done.
nit: Consider moving everything after the first sentence inline. So this would look something like:
// computeNetworkRTTBasedOnPolicy returns the network RTT based on the closed
// timestamp policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
switch {
case policy == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
// If no latency info is available, return the default max RTT.
return defaultMaxNetworkRTT
case policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
// For known latency buckets, we return the midpoint RTT for the bucket.
// The midpoint RTT for bucket N is (N+0.5)*interval.
bucket := int(policy) - int(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO) - 1
return time.Duration((float64(bucket) + 0.5) * float64(closedTimestampPolicyBucketWidth.Nanoseconds()))
default:
panic(fmt.Sprintf("unknown closed timestamp policy: %s", policy))
}
}

The idea being that a casual reader doesn't need to know about these intricate details.
-- commits line 48 at r10:
"adds newly added" seems redundant. Should we just say "adds"?
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r10 (raw file):
// bucket.
func FindBucketBasedOnNetworkRTT(networkRTT time.Duration) ctpb.RangeClosedTimestampPolicy {
	// If maxLatency is negative (i.e. no peer latency is provided), It returns
nit: s/It returns/return/g
pkg/kv/kvserver/closedts/policy_calculation.go line 27 at r10 (raw file):
	return ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS
}

// It divides RTT by policy interval, adds 1 for zero-based indexing, and
nit: let's reword this comment as well, similar to other suggestions, now that we've inlined it, so as to not use "It" at the start.
pkg/kv/kvserver/replica_closedts.go line 32 at r10 (raw file):
	return policy, target
}

// If the range's policy is not found in targetByPolicy, it falls back to
nit: s/it falls back/fall back/g
pkg/kv/kvserver/replica_closedts.go line 37 at r10 (raw file):
// This fallback handles race conditions between the policy refresher and side
// transport during cluster version upgrades. The race can occur in this
// sequence: 1. Side transport prepares a policy map before cluster upgrade is
nit: should we have these bullets be on separate lines?
pkg/kv/kvserver/replica_closedts.go line 44 at r10 (raw file):
//
// In this case, we temporarily fall back to the no-latency policy until the
// side transport catches up with the new cluster version.
Would you mind adding a TODO to remove this once version compatibility with 25.1 is no longer a concern? At that point, maybe we could refactor some of this because we could expand the assertion below to panic if ok == false in all cases.
pkg/kv/kvserver/replica.go line 1364 at r10 (raw file):
}

// For ranges serving global reads, it determines the maximum latency
nit: "it determines" sounds off when inlined. Should we just say "For ranges serving global reads, determine the maximum leaseholder-to-peer replica latency using the provided map and set an appropriate policy bucket. This then controls how far in the future timestamps will be closed for the range."
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
maxLatency := time.Duration(-1)
for _, peer := range r.shMu.state.Desc.InternalReplicas {
	if peerLatency, ok := latencies[peer.NodeID]; ok {
If a peer isn't in the map, should we be instead using defaultMaxNetworkRTT instead of skipping it? Separately, should we log (maybe a log.Every) if that's the case?
wenyihu6
left a comment
Addressed your comment and added a new test - TestRefreshPolicyWithVariousLatencies.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
"adds newly added" seems redundant. Should we just say "adds"?
Done.
pkg/kv/kvserver/replica.go line 1364 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "it determines" sounds off when inlined. Should we just say "For ranges serving global reads, determine the maximum leaseholder-to-peer replica latency using the provided map and set an appropriate policy bucket. This then controls how far in the future timestamps will be closed for the range."
Done.
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
If a peer isn't in the map, should we instead be using
`defaultMaxNetworkRTT` instead of skipping it? Separately, should we log (maybe a log.Every) if that's the case?
Using defaultMaxNetworkRTT for peer replica latency if it is missing now. I made a note about the logs in #143890 which I plan to address as we add more metrics for this work.
pkg/kv/kvserver/replica_closedts.go line 32 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/it falls back/fall back/g
Done.
pkg/kv/kvserver/replica_closedts.go line 37 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: should we have these bullets be on separate lines?
Done.
pkg/kv/kvserver/replica_closedts.go line 44 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Would you mind adding a TODO to remove this once version compatibility with 25.1 is no longer a concern? At that point, maybe we could refactor some of this because we could expand the assertion below to panic if ok == false in all cases.
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: Consider moving everything after the first sentence inline. So this would look something like:
// computeNetworkRTTBasedOnPolicy returns the network RTT based on the closed
// timestamp policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
	switch {
	case policy == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
		// If no latency info is available, return the default max RTT.
		return defaultMaxNetworkRTT
	case policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
		policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
		// For known latency buckets, we return the midpoint RTT for the bucket.
		// The midpoint RTT for bucket N is (N+0.5)*interval.
		bucket := int(policy) - int(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO) - 1
		return time.Duration((float64(bucket) + 0.5) * float64(closedTimestampPolicyBucketWidth.Nanoseconds()))
	default:
		panic(fmt.Sprintf("unknown closed timestamp policy: %s", policy))
	}
}

The idea being that a casual reader doesn't need to know about these intricate details.
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/It returns/return/g
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 27 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: let's reword this comment as well, similar to other suggestions, now that we've inlined it, so as to not use "It" at the start.
Done.
arulajmani
left a comment
Let's ship it
Reviewed 15 of 15 files at r11, 1 of 1 files at r12, 6 of 6 files at r13, 12 of 12 files at r14, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Using `defaultMaxNetworkRTT` for peer replica latency if it is missing now. I made a note about the logs in #143890 which I plan to address as we add more metrics for this work.
Adding the logs later sounds good!
pkg/kv/kvserver/replica_closedts_test.go line 1157 at r14 (raw file):
latencies: map[roachpb.NodeID]time.Duration{
	tc.Target(0).NodeID: 10 * time.Millisecond,
	// tc.Target(1,2).NodeID is missing, simulating nil.
I'm not sure what you mean by "simulating nil". Would it be clearer if you just said // tc.Target(1, 2).NodeID are missing.
Here, and elsewhere.
wenyihu6
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
pkg/kv/kvserver/replica_closedts_test.go line 1157 at r14 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I'm not sure what you mean by "simulating nil". Would it be clearer if you just said // tc.Target(1, 2).NodeID are missing.
Here, and elsewhere.
Removed.
Rebased on master to pick up the latest changes this PR is stacked on top of. |
Extended CI failure looks like an infra flake - "Interrupted". Kicked off another run for it. Thanks for the series of reviews 🥳! bors r=arulajmani |
Build succeeded: |
kvserver/closedts: add latency based closed timestamp policies
This patch adds latency based closed timestamp policies to
ctpb.LatencyBasedRangeClosedTimestampPolicy.
Part of: #59680
Release note: none
kvserver/closedts: auto-tune closed ts
For global table ranges, we aim to ensure follower replicas can serve
present-time reads by calculating a target closed timestamp that leads the
current time (using the `LEAD_FOR_GLOBAL_READS` policy). The goal of this
estimation is to be confident that by the time that all follower replicas have received
this closed timestamp, it's greater than the present time, so they can serve
reads without redirecting to the leaseholder. Note that this is an estimated
goal, not the exact lead time. It is still possible that some follower nodes are
really behind.
Calculation:

```
raft_propagation_time = max_network_rtt*1.5 + raft_overhead(20ms)
side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval(200ms by default)
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)
closed_ts_at_sender = now(sender's current clock) + maxClockOffset(500ms by default) + buffer(25ms) + closed_ts_propagation
```

Currently, `max_network_rtt` is hardcoded at 150ms, leading to a default 800ms
lead.
Problems with the current implementation:
Tuning trade-offs: A higher lead timestamp leads to high commit wait times,
which in turn impacts write latency. A lower lead timestamp may result in a
closed timestamp that lags present time, causing follower replicas to redirect,
which in turn impacts read latency.
Cluster wide setting: kv.closed_timestamp.lead_for_global_reads_override only
applies to the entire cluster, not per range. For geographically distant
replicas, a short lead time may not be enough for full application. However,
increasing the lead time cluster wide can harm write performance for replicas
that are close together and don’t require as much time to replicate.
This commit adds latency based policies to the side transport
senders.
For side transport senders, nodes create a map of closed timestamps for each
policy without consulting leaseholders. Senders pass this map to leaseholders,
who decide whether to "bump" their closed timestamp and which policy to use.
Senders then send this information to receivers so that ranges can opt in and
pick closed timestamps based on the policies they opted into.
Currently, side transport senders create only two policies, offering limited
control over closed timestamps based on range specific latencies. The new design
introduces 16 additional policies, each representing a different network latency
bucket ranges fall under. Instead of relying on a hardcoded 150ms latency for
closed timestamp calculations, ranges will now use the appropriate bucket
based on their observed network conditions.
Periodically, the policy refresher iterates over all leaseholders on a node,
updating cached closed timestamp policies based on the observed latency between
the leaseholder replica and the furthest replica. This policy is then used by
side transport and by raft side closed timestamp computation.
Instead of using the hardcoded 150ms network roundtrip latency, a more
dynamic latency will be used based on the latency based closed
timestamp policy the range holds.
kv.closed_timestamp.lead_for_global_reads_auto_tune.enabled can now be used
to auto-tune the lead time that global table ranges use to publish
closed timestamps. The kv.closed_timestamp.lead_for_global_reads_override
cluster setting takes precedence over this one. If auto-tuning is disabled or
no data is observed, the system falls back to a hardcoded computed lead time.
Resolves: #59680
Release note: none