kvserver/closedts: auto-tune closed ts#142721
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
arulajmani
left a comment
Great work here! 🚀
Reminder that we could extend policy_refresher_test.go as well with a mocked out latency map. But let's do that in a followup PR.
Reviewed 4 of 11 files at r1, 15 of 15 files at r2, 11 of 11 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits line 16 at r3:
nit: "by the time that all follower replicas have received this closed timestamp, it's greater than the present time, so they can serve reads without redirecting to the leaseholder".
-- commits line 20 at r3:
"we fail to bump closed timestamps" begs a few questions -- why would this happen? maybe we could just drop this line.
-- commits line 36 at r3:
nit: a more succinct way to put this would be "higher lead time leads to high commit wait times, which in turn impacts write latency"
-- commits line 39 at r3:
"followers not fully replicating the changes" sounds off. Should this instead say "a closed timestamp that lags present time, causing follower replicas to redirect..."
-- commits line 48 at r3:
Is this stale? I would imagine this bit would be subsumed by the PolicyRefresher bit in the design.
-- commits line 63 at r3:
"the policy refresher"
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
// network RTT. Returns default max RTT for no latency info, or midpoint RTT for
// known latency buckets. Panics on unknown policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
Should we add a test for this?
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
// FindBucketBasedOnNetworkRTT maps a network RTT to a closed timestamp policy bucket.
// It divides RTT by policy interval, adds 1 for zero-based indexing, and offsets by
nit: the details about the computation don't need to live on the function level comment; instead, they're better placed closer to the logic.
pkg/kv/kvserver/replica.go line 1342 at r3 (raw file):
// RefreshPolicy updates the replica's cached closed timestamp policy based on
// zone configurations and provided node latencies.
"span configurations" at this level.
pkg/kv/kvserver/replica.go line 1346 at r3 (raw file):
// For ranges serving global reads,
// 1. If the provided map is nil, the policy will be
// LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO. The latency will be hardcoded to
nit: at this layer, because we're just refreshing the policy, I think this commentary will be clearer if we didn't add words about what latency is inferred from the policy.
If you want to add it, maybe move that bit of the commentary inside the function where we're assigning policies instead
pkg/kv/kvserver/replica.go line 1376 at r3 (raw file):
	}
}

// FindBucketBasedOnNetworkRTT returns LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO
This commentary is misplaced -- should we move it to FindBucketBasedOnNetworkRTT instead?
pkg/kv/kvserver/closedts/policy.go line 17 at r2 (raw file):
const (
	defaultMaxNetworkRTT                 = 150 * time.Millisecond
	closedTimestampPolicyLatencyInterval = 20 * time.Millisecond
How do you feel about renaming this to "closedTimestampPolicyBucketWidth"?
pkg/kv/kvserver/replica_closedts.go line 25 at r3 (raw file):
// timestamp. Falls back to LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO if the
// range's policy is not found in targetByPolicy. This fallback handles race
// conditions where the policy refresher observes cluster version upgrades
Could you say more about this cluster version upgrades race?
pkg/kv/kvserver/replica_closedts.go line 35 at r3 (raw file):
	return policy, target
}
if policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
It may be worth a comment explaining how this happens.
pkg/kv/kvserver/closedts/policy_calculation_test.go line 20 at r3 (raw file):
// TestNetworkRTTAndPolicyCalculations tests the conversion between network
// RTT and closed timestamp policy.
func TestNetworkRTTAndPolicyCalculations(t *testing.T) {
Nice test!
pkg/kv/kvserver/closedts/policy_calculation_test.go line 61 at r3 (raw file):
},
{
	name: "low-end bucket boundary",
Related to my comment on r1, let's add test cases for mid/low/high for each of the buckets. Mostly to serve as regression tests.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 310 at r3 (raw file):
}

// If the cluster is not fully upgraded, only look at closed timestamps policies
nit: consider moving this commentary closer to the logic below.
For here, maybe something like:
// maxClosedTimestampPolicy returns the distinct number of closed timestamp
// policies for which the side transport should send updates.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 317 at r3 (raw file):
// kv.closed_timestamp.policy_refresh_interval is 0.
func (s *Sender) maxClosedTimestampPolicy() int {
	// Even if no auto-tuning is disabled, still send them. No ranges would use
nit: if you take my comment above, this could be just deleted, as it's subsumed by the commentary above.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 441 at r3 (raw file):
	// LAI.
	needExplicit = true
} else if lastMsg.policy != closeRes.Policy {
Around version upgrades, is it possible that the policy has been refreshed by the range but we're not sending an update for latency buckets (presumably because maxClosedTimestampPolicy above was called when the cluster version had not been bumped)?
At the least, it would be worth trying to add a test that reproduces this race to see how things behave.
pkg/kv/kvserver/closedts/ctpb/service.proto line 42 at r2 (raw file):
// hardcoded network latency 150ms by default.
LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO = 1;
// Lead policy for ranges with global reads policy. Following policies are
nit: "The following policies"
pkg/kv/kvserver/closedts/sidetransport/sender_test.go line 392 at r3 (raw file):
// TestSenderWithLatencyTracker verifies that the sender correctly updates
// closed timestamp policies based on network latency between nodes.
func TestSenderWithLatencyTracker(t *testing.T) {
Nice test! 💯
pkg/ccl/multiregionccl/multiregion_test.go line 164 at r3 (raw file):
// This test verifies closed timestamp policy behavior with and without the
// policy refresher. With the refresher enabled, we expect latency-based
Do you mean autotuning? The policy refresher should be running regardless, right?
wenyihu6
left a comment
Sounds good, I added a new item in #143888 to track this. Thanks for the nice rewording as always!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
nit: "by the time that all follower replicas have received this closed timestamp, it's greater than the present time, so they can serve reads without redirecting to the leaseholder".
Done.
Previously, arulajmani (Arul Ajmani) wrote…
"we fail to bump closed timestamps" begs a few questions -- why would this happen? maybe we could just drop this line.
Removed for clarity. This may happen due to any of the reasons here - https://github.com/cockroachdb/cockroach/blob/872849938a5ab0c96d39bdd9274ea4897b2947e2/pkg/kv/kvserver/closedts/sidetransport/sender.go#L188-L198. In those cases, those ranges would be removed from the side transport ranges map.
Previously, arulajmani (Arul Ajmani) wrote…
nit: a more succinct way to put this would be "higher lead time leads to high commit wait times, which in turn impacts write latency"
Done.
Previously, arulajmani (Arul Ajmani) wrote…
"followers not fully replicating the changes" sounds off. Should this instead say "a closed timestamp that lags present time, causing follower replicas to redirect..."
Done.
Previously, arulajmani (Arul Ajmani) wrote…
Is this stale? I would imagine this bit would be subsumed by the
`PolicyRefresher` bit in the design.
Yes, corrected now.
Previously, arulajmani (Arul Ajmani) wrote…
"the policy refresher"
Done.
pkg/kv/kvserver/replica.go line 1342 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
"span configurations" at this level.
Done.
pkg/kv/kvserver/replica.go line 1346 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: at this layer, because we're just refreshing the policy, I think this commentary will be clearer if we didn't add words about what latency is inferred from the policy.
If you want to add it, maybe move that bit of the commentary inside the function where we're assigning policies instead
Moved the comment closer to the logic below.
pkg/kv/kvserver/replica.go line 1376 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This commentary is misplaced -- should we move it to
`FindBucketBasedOnNetworkRTT` instead?
Moved to FindBucketBasedOnNetworkRTT instead.
pkg/kv/kvserver/replica_closedts.go line 25 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Could you say more about this cluster version upgrades race?
I added more details to the race condition I was worried about. I think it is exactly what you described here - https://reviewable.io/reviews/cockroachdb/cockroach/142721#-ONKsU7WCKoy8IrrX8hq.
pkg/kv/kvserver/replica_closedts.go line 35 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
It may be worth a comment explaining how this happens.
Done.
pkg/kv/kvserver/closedts/policy.go line 17 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
How do you feel about renaming this to "closedTimestampPolicyBucketWidth"?
I like it, done.
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we add a test for this?
I initially added tests for them in the second commit. I've now moved the computeNetworkRTTBasedOnPolicy test to the first commit and rebased.
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: the details about the computation don't need to live on the function level comment; instead, they're better placed closer to the logic.
Done.
pkg/kv/kvserver/closedts/ctpb/service.proto line 42 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "The following policies"
Done.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 310 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: consider moving this commentary closer to the logic below.
For here, maybe something like:
// maxClosedTimestampPolicy returns the distinct number of closed timestamp
// policies for which the side transport should send updates.
Done.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 317 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: if you take my comment above, this could be just deleted, as it's subsumed by the commentary above.
Removed.
pkg/kv/kvserver/closedts/sidetransport/sender.go line 441 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Around version upgrades, is it possible that the policy has been refreshed by the range but we're not sending an update for latency buckets (presumably because
`maxClosedTimestampPolicy` above was called when the cluster version had not been bumped)? At the least, it would be worth trying to add a test that reproduces this race to see how things behave.
Sounds good, added another item for testing in #143888. I think it is possible, and this is what I was trying to guard against in
cockroach/pkg/kv/kvserver/replica_closedts.go
Lines 32 to 49 in e7f411e
pkg/ccl/multiregionccl/multiregion_test.go line 164 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Do you mean autotuning? The policy refresher should be running regardless, right?
Done.
pkg/kv/kvserver/closedts/policy_calculation_test.go line 61 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Related to my comment on r1, let's add test cases for mid/low/high for each of the buckets. Mostly to serve as regression tests.
Done.
pkg/kv/kvserver/closedts/sidetransport/sender_test.go line 392 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Nice test! 💯
Thanks 😃
arulajmani
left a comment
modulo some nits and one comment about how we want to handle the case where latency information isn't available for a peer. We should add a test for that case in this patch before merging, but this is very close.
Reviewed 2 of 13 files at r5, 1 of 11 files at r6, 10 of 10 files at r9, 10 of 10 files at r10, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r2 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
I initially added tests for them in the second commit. I've now moved the
`computeNetworkRTTBasedOnPolicy` test to the first commit and rebased.
Great, thanks!
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Done.
nit: Consider moving everything after the first sentence inline. So this would look something like:
// computeNetworkRTTBasedOnPolicy returns the network RTT based on the closed
// timestamp policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
switch {
case policy == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
// If no latency info is available, return the default max RTT.
return defaultMaxNetworkRTT
case policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
// For known latency buckets, we return the midpoint RTT for the bucket.
// The midpoint RTT for bucket N is (N+0.5)*interval.
bucket := int(policy) - int(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO) - 1
return time.Duration((float64(bucket) + 0.5) * float64(closedTimestampPolicyBucketWidth.Nanoseconds()))
default:
panic(fmt.Sprintf("unknown closed timestamp policy: %s", policy))
}
}

The idea being that a casual reader doesn't need to know about these intricate details.
-- commits line 48 at r10:
"adds newly added" seems redundant. Should we just say "adds"?
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r10 (raw file):
// bucket.
func FindBucketBasedOnNetworkRTT(networkRTT time.Duration) ctpb.RangeClosedTimestampPolicy {
	// If maxLatency is negative (i.e. no peer latency is provided), It returns
nit: s/It returns/return/g
pkg/kv/kvserver/closedts/policy_calculation.go line 27 at r10 (raw file):
	return ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS
}

// It divides RTT by policy interval, adds 1 for zero-based indexing, and
nit: let's reword this comment as well, similar to other suggestions, now that we've inlined it, so as to not use "It" at the start.
pkg/kv/kvserver/replica_closedts.go line 32 at r10 (raw file):
	return policy, target
}

// If the range's policy is not found in targetByPolicy, it falls back to
nit: s/it falls back/fall back/g
pkg/kv/kvserver/replica_closedts.go line 37 at r10 (raw file):
// This fallback handles race conditions between the policy refresher and side
// transport during cluster version upgrades. The race can occur in this
// sequence: 1. Side transport prepares a policy map before cluster upgrade is
nit: should we have these bullets be on separate lines?
pkg/kv/kvserver/replica_closedts.go line 44 at r10 (raw file):
//
// In this case, we temporarily fall back to the no-latency policy until the
// side transport catches up with the new cluster version.
Would you mind adding a TODO to remove this once version compatibility with 25.1 is no longer a concern? At that point, maybe we could refactor some of this because we could expand the assertion below to panic if ok == false in all cases.
pkg/kv/kvserver/replica.go line 1364 at r10 (raw file):
}

// For ranges serving global reads, it determines the maximum latency
nit: "it determines" sounds off when inlined. Should we just say "For ranges serving global reads, determine the maximum leaseholder-to-peer replica latency using the provided map and set an appropriate policy bucket. This then controls how far in the future timestamps will be closed for the range."
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
maxLatency := time.Duration(-1)
for _, peer := range r.shMu.state.Desc.InternalReplicas {
	if peerLatency, ok := latencies[peer.NodeID]; ok {
If a peer isn't in the map, should we be instead using defaultMaxNetworkRTT instead of skipping it? Separately, should we log (maybe a log.Every) if that's the case?
wenyihu6
left a comment
Addressed your comment and added a new test - TestRefreshPolicyWithVariousLatencies.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
Previously, arulajmani (Arul Ajmani) wrote…
"adds newly added" seems redundant. Should we just say "adds"?
Done.
pkg/kv/kvserver/replica.go line 1364 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: "it determines" sounds off when inlined. Should we just say "For ranges serving global reads, determine the maximum leaseholder-to-peer replica latency using the provided map and set an appropriate policy bucket. This then controls how far in the future timestamps will be closed for the range."
Done.
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
If a peer isn't in the map, should we instead be using
`defaultMaxNetworkRTT` instead of skipping it? Separately, should we log (maybe a log.Every) if that's the case?
Using defaultMaxNetworkRTT for peer replica latency if it is missing now. I made a note about the logs in #143890 which I plan to address as we add more metrics for this work.
pkg/kv/kvserver/replica_closedts.go line 32 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/it falls back/fall back/g
Done.
pkg/kv/kvserver/replica_closedts.go line 37 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: should we have these bullets be on separate lines?
Done.
pkg/kv/kvserver/replica_closedts.go line 44 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Would you mind adding a TODO to remove this once version compatibility with 25.1 is no longer a concern? At that point, maybe we could refactor some of this because we could expand the assertion below to panic if ok == false in all cases.
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 17 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: Consider moving everything after the first sentence inline. So this would look something like:
// computeNetworkRTTBasedOnPolicy returns the network RTT based on the closed
// timestamp policy.
func computeNetworkRTTBasedOnPolicy(policy ctpb.RangeClosedTimestampPolicy) time.Duration {
	switch {
	case policy == ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO:
		// If no latency info is available, return the default max RTT.
		return defaultMaxNetworkRTT
	case policy >= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_LESS_THAN_20MS &&
		policy <= ctpb.LEAD_FOR_GLOBAL_READS_LATENCY_EQUAL_OR_GREATER_THAN_300MS:
		// For known latency buckets, we return the midpoint RTT for the bucket.
		// The midpoint RTT for bucket N is (N+0.5)*interval.
		bucket := int(policy) - int(ctpb.LEAD_FOR_GLOBAL_READS_WITH_NO_LATENCY_INFO) - 1
		return time.Duration((float64(bucket) + 0.5) * float64(closedTimestampPolicyBucketWidth.Nanoseconds()))
	default:
		panic(fmt.Sprintf("unknown closed timestamp policy: %s", policy))
	}
}

The idea being that a casual reader doesn't need to know about these intricate details.
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 19 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/It returns/return/g
Done.
pkg/kv/kvserver/closedts/policy_calculation.go line 27 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: let's reword this comment as well, similar to other suggestions, now that we've inlined it, so as to not use "It" at the start.
Done.
arulajmani
left a comment
Let's ship it
Reviewed 15 of 15 files at r11, 1 of 1 files at r12, 6 of 6 files at r13, 12 of 12 files at r14, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/replica.go line 1371 at r10 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Using `defaultMaxNetworkRTT` for peer replica latency if it is missing now. I made a note about the logs in #143890 which I plan to address as we add more metrics for this work.
Adding the logs later sounds good!
pkg/kv/kvserver/replica_closedts_test.go line 1157 at r14 (raw file):
latencies: map[roachpb.NodeID]time.Duration{
	tc.Target(0).NodeID: 10 * time.Millisecond,
	// tc.Target(1,2).NodeID is missing, simulating nil.
I'm not sure what you mean by "simulating nil". Would it be clearer if you just said // tc.Target(1, 2).NodeID are missing.
Here, and elsewhere.
wenyihu6
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
pkg/kv/kvserver/replica_closedts_test.go line 1157 at r14 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I'm not sure what you mean by "simulating nil". Would it be clearer if you just said // tc.Target(1, 2).NodeID are missing.
Here, and elsewhere.
Removed.
Rebased on master to pick up the latest changes this PR is stacked on top of. |
Extended CI failure looks like an infra flake - "Interrupted". Kicked off another run for it. Thanks for the series of reviews 🥳! bors r=arulajmani |
Build succeeded: |
kvserver/closedts: add latency based closed timestamp policies
This patch adds latency based closed timestamp policies to
ctpb.LatencyBasedRangeClosedTimestampPolicy.
Part of: #59680
Release note: none
kvserver/closedts: auto-tune closed ts
For global table ranges, we aim to ensure follower replicas can serve
present-time reads by calculating a target closed timestamp that leads the
current time (using the `LEAD_FOR_GLOBAL_READS` policy). The goal of this
estimation is to be confident that by the time that all follower replicas have received
this closed timestamp, it's greater than the present time, so they can serve
reads without redirecting to the leaseholder. Note that this is an estimated
goal, not the exact lead time. It is still possible that some follower nodes are
really behind.
Calculation:

```
raft_propagation_time = max_network_rtt*1.5 + raft_overhead(20ms)
side_propagation_time = max_network_rtt*0.5 + side_transport_close_interval(200ms by default)
closed_ts_propagation = max(raft_propagation_time, side_propagation_time)
closed_ts_at_sender = now(sender's current clock) + maxClockOffset(500ms by default) + buffer(25ms) + closed_ts_propagation
```

Currently, `max_network_rtt` is hardcoded at 150ms, leading to a default 800ms
lead.
Problems with the current implementation:
Tuning trade-offs: A higher lead timestamp leads to high commit wait times,
which in turn impacts write latency. A lower lead timestamp may result in a
closed timestamp that lags present time, causing follower replicas to redirect,
which in turn impacts read latency.
Cluster wide setting: kv.closed_timestamp.lead_for_global_reads_override only
applies to the entire cluster, not per range. For geographically distant
replicas, a short lead time may not be enough for full application. However,
increasing the lead time cluster wide can harm write performance for replicas
that are close together and don’t require as much time to replicate.
This commit adds latency based policies to the side transport
senders.
For side transport senders, nodes create a map of closed timestamps for each
policy without consulting leaseholders. Senders pass this map to leaseholders,
who decide whether to "bump" their closed timestamp and which policy to use.
Senders then send this information to receivers so that ranges can opt in and
pick closed timestamps based on the policies they opted into.
Currently, side transport senders create only two policies, offering limited
control over closed timestamps based on range specific latencies. The new design
introduces 16 additional policies, each representing a different network latency
bucket ranges fall under. Instead of relying on a hardcoded 150ms latency for
closed timestamp calculations, ranges will now use the appropriate bucket
based on their observed network conditions.
Periodically, the policy refresher iterates over all leaseholders on a node,
updating cached closed timestamp policies based on the observed latency between
the leaseholder replica and the furthest replica. This policy is then used by
side transport and by raft side closed timestamp computation.
Instead of using the hardcoded 150ms network roundtrip latency, a more
dynamic latency will be used based on the latency based closed
timestamp policy the range holds.
kv.closed_timestamp.lead_for_global_reads_auto_tune.enabled can now be used
to auto-tune the lead time that global table ranges use to publish
closed timestamps. The kv.closed_timestamp.lead_for_global_reads_override
cluster setting takes precedence over this one. If auto-tuning is disabled or
no data is observed, the system falls back to a hardcoded computed lead time.
Resolves: #59680
Release note: none