kv: introduce QueryResolvedTimestamp request #67725
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
Force-pushed from 5161429 to 90cf2fd
tbg
left a comment
The "The resolved timestamp of a key span is ..." and "However, within a given range," paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:
A resolved timestamp (for a key span) is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed, for an individual range, by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a follower replica will similarly produce a nondecreasing sequence of closed timestamps. The newly introduced
QueryResolvedTimestamp operation returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to leaseholders for all constituent ranges, or when the same replicas were consulted in both invocations.
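The min-of-closed-and-intents computation described above can be sketched in a few lines of Go. This is an illustration only: the `Timestamp` type and the `resolvedTimestamp` helper below are simplified stand-ins (wall-clock nanos rather than `hlc.Timestamp`), not the real CockroachDB API.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp,
// reduced to a single nanosecond count for illustration.
type Timestamp int64

// Prev returns the timestamp immediately preceding t.
func (t Timestamp) Prev() Timestamp { return t - 1 }

// resolvedTimestamp computes a range's resolved timestamp as the minimum of
// the closed timestamp and the timestamp preceding the earliest intent.
// hasIntent is false when no intent overlaps the key span of interest, in
// which case the closed timestamp alone bounds the resolved timestamp.
func resolvedTimestamp(closedTS, earliestIntentTS Timestamp, hasIntent bool) Timestamp {
	if !hasIntent {
		return closedTS
	}
	if prev := earliestIntentTS.Prev(); prev < closedTS {
		return prev
	}
	return closedTS
}

func main() {
	// Closed timestamp 100, earliest intent at 50: resolved timestamp is 49.
	fmt.Println(resolvedTimestamp(100, 50, true))
	// No overlapping intents: resolved timestamp equals the closed timestamp.
	fmt.Println(resolvedTimestamp(100, 0, false))
}
```

Because the closed timestamp is nondecreasing and also forbids new intents below it, both inputs to this minimum only move forward on a given replica, which is what makes the resulting sequence nondecreasing.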
This would motivate various s/the/a/ replacements in the commit message and commit.
I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).
Ran out of time for the actual review, but will return to that tomorrow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Third, we will introduce a new `QueryResolvedTimestamp` request and response pair. The intention is for this request type to be sent in `BatchRequests` with an `INCONSISTENT` consistency level so that it routes to the nearest replica, skips
should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:
cockroach/pkg/roachpb/api.pb.go
Lines 58 to 61 in 1518a5c
We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:
cockroach/pkg/roachpb/api.pb.go
Lines 58 to 61 in 1518a5c
We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.
Yes, thanks for bringing this up. I've also been thinking that we are coupling two concepts here that should really be split out into distinct fields. First, we have the read consistency levels. I actually don't mind this concept and I think the semantics are fairly reasonable on the server-side, though they could be spelled out more explicitly. But the bigger problem is that they have an overloaded meaning on the client.
I'm thinking we should remove this overloaded meaning and introduce a new routing policy concept, which dictates the way in which the DistSender decides how to route requests to the replicas in a range. There are currently two policies:
```
// RoutingPolicy specifies how a request should be routed to the
// replicas of its target range by the DistSender.
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica of its target range.
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica of its target range.
  NEAREST = 1;
}
```
This policy would dictate this logic.
One change I'm thinking of making is having the TxnCoordSender be responsible for "promoting" a LEASEHOLDER routing policy to a NEAREST routing policy when it determines that a follower read is possible. This would avoid the hoops the DistSender currently goes through to handle this logic.
We can then extend this RoutingPolicy concept further and introduce a new policy that I imagine we'll want for stage 2 of bounded staleness reads - a SINGLE_REPLICA policy. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.
This is all a bit hazy in my mind, so let me know if this seems reasonable or if you have other thoughts on the topic.
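To make the proposal concrete, here is a toy sketch of the client-side decision a routing policy would drive. Every type here (`Replica`, `selectReplica`, the latency field) is a hypothetical stand-in, not the DistSender's actual interface; it only illustrates that the policy selects which replica to target first, without saying anything about which replicas are eligible to serve the request.

```go
package main

import "fmt"

// RoutingPolicy mirrors the proposed enum: how the client picks a target.
type RoutingPolicy int

const (
	LEASEHOLDER RoutingPolicy = iota
	NEAREST
)

// Replica is a toy stand-in for roachpb.ReplicaDescriptor plus a measured
// latency from this node; both fields are illustrative assumptions.
type Replica struct {
	ID        int
	LatencyMS int
}

// selectReplica sketches the DistSender-side decision a routing policy would
// drive: LEASEHOLDER pins the leaseholder, NEAREST picks the lowest-latency
// replica. Eligibility to actually serve the request is decided server-side.
func selectReplica(policy RoutingPolicy, replicas []Replica, leaseholderID int) Replica {
	switch policy {
	case NEAREST:
		best := replicas[0]
		for _, r := range replicas[1:] {
			if r.LatencyMS < best.LatencyMS {
				best = r
			}
		}
		return best
	default: // LEASEHOLDER
		for _, r := range replicas {
			if r.ID == leaseholderID {
				return r
			}
		}
		return replicas[0] // fall back if the cached leaseholder is gone
	}
}

func main() {
	replicas := []Replica{{ID: 1, LatencyMS: 40}, {ID: 2, LatencyMS: 5}, {ID: 3, LatencyMS: 90}}
	fmt.Println(selectReplica(LEASEHOLDER, replicas, 1).ID) // routes to the leaseholder
	fmt.Println(selectReplica(NEAREST, replicas, 1).ID)     // routes to the lowest-latency replica
}
```

The "promotion" idea then amounts to the TxnCoordSender rewriting LEASEHOLDER to NEAREST before this selection runs, when it determines a follower read is possible.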
tbg
left a comment
LGTM mod my previous comment above.
Reviewed 23 of 23 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
This all makes sense to me 👍
and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.
DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.
pkg/kv/db.go, line 718 at r1 (raw file):
```go
b := &Batch{}
b.queryResolvedTimestamp(begin, end)
b.Header.ReadConsistency = roachpb.READ_UNCOMMITTED
```
I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.
I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):
```go
{
	name: "closed timestamp after non-overlapping intent",
	span: [2]string{"b", "c"},
```
Might want to test the endpoint too where the intent is sitting at the right boundary of span.
pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):
```go
keySet[i] = make(roachpb.Key, len(scratchKey)+1)
copy(keySet[i], scratchKey)
keySet[i][len(scratchKey)] = byte(i)
```
What you have is fine, but if you wanted a more eye-friendly version of this:
```go
n := len(scratchKey)
for i := range keySet {
	keySet[i] = append(scratchKey[:n:n], byte(i))
}
```
pkg/kv/kvserver/replica_closedts_test.go, line 850 at r1 (raw file):
```go
time.Sleep(testTime)
atomic.StoreInt32(&done, 1)
require.NoError(t, g.Wait())
```
Nice test. Could see it flake under heavy stress though, since it "only" has 3s to advance the resolved timestamp.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
```go
// intents returned for async intent cleanup by a single QueryResolvedTimestamp
// request.
var QueryResolvedTimestampMaxEncounteredIntents = settings.RegisterIntSetting(
```
Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.
cockroach/pkg/kv/kvserver/gc/gc.go
Lines 63 to 96 in ce1f98f
Also, any particular reason this request has its own threshold?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
= 0
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):
```go
// QueryResolvedTimestamp requests the resolved timestamp of the key span it is
// issued over. The resolved timestamp is defined as the minimum between the
// closed timestamp and the timestamp immediately preceding each intents in the
```
"preceding the intents"
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):
```go
// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
// so latches won't help them to synchronize with writes.
closedTS := cArgs.EvalCtx.GetClosedTimestampV2(ctx)
```
This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).
In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
```go
intentCleanupThresh := cArgs.EvalCtx.Clock().Now().Add(-intentCleanupAge.Nanoseconds(), 0)
minIntentTS, encounteredIntents, err := computeMinIntentTimestamp(
	reader, args.Span(), int(maxEncounteredIntents), intentCleanupThresh,
```
int() cast needs overflow check
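The overflow concern is that on 32-bit platforms an `int(...)` conversion from an `int64` setting value silently truncates. A minimal checked-conversion sketch (the `toInt` helper below is hypothetical, not a CockroachDB utility):

```go
package main

import (
	"fmt"
	"math"
)

// toInt converts an int64 (e.g. a cluster setting's value) to int, returning
// an error instead of silently truncating on platforms where int is 32 bits
// wide. math.MaxInt and math.MinInt are platform-dependent, so this check is
// a no-op on 64-bit platforms and catches real truncation on 32-bit ones.
func toInt(v int64) (int, error) {
	if v > math.MaxInt || v < math.MinInt {
		return 0, fmt.Errorf("value %d overflows int", v)
	}
	return int(v), nil
}

func main() {
	n, err := toInt(1024)
	fmt.Println(n, err)
}
```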
pkg/roachpb/api.go (Outdated)
```go
var _ combinable = &AdminVerifyProtectedTimestampResponse{}

// Combine implements the combinable interface.
```
```go
// TestQueryResolvedTimestampResolvesAbandonedIntents verifies that
// QueryResolvedTimestamp requests attempt to asynchronously resolve intents
// that they encounter once the encountered intents are sufficiently stale.
```
s/the encountered intents/they
```go
}

// TestNonBlockingReadsAtResolvedTimestamp tests that reads served at or below a
// key span's resolved timestamp never block or redirect to the leaseholder. The
```
In this test are we testing that these requests aren't being redirected to the leaseholder?
Force-pushed from 90cf2fd to a957a67
nvb
left a comment
paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:
This would motivate various s/the/a/ replacements in the commit message and commit.
Done.
I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).
Done.
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This all makes sense to me 👍
and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.
DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.
I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.
The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:
batch 1:
reqs = QueryResolvedTimestampRequest
read_consistency = INCONSISTENT
routing_policy = NEAREST
batch 2:
reqs = ScanRequest
read_consistency = CONSISTENT
routing_policy = SINGLE_REPLICA
routing_target = QueryResolvedTimestampResponse.Replica
if batch 2's routing target does not exist in cache, retry both batches
As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.
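The two-batch sequence and its retry-on-cache-miss loop can be sketched as follows. Everything here is a hypothetical stand-in for illustration (the `cluster`, `queryResolvedTSResp`, and `boundedStalenessRead` names, the hard-coded replica IDs); it is not the real DistSender plumbing, only the convergence argument nvb makes above.

```go
package main

import "fmt"

// queryResolvedTSResp stands in for QueryResolvedTimestampResponse: the
// resolved timestamp plus the replica that served batch 1.
type queryResolvedTSResp struct {
	ResolvedTS int64
	ReplicaID  int
}

// cluster models only the piece that matters here: the DistSender's cached
// range descriptor, as a set of known replica IDs.
type cluster struct {
	cachedReplicas map[int]bool
}

// queryResolvedTimestamp models batch 1 (INCONSISTENT, NEAREST routing),
// served by some nearby replica -- here, always replica 2.
func (c *cluster) queryResolvedTimestamp() queryResolvedTSResp {
	return queryResolvedTSResp{ResolvedTS: 100, ReplicaID: 2}
}

// scanAt models batch 2 (CONSISTENT, SINGLE_REPLICA routing pinned to
// replicaID): it errors if the target is absent from the cached descriptor.
func (c *cluster) scanAt(ts int64, replicaID int) (string, error) {
	if !c.cachedReplicas[replicaID] {
		return "", fmt.Errorf("replica %d not in cached range descriptor", replicaID)
	}
	return fmt.Sprintf("scan@%d on replica %d", ts, replicaID), nil
}

// boundedStalenessRead retries both batches until the scan lands on the same
// replica that served the resolved-timestamp query.
func boundedStalenessRead(c *cluster) string {
	for {
		resp := c.queryResolvedTimestamp()
		res, err := c.scanAt(resp.ResolvedTS, resp.ReplicaID)
		if err != nil {
			// Cache miss: refresh the descriptor and retry both batches.
			c.cachedReplicas[resp.ReplicaID] = true
			continue
		}
		return res
	}
}

func main() {
	c := &cluster{cachedReplicas: map[int]bool{1: true}}
	fmt.Println(boundedStalenessRead(c))
}
```

The key property is that retrying refreshes the cache, so repeated rounds converge on a state where both batches target the same replica.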
pkg/kv/db.go, line 718 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.
I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.
👍 I'll see what I can do here when I start pulling on the routing policy thread.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might want to test the endpoint too where the intent is sitting at the right boundary of span.
Done.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 304 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/the encountered intents/they
I think that risks some confusion about "they" referring to "QueryResolvedTimestamp requests", so I'll keep this explicit.
pkg/kv/kvserver/replica_closedts_test.go, line 732 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
In this test are we testing that these requests aren't being redirected to the leaseholder?
Yes, we're sending directly to each replica, so they would return NotLeaseholderErrors if they needed to redirect. I added a comment.
pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What you have is fine, but if you wanted a more eye-friendly version of this:
```go
n := len(scratchKey)
for i := range keySet {
	keySet[i] = append(scratchKey[:n:n], byte(i))
}
```
Done.
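For readers unfamiliar with the three-index slice expression used in tbg's suggestion: `scratchKey[:n:n]` caps the slice's capacity at its length, so every `append` is forced to allocate a fresh backing array instead of writing into `scratchKey`'s spare capacity, and the derived keys never alias each other. A self-contained illustration (the `makeKeys` helper is hypothetical, mirroring the test's pattern):

```go
package main

import "fmt"

// makeKeys derives count keys from scratchKey by appending one suffix byte
// to each. The full slice expression scratchKey[:n:n] sets cap == len, so
// append cannot reuse scratchKey's backing array; each key gets its own
// allocation. A plain append(scratchKey, ...) would instead write the suffix
// into the shared backing array, making every key alias the same memory.
func makeKeys(scratchKey []byte, count int) [][]byte {
	n := len(scratchKey)
	keySet := make([][]byte, count)
	for i := range keySet {
		keySet[i] = append(scratchKey[:n:n], byte('a'+i))
	}
	return keySet
}

func main() {
	scratchKey := make([]byte, 3, 8) // spare capacity beyond len
	copy(scratchKey, "key")
	for _, k := range makeKeys(scratchKey, 3) {
		fmt.Println(string(k))
	}
}
```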
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.
cockroach/pkg/kv/kvserver/gc/gc.go
Lines 63 to 96 in ce1f98f
Also, any particular reason this request has its own threshold?
All good points. How do you feel about me using those two settings directly?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
= 0
What did you mean by this?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"preceding the intents"
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).
In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.
Yes, this is subtle. We see variations of this with the GC threshold check as well. Avoiding this would require more than just an immutable state snapshot though. It would also require the in-memory and persistent effects of Raft entry application to be atomic from the perspective of whoever is capturing these snapshots. I haven't thought through that very far, but enforcing that does feel like it would be quite expensive.
In this case, I'm actually more ok with this ordering assumption than in most others because I would consider the inverse order (the closed timestamp being bumped before the intents are visible) to be a violation of the closed timestamp contract. It would allow an observer to see a closed timestamp and then later observe a newly written intent below that closed timestamp.
The case that concerns me more is assumptions about the order in which we update the persistent vs. in-memory GC threshold or the order in which we update the persistent vs. in-memory range descriptor.
pkg/roachpb/api.go, line 458 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Drop the comment or start it with the lowercase combine.
Done.
tbg
left a comment
My bad:
"However, a given follower replica will similarly produce a nondecreasing sequence of closed timestamps."
should be "resolved timestamps" (closed is correct too, but that's not what we're after)
Reviewed 11 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.
The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:
batch 1:
  reqs = QueryResolvedTimestampRequest
  read_consistency = INCONSISTENT
  routing_policy = NEAREST
batch 2:
  reqs = ScanRequest
  read_consistency = CONSISTENT
  routing_policy = SINGLE_REPLICA
  routing_target = QueryResolvedTimestampResponse.Replica
if batch 2's routing target does not exist in cache, retry both batches
As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.
Ah, thanks. That's what I would've expected, but the reference to "cached range descriptor" threw me off prematurely. This all sounds good.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
All good points. How do you feel about me using those two settings directly?
Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What did you mean by this?
Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
int() cast needs overflow check
Ping
Force-pushed from a957a67 to 44986dd
nvb
left a comment
should be "resolved timestamps" (closed is correct to, but that's not what we're after)
Done.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping
Done.
Closes cockroachdb#67549. Touches cockroachdb#67562.

This commit introduces a new QueryResolvedTimestampRequest type, which is the first step towards implementing bounded staleness reads. This new request type requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed for a key span contained in a single range by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range that overlaps with the key span of interest. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a given follower replica will similarly produce a nondecreasing sequence of resolved timestamps.

QueryResolvedTimestampRequest returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to the same replicas.

A CONSISTENT read at or below a key span's resolved timestamp will never block on replication or on conflicting transactions. However, as can be inferred from the previous paragraph, for this to be guaranteed, the read must be issued to the same replica or set of replicas (for multi-range reads) that were consulted when computing the key span's resolved timestamp.

A resolved timestamp for a key span is a sibling concept to a resolved timestamp for a rangefeed, which is defined in pkg/kv/kvserver/rangefeed/resolved_timestamp.go. Whereas a resolved timestamp for a rangefeed refers to a timestamp below which no future updates will be published on the rangefeed, a resolved timestamp for a key span refers to a timestamp below which no future state modifications that could change the result of read requests will be made. Both concepts rely on some notion of immutability, but the former imparts this property on a stream of events while the latter imparts this property on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its use will begin in a follow-up commit that implements the "Server-side negotiation fast-path". See the bounded staleness RFC for details.
Force-pushed from 44986dd to 024a8d9
bors r=tbg,irfansharif

Build succeeded:
Touches cockroachdb#67562.

This commit introduces a new RoutingPolicy configuration that lives on a BatchRequest header. A request's routing policy specifies how the request should be routed to the replicas of its target range(s) by the DistSender. There are initially two routing policies:
```
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica(s) of its target range(s).
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica(s) of its target range(s).
  NEAREST = 1;
}
```
The default policy is `LEASEHOLDER`.

Routing policies allow us to stop overloading the use of the ReadConsistency enum to dictate both how the client should route a request to a server and which kinds of requests should be eligible to be served by a given replica.

Routing policies are a client-side only concept. They do not dictate which replicas in a range are eligible to serve the request, only which replicas are considered as targets by the DistSender, and in which order. A request that is routed to an ineligible replica (a function of request type, timestamp, and read consistency) will be rejected by that replica and the DistSender will target another replica in the range.

As discussed in cockroachdb#67725 (review), we will likely need to introduce a third routing policy called `SINGLE_REPLICA` to address cockroachdb#67554. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.
68191: kv: introduce request RoutingPolicy configuration r=nvanbenschoten a=nvanbenschoten

Half of #67551.
Touches #67562.

This commit introduces a new RoutingPolicy configuration that lives on a BatchRequest header. A request's routing policy specifies how the request should be routed to the replicas of its target range(s) by the DistSender. There are initially two routing policies:

```
enum RoutingPolicy {
    // LEASEHOLDER means that the DistSender should route the request to the
    // leaseholder replica(s) of its target range(s).
    LEASEHOLDER = 0;

    // NEAREST means that the DistSender should route the request to the
    // nearest replica(s) of its target range(s).
    NEAREST = 1;
}
```

The default policy is `LEASEHOLDER`.

Routing policies allow us to stop overloading the use of the ReadConsistency enum to dictate both how the client should route a request to a server and which kinds of requests should be eligible to be served by a given replica.

Routing policies are a client-side only concept. They do not dictate which replicas in a range are eligible to serve the request, only which replicas are considered as targets by the DistSender, and in which order. A request that is routed to an ineligible replica (a function of request type, timestamp, and read consistency) will be rejected by that replica, and the DistSender will target another replica in the range.

As discussed in #67725 (review), we will likely need to introduce a third routing policy called `SINGLE_REPLICA` to address #67554. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica, and that the DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
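To illustrate the client-side-only nature of routing policies, here is a minimal sketch of how a policy could influence the order in which replicas are tried. All types and the `orderReplicas` helper are hypothetical simplifications for illustration; the real logic lives in the DistSender in `pkg/kv/kvclient/kvcoord` and is considerably more involved (e.g. it uses measured latencies and locality matching).

```go
package main

import (
	"fmt"
	"sort"
)

// RoutingPolicy mirrors the new enum (simplified; not the real proto type).
type RoutingPolicy int

const (
	LEASEHOLDER RoutingPolicy = iota
	NEAREST
)

// replica is a hypothetical stand-in for a replica descriptor plus
// client-side health/latency information.
type replica struct {
	nodeID        int
	isLeaseholder bool
	latencyMS     int
}

// orderReplicas returns the order in which a DistSender-like client would
// try replicas under the given policy. Note that this only controls
// targeting: an ineligible replica still rejects the request, and the
// client then moves on to the next replica in the order.
func orderReplicas(policy RoutingPolicy, replicas []replica) []replica {
	out := append([]replica(nil), replicas...)
	switch policy {
	case LEASEHOLDER:
		// Move the leaseholder to the front, keeping the rest stable.
		sort.SliceStable(out, func(i, j int) bool {
			return out[i].isLeaseholder && !out[j].isLeaseholder
		})
	case NEAREST:
		// Sort by observed latency, nearest first.
		sort.SliceStable(out, func(i, j int) bool {
			return out[i].latencyMS < out[j].latencyMS
		})
	}
	return out
}

func main() {
	rs := []replica{{1, false, 2}, {2, true, 30}, {3, false, 10}}
	fmt.Println(orderReplicas(LEASEHOLDER, rs)[0].nodeID) // leaseholder (node 2) first
	fmt.Println(orderReplicas(NEAREST, rs)[0].nodeID)     // lowest-latency replica (node 1) first
}
```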
Closes #67549.
Touches #67562.
This commit introduces a new `QueryResolvedTimestampRequest` type, which is the first step towards implementing bounded staleness reads. This new request type requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed for a key span contained in a single range by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range that overlaps with the key span of interest. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a given follower replica will similarly produce a nondecreasing sequence of resolved timestamps.
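The per-range computation described above can be sketched as follows. The `Timestamp` type, its `Prev` method, and `rangeResolvedTS` are hypothetical simplifications of CockroachDB's `hlc.Timestamp` and the server-side evaluation logic, shown only to make the min-of-two-bounds rule concrete:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp (illustration only).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Prev returns a timestamp immediately preceding t (simplified: the real
// implementation decrements differently when Logical is zero).
func (t Timestamp) Prev() Timestamp {
	if t.Logical > 0 {
		return Timestamp{t.WallTime, t.Logical - 1}
	}
	return Timestamp{t.WallTime - 1, 0}
}

// rangeResolvedTS computes a single range's resolved timestamp as the
// minimum of the closed timestamp and the timestamp preceding the earliest
// intent that overlaps the key span of interest.
func rangeResolvedTS(closedTS Timestamp, overlappingIntentTSs []Timestamp) Timestamp {
	resolved := closedTS
	for _, ts := range overlappingIntentTSs {
		if pre := ts.Prev(); pre.Less(resolved) {
			resolved = pre
		}
	}
	return resolved
}

func main() {
	closed := Timestamp{WallTime: 100}
	intents := []Timestamp{{WallTime: 90}, {WallTime: 120}}
	// The intent at time 90 holds the resolved timestamp back to 89; the
	// intent at 120 is above the closed timestamp and has no effect.
	fmt.Println(rangeResolvedTS(closed, intents))
}
```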
QueryResolvedTimestampRequest returns a resolved timestamp for the input key span by taking the minimum of the resolved timestamps of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to the same replicas.
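The span-wide combination step, and why the result is nondecreasing only when the same replicas are consulted, can be sketched as follows (timestamps reduced to plain `int64` values for illustration; the real code operates on `hlc.Timestamp`):

```go
package main

import (
	"fmt"
	"math"
)

// spanResolvedTS combines the resolved timestamps gathered from each replica
// contacted while covering a key span: the span's resolved timestamp is
// their minimum.
func spanResolvedTS(perReplica []int64) int64 {
	resolved := int64(math.MaxInt64)
	for _, ts := range perReplica {
		if ts < resolved {
			resolved = ts
		}
	}
	return resolved
}

func main() {
	// Three constituent ranges, all answered by their leaseholders.
	fmt.Println(spanResolvedTS([]int64{100, 95, 110})) // min is 95

	// Re-routing the second range to a lagging follower can regress the
	// span-wide result, which is why repeated invocations are guaranteed
	// nondecreasing only when routed to the same replicas.
	fmt.Println(spanResolvedTS([]int64{100, 80, 110})) // min drops to 80
}
```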
A CONSISTENT read at or below a key span's resolved timestamp will never block on replication or on conflicting transactions. However, as can be inferred from the previous paragraph, for this to be guaranteed, the read must be issued to the same replica or set of replicas (for multi-range reads) that were consulted when computing the key span's resolved timestamp.
A resolved timestamp for a key span is a sibling concept to a resolved timestamp for a rangefeed, which is defined in
pkg/kv/kvserver/rangefeed/resolved_timestamp.go. Whereas a resolved timestamp for a rangefeed refers to a timestamp below which no future updates will be published on the rangefeed, a resolved timestamp for a key span refers to a timestamp below which no future state modifications that could change the result of read requests will be made. Both concepts rely on some notion of immutability, but the former imparts this property on a stream of events while the latter imparts this property on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its use will begin in a follow-up commit that implements the "Server-side negotiation fast-path". See the bounded staleness RFC for details.