kv: introduce QueryResolvedTimestamp request #67725
craig[bot] merged 1 commit into cockroachdb:master from
Conversation
Force-pushed from 5161429 to 90cf2fd
tbg
left a comment
The "The resolved timestamp of a key span is ..." and "However, within a given range," paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:
A resolved timestamp (for a key span) is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed, for an individual range, by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a follower replica will similarly produce a nondecreasing sequence of closed timestamps. The newly introduced
QueryResolvedTimestamp operation returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to leaseholders for all constituent ranges, or when the same replicas were consulted in both invocations.
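The min-of-closed-and-intents computation described above can be sketched in a few lines of Go. This is an illustration only: the `Timestamp` type and the `resolvedTimestamp` helper below are simplified stand-ins (wall-clock nanos rather than `hlc.Timestamp`), not the real CockroachDB API.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for CockroachDB's hlc.Timestamp,
// reduced to a single nanosecond count for illustration.
type Timestamp int64

// Prev returns the timestamp immediately preceding t.
func (t Timestamp) Prev() Timestamp { return t - 1 }

// resolvedTimestamp computes a range's resolved timestamp as the minimum of
// the closed timestamp and the timestamp preceding the earliest intent.
// hasIntent is false when no intent overlaps the key span of interest, in
// which case the closed timestamp alone bounds the resolved timestamp.
func resolvedTimestamp(closedTS, earliestIntentTS Timestamp, hasIntent bool) Timestamp {
	if !hasIntent {
		return closedTS
	}
	if prev := earliestIntentTS.Prev(); prev < closedTS {
		return prev
	}
	return closedTS
}

func main() {
	// Closed timestamp 100, earliest intent at 50: resolved timestamp is 49.
	fmt.Println(resolvedTimestamp(100, 50, true))
	// No overlapping intents: resolved timestamp equals the closed timestamp.
	fmt.Println(resolvedTimestamp(100, 0, false))
}
```

Because the closed timestamp is nondecreasing and also forbids new intents below it, both inputs to this minimum only move forward on a given replica, which is what makes the resulting sequence nondecreasing.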
This would motivate various s/the/a/ replacements in the commit message and commit.
I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).
Ran out of time for the actual review, but will return to that tomorrow.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Third, we will introduce a new `QueryResolvedTimestamp` request and response pair. The intention is for this request type to be sent in `BatchRequests` with an `INCONSISTENT` consistency level so that it routes to the nearest replica, skips
should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:
cockroach/pkg/roachpb/api.pb.go
Lines 58 to 61 in 1518a5c
We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.
nvb
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:
cockroach/pkg/roachpb/api.pb.go
Lines 58 to 61 in 1518a5c
We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.
Yes, thanks for bringing this up. I've also been thinking that we are coupling two concepts here that should really be split out into distinct fields. First, we have the read consistency levels. I actually don't mind this concept and I think the semantics are fairly reasonable on the server-side, though they could be spelled out more explicitly. But the bigger problem is that they have an overloaded meaning on the client.
I'm thinking we should remove this overloaded meaning and introduce a new routing policy concept, which dictates the way in which the DistSender decides how to route requests to the replicas in a range. There are currently two policies:
```
// RoutingPolicy specifies how a request should be routed to the
// replicas of its target range by the DistSender.
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica of its target range.
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica of its target range.
  NEAREST = 1;
}
```
This policy would dictate this logic.
One change I'm thinking of making is having the TxnCoordSender be responsible for "promoting" a LEASEHOLDER routing policy to a NEAREST routing policy when it determines that a follower read is possible. This would avoid the hoops the DistSender currently goes through to handle this logic.
We can then extend this RoutingPolicy concept further and introduce a new policy that I imagine we'll want for stage 2 of bounded staleness reads - a SINGLE_REPLICA policy. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.
This is all a bit hazy in my mind, so let me know if this seems reasonable or if you have other thoughts on the topic.
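To make the proposal concrete, here is a toy sketch of the client-side decision a routing policy would drive. Every type here (`Replica`, `selectReplica`, the latency field) is a hypothetical stand-in, not the DistSender's actual interface; it only illustrates that the policy selects which replica to target first, without saying anything about which replicas are eligible to serve the request.

```go
package main

import "fmt"

// RoutingPolicy mirrors the proposed enum: how the client picks a target.
type RoutingPolicy int

const (
	LEASEHOLDER RoutingPolicy = iota
	NEAREST
)

// Replica is a toy stand-in for roachpb.ReplicaDescriptor plus a measured
// latency from this node; both fields are illustrative assumptions.
type Replica struct {
	ID        int
	LatencyMS int
}

// selectReplica sketches the DistSender-side decision a routing policy would
// drive: LEASEHOLDER pins the leaseholder, NEAREST picks the lowest-latency
// replica. Eligibility to actually serve the request is decided server-side.
func selectReplica(policy RoutingPolicy, replicas []Replica, leaseholderID int) Replica {
	switch policy {
	case NEAREST:
		best := replicas[0]
		for _, r := range replicas[1:] {
			if r.LatencyMS < best.LatencyMS {
				best = r
			}
		}
		return best
	default: // LEASEHOLDER
		for _, r := range replicas {
			if r.ID == leaseholderID {
				return r
			}
		}
		return replicas[0] // fall back if the cached leaseholder is gone
	}
}

func main() {
	replicas := []Replica{{ID: 1, LatencyMS: 40}, {ID: 2, LatencyMS: 5}, {ID: 3, LatencyMS: 90}}
	fmt.Println(selectReplica(LEASEHOLDER, replicas, 1).ID) // routes to the leaseholder
	fmt.Println(selectReplica(NEAREST, replicas, 1).ID)     // routes to the lowest-latency replica
}
```

The "promotion" idea then amounts to the TxnCoordSender rewriting LEASEHOLDER to NEAREST before this selection runs, when it determines a follower read is possible.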
tbg
left a comment
LGTM mod my previous comment above.
Reviewed 23 of 23 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
This all makes sense to me 👍
and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.
DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.
pkg/kv/db.go, line 718 at r1 (raw file):
```go
b := &Batch{}
b.queryResolvedTimestamp(begin, end)
b.Header.ReadConsistency = roachpb.READ_UNCOMMITTED
```
I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.
I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):
```go
{
	name: "closed timestamp after non-overlapping intent",
	span: [2]string{"b", "c"},
```
Might want to test the endpoint too where the intent is sitting at the right boundary of span.
pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):
```go
keySet[i] = make(roachpb.Key, len(scratchKey)+1)
copy(keySet[i], scratchKey)
keySet[i][len(scratchKey)] = byte(i)
```
What you have is fine, but if you wanted a more eye-friendly version of this:
```go
n := len(scratchKey)
for i := range keySet {
	keySet[i] = append(scratchKey[:n:n], byte(i))
}
```
pkg/kv/kvserver/replica_closedts_test.go, line 850 at r1 (raw file):
```go
time.Sleep(testTime)
atomic.StoreInt32(&done, 1)
require.NoError(t, g.Wait())
```
Nice test. Could see it flake under heavy stress though, since it "only" has 3s to advance the resolved timestamp.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
```go
// intents returned for async intent cleanup by a single QueryResolvedTimestamp
// request.
var QueryResolvedTimestampMaxEncounteredIntents = settings.RegisterIntSetting(
```
Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.
cockroach/pkg/kv/kvserver/gc/gc.go
Lines 63 to 96 in ce1f98f
Also, any particular reason this request has its own threshold?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
= 0
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):
```go
// QueryResolvedTimestamp requests the resolved timestamp of the key span it is
// issued over. The resolved timestamp is defined as the minimum between the
// closed timestamp and the timestamp immediately preceding each intents in the
```
"preceding the intents"
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):
```go
// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
// so latches won't help them to synchronize with writes.
closedTS := cArgs.EvalCtx.GetClosedTimestampV2(ctx)
```
This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).
In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
```go
intentCleanupThresh := cArgs.EvalCtx.Clock().Now().Add(-intentCleanupAge.Nanoseconds(), 0)
minIntentTS, encounteredIntents, err := computeMinIntentTimestamp(
	reader, args.Span(), int(maxEncounteredIntents), intentCleanupThresh,
```
int() cast needs overflow check
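The overflow concern is that on 32-bit platforms an `int(...)` conversion from an `int64` setting value silently truncates. A minimal checked-conversion sketch (the `toInt` helper below is hypothetical, not a CockroachDB utility):

```go
package main

import (
	"fmt"
	"math"
)

// toInt converts an int64 (e.g. a cluster setting's value) to int, returning
// an error instead of silently truncating on platforms where int is 32 bits
// wide. math.MaxInt and math.MinInt are platform-dependent, so this check is
// a no-op on 64-bit platforms and catches real truncation on 32-bit ones.
func toInt(v int64) (int, error) {
	if v > math.MaxInt || v < math.MinInt {
		return 0, fmt.Errorf("value %d overflows int", v)
	}
	return int(v), nil
}

func main() {
	n, err := toInt(1024)
	fmt.Println(n, err)
}
```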
pkg/roachpb/api.go (Outdated)
```go
var _ combinable = &AdminVerifyProtectedTimestampResponse{}

// Combine implements the combinable interface.
```
```go
// TestQueryResolvedTimestampResolvesAbandonedIntents verifies that
// QueryResolvedTimestamp requests attempt to asynchronously resolve intents
// that they encounter once the encountered intents are sufficiently stale.
```
s/the encountered intents/they
```go
}

// TestNonBlockingReadsAtResolvedTimestamp tests that reads served at or below a
// key span's resolved timestamp never block or redirect to the leaseholder. The
```
In this test are we testing that these requests aren't being redirected to the leaseholder?
Force-pushed from 90cf2fd to a957a67
nvb
left a comment
paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:
This would motivate various s/the/a/ replacements in the commit message and commit.
Done.
I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).
Done.
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This all makes sense to me 👍
and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.
DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.
I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.
The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:
batch 1:
reqs = QueryResolvedTimestampRequest
read_consistency = INCONSISTENT
routing_policy = NEAREST
batch 2:
reqs = ScanRequest
read_consistency = CONSISTENT
routing_policy = SINGLE_REPLICA
routing_target = QueryResolvedTimestampResponse.Replica
if batch 2's routing target does not exist in cache, retry both batches
As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.
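The two-batch sequence and its retry-on-cache-miss loop can be sketched as follows. Everything here is a hypothetical stand-in for illustration (the `cluster`, `queryResolvedTSResp`, and `boundedStalenessRead` names, the hard-coded replica IDs); it is not the real DistSender plumbing, only the convergence argument nvb makes above.

```go
package main

import "fmt"

// queryResolvedTSResp stands in for QueryResolvedTimestampResponse: the
// resolved timestamp plus the replica that served batch 1.
type queryResolvedTSResp struct {
	ResolvedTS int64
	ReplicaID  int
}

// cluster models only the piece that matters here: the DistSender's cached
// range descriptor, as a set of known replica IDs.
type cluster struct {
	cachedReplicas map[int]bool
}

// queryResolvedTimestamp models batch 1 (INCONSISTENT, NEAREST routing),
// served by some nearby replica -- here, always replica 2.
func (c *cluster) queryResolvedTimestamp() queryResolvedTSResp {
	return queryResolvedTSResp{ResolvedTS: 100, ReplicaID: 2}
}

// scanAt models batch 2 (CONSISTENT, SINGLE_REPLICA routing pinned to
// replicaID): it errors if the target is absent from the cached descriptor.
func (c *cluster) scanAt(ts int64, replicaID int) (string, error) {
	if !c.cachedReplicas[replicaID] {
		return "", fmt.Errorf("replica %d not in cached range descriptor", replicaID)
	}
	return fmt.Sprintf("scan@%d on replica %d", ts, replicaID), nil
}

// boundedStalenessRead retries both batches until the scan lands on the same
// replica that served the resolved-timestamp query.
func boundedStalenessRead(c *cluster) string {
	for {
		resp := c.queryResolvedTimestamp()
		res, err := c.scanAt(resp.ResolvedTS, resp.ReplicaID)
		if err != nil {
			// Cache miss: refresh the descriptor and retry both batches.
			c.cachedReplicas[resp.ReplicaID] = true
			continue
		}
		return res
	}
}

func main() {
	c := &cluster{cachedReplicas: map[int]bool{1: true}}
	fmt.Println(boundedStalenessRead(c))
}
```

The key property is that retrying refreshes the cache, so repeated rounds converge on a state where both batches target the same replica.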
pkg/kv/db.go, line 718 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.
I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.
👍 I'll see what I can do here when I start pulling on the routing policy thread.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might want to test the endpoint too where the intent is sitting at the right boundary of span.
Done.
pkg/kv/kvserver/replica_closedts_internal_test.go, line 304 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
s/the encountered intents/they
I think that risks some confusion about "they" referring to "QueryResolvedTimestamp requests", so I'll keep this explicit.
pkg/kv/kvserver/replica_closedts_test.go, line 732 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
In this test are we testing that these requests aren't being redirected to the leaseholder?
Yes, we're sending directly to each replica, so they would return NotLeaseholderErrors if they needed to redirect. I added a comment.
pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
What you have is fine, but if you wanted a more eye-friendly version of this:
```go
n := len(scratchKey)
for i := range keySet {
	keySet[i] = append(scratchKey[:n:n], byte(i))
}
```
Done.
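For readers unfamiliar with the three-index slice expression used in tbg's suggestion: `scratchKey[:n:n]` caps the slice's capacity at its length, so every `append` is forced to allocate a fresh backing array instead of writing into `scratchKey`'s spare capacity, and the derived keys never alias each other. A self-contained illustration (the `makeKeys` helper is hypothetical, mirroring the test's pattern):

```go
package main

import "fmt"

// makeKeys derives count keys from scratchKey by appending one suffix byte
// to each. The full slice expression scratchKey[:n:n] sets cap == len, so
// append cannot reuse scratchKey's backing array; each key gets its own
// allocation. A plain append(scratchKey, ...) would instead write the suffix
// into the shared backing array, making every key alias the same memory.
func makeKeys(scratchKey []byte, count int) [][]byte {
	n := len(scratchKey)
	keySet := make([][]byte, count)
	for i := range keySet {
		keySet[i] = append(scratchKey[:n:n], byte('a'+i))
	}
	return keySet
}

func main() {
	scratchKey := make([]byte, 3, 8) // spare capacity beyond len
	copy(scratchKey, "key")
	for _, k := range makeKeys(scratchKey, 3) {
		fmt.Println(string(k))
	}
}
```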
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.
cockroach/pkg/kv/kvserver/gc/gc.go
Lines 63 to 96 in ce1f98f
Also, any particular reason this request has its own threshold?
All good points. How do you feel about me using those two settings directly?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
= 0
What did you mean by this?
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"preceding the intents"
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).
In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.
Yes, this is subtle. We see variations of this with the GC threshold check as well. Avoiding this would require more than just an immutable state snapshot though. It would also require the in-memory and persistent effects of Raft entry application to be atomic from the perspective of whoever is capturing these snapshots. I haven't thought through that very far, but enforcing that does feel like it would be quite expensive.
In this case, I'm actually more ok with this ordering assumption than in most others because I would consider the inverse order (the closed timestamp being bumped before the intents are visible) to be a violation of the closed timestamp contract. It would allow an observer to see a closed timestamp and then later observe a newly written intent below that closed timestamp.
The case that concerns me more is assumptions about the order in which we update the persistent vs. in-memory GC threshold or the order in which we update the persistent vs. in-memory range descriptor.
pkg/roachpb/api.go, line 458 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
[nit] Drop the comment or start it with the lowercase combine.
Done.
tbg
left a comment
My bad:
"However, a given follower replica will similarly produce a nondecreasing sequence of closed timestamps."
should be "resolved timestamps" (closed is correct too, but that's not what we're after)
Reviewed 11 of 11 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)
docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.
The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:
batch 1:
  reqs = QueryResolvedTimestampRequest
  read_consistency = INCONSISTENT
  routing_policy = NEAREST
batch 2:
  reqs = ScanRequest
  read_consistency = CONSISTENT
  routing_policy = SINGLE_REPLICA
  routing_target = QueryResolvedTimestampResponse.Replica
if batch 2's routing target does not exist in cache, retry both batches
As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.
Ah, thanks. That's what I would've expected, but the reference to "cached range descriptor" threw me off prematurely. This all sounds good.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
All good points. How do you feel about me using those two settings directly?
Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
What did you mean by this?
Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
int() cast needs overflow check
Ping
Force-pushed from a957a67 to 44986dd
nvb
left a comment
should be "resolved timestamps" (closed is correct to, but that's not what we're after)
Done.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.
Done.
pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping
Done.
Closes cockroachdb#67549. Touches cockroachdb#67562.

This commit introduces a new QueryResolvedTimestampRequest type, which is the first step towards implementing bounded staleness reads. This new request type requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed for a key span contained in a single range by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range that overlaps with the key span of interest. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a given follower replica will similarly produce a nondecreasing sequence of resolved timestamps.

QueryResolvedTimestampRequest returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to the same replicas.

A CONSISTENT read at or below a key span's resolved timestamp will never block on replication or on conflicting transactions. However, as can be inferred from the previous paragraph, for this to be guaranteed, the read must be issued to the same replica or set of replicas (for multi-range reads) that were consulted when computing the key span's resolved timestamp.

A resolved timestamp for a key span is a sibling concept to a resolved timestamp for a rangefeed, which is defined in pkg/kv/kvserver/rangefeed/resolved_timestamp.go. Whereas a resolved timestamp for a rangefeed refers to a timestamp below which no future updates will be published on the rangefeed, a resolved timestamp for a key span refers to a timestamp below which no future state modifications that could change the result of read requests will be made. Both concepts rely on some notion of immutability, but the former imparts this property on a stream of events while the latter imparts this property on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its use will begin in a follow-up commit that implements the "Server-side negotiation fast-path". See the bounded staleness RFC for details.
Force-pushed from 44986dd to 024a8d9
bors r=tbg,irfansharif

Build succeeded:
Touches cockroachdb#67562.

This commit introduces a new RoutingPolicy configuration that lives on a BatchRequest header. A request's routing policy specifies how the request should be routed to the replicas of its target range(s) by the DistSender. There are initially two routing policies:
```
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica(s) of its target range(s).
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica(s) of its target range(s).
  NEAREST = 1;
}
```
The default policy is `LEASEHOLDER`.

Routing policies allow us to stop overloading the use of the ReadConsistency enum to dictate both how the client should route a request to a server and which kinds of requests should be eligible to be served by a given replica.

Routing policies are a client-side only concept. They do not dictate which replicas in a range are eligible to serve the request, only which replicas are considered as targets by the DistSender, and in which order. A request that is routed to an ineligible replica (a function of request type, timestamp, and read consistency) will be rejected by that replica and the DistSender will target another replica in the range.

As discussed in cockroachdb#67725 (review), we will likely need to introduce a third routing policy called `SINGLE_REPLICA` to address cockroachdb#67554. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.
68191: kv: introduce request RoutingPolicy configuration r=nvanbenschoten a=nvanbenschoten

Half of #67551.
Touches #67562.

This commit introduces a new RoutingPolicy configuration that lives on a BatchRequest header. A request's routing policy specifies how the request should be routed to the replicas of its target range(s) by the DistSender. There are initially two routing policies:

```
enum RoutingPolicy {
    // LEASEHOLDER means that the DistSender should route the request to the
    // leaseholder replica(s) of its target range(s).
    LEASEHOLDER = 0;

    // NEAREST means that the DistSender should route the request to the
    // nearest replica(s) of its target range(s).
    NEAREST = 1;
}
```

The default policy is `LEASEHOLDER`.

Routing policies allow us to stop overloading the use of the ReadConsistency enum to dictate both how the client should route a request to a server and which kinds of requests should be eligible to be served by a given replica.

Routing policies are a client-side only concept. They do not dictate which replicas in a range are eligible to serve the request, only which replicas are considered as targets by the DistSender, and in which order. A request that is routed to an ineligible replica (a function of request type, timestamp, and read consistency) will be rejected by that replica, and the DistSender will target another replica in the range.

As discussed in #67725 (review), we will likely need to introduce a third routing policy called `SINGLE_REPLICA` to address #67554. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica, and that the DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
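To illustrate the client-side-only nature of routing policies, here is a minimal sketch of how a policy could influence the order in which replicas are tried. All types and the `orderReplicas` helper are hypothetical simplifications for illustration; the real logic lives in the DistSender in `pkg/kv/kvclient/kvcoord` and is considerably more involved (e.g. it uses measured latencies and locality matching).

```go
package main

import (
	"fmt"
	"sort"
)

// RoutingPolicy mirrors the new enum (simplified; not the real proto type).
type RoutingPolicy int

const (
	LEASEHOLDER RoutingPolicy = iota
	NEAREST
)

// replica is a hypothetical stand-in for a replica descriptor plus
// client-side health/latency information.
type replica struct {
	nodeID        int
	isLeaseholder bool
	latencyMS     int
}

// orderReplicas returns the order in which a DistSender-like client would
// try replicas under the given policy. Note that this only controls
// targeting: an ineligible replica still rejects the request, and the
// client then moves on to the next replica in the order.
func orderReplicas(policy RoutingPolicy, replicas []replica) []replica {
	out := append([]replica(nil), replicas...)
	switch policy {
	case LEASEHOLDER:
		// Move the leaseholder to the front, keeping the rest stable.
		sort.SliceStable(out, func(i, j int) bool {
			return out[i].isLeaseholder && !out[j].isLeaseholder
		})
	case NEAREST:
		// Sort by observed latency, nearest first.
		sort.SliceStable(out, func(i, j int) bool {
			return out[i].latencyMS < out[j].latencyMS
		})
	}
	return out
}

func main() {
	rs := []replica{{1, false, 2}, {2, true, 30}, {3, false, 10}}
	fmt.Println(orderReplicas(LEASEHOLDER, rs)[0].nodeID) // leaseholder (node 2) first
	fmt.Println(orderReplicas(NEAREST, rs)[0].nodeID)     // lowest-latency replica (node 1) first
}
```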
Closes #67549.
Touches #67562.
This commit introduces a new `QueryResolvedTimestampRequest` type, which is the first step towards implementing bounded staleness reads. This new request type requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed for a key span contained in a single range by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range that overlaps with the key span of interest. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a given follower replica will similarly produce a nondecreasing sequence of resolved timestamps.
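The per-range computation described above can be sketched as follows. The `Timestamp` type, its `Prev` method, and `rangeResolvedTS` are hypothetical simplifications of CockroachDB's `hlc.Timestamp` and the server-side evaluation logic, shown only to make the min-of-two-bounds rule concrete:

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp (illustration only).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Prev returns a timestamp immediately preceding t (simplified: the real
// implementation decrements differently when Logical is zero).
func (t Timestamp) Prev() Timestamp {
	if t.Logical > 0 {
		return Timestamp{t.WallTime, t.Logical - 1}
	}
	return Timestamp{t.WallTime - 1, 0}
}

// rangeResolvedTS computes a single range's resolved timestamp as the
// minimum of the closed timestamp and the timestamp preceding the earliest
// intent that overlaps the key span of interest.
func rangeResolvedTS(closedTS Timestamp, overlappingIntentTSs []Timestamp) Timestamp {
	resolved := closedTS
	for _, ts := range overlappingIntentTSs {
		if pre := ts.Prev(); pre.Less(resolved) {
			resolved = pre
		}
	}
	return resolved
}

func main() {
	closed := Timestamp{WallTime: 100}
	intents := []Timestamp{{WallTime: 90}, {WallTime: 120}}
	// The intent at time 90 holds the resolved timestamp back to 89; the
	// intent at 120 is above the closed timestamp and has no effect.
	fmt.Println(rangeResolvedTS(closed, intents))
}
```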
QueryResolvedTimestampRequest returns a resolved timestamp for the input key span by taking the minimum of the resolved timestamps of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to the same replicas.
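The span-wide combination step, and why the result is nondecreasing only when the same replicas are consulted, can be sketched as follows (timestamps reduced to plain `int64` values for illustration; the real code operates on `hlc.Timestamp`):

```go
package main

import (
	"fmt"
	"math"
)

// spanResolvedTS combines the resolved timestamps gathered from each replica
// contacted while covering a key span: the span's resolved timestamp is
// their minimum.
func spanResolvedTS(perReplica []int64) int64 {
	resolved := int64(math.MaxInt64)
	for _, ts := range perReplica {
		if ts < resolved {
			resolved = ts
		}
	}
	return resolved
}

func main() {
	// Three constituent ranges, all answered by their leaseholders.
	fmt.Println(spanResolvedTS([]int64{100, 95, 110})) // min is 95

	// Re-routing the second range to a lagging follower can regress the
	// span-wide result, which is why repeated invocations are guaranteed
	// nondecreasing only when routed to the same replicas.
	fmt.Println(spanResolvedTS([]int64{100, 80, 110})) // min drops to 80
}
```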
A CONSISTENT read at or below a key span's resolved timestamp will never block on replication or on conflicting transactions. However, as can be inferred from the previous paragraph, for this to be guaranteed, the read must be issued to the same replica or set of replicas (for multi-range reads) that were consulted when computing the key span's resolved timestamp.
A resolved timestamp for a key span is a sibling concept to a resolved timestamp for a rangefeed, which is defined in
pkg/kv/kvserver/rangefeed/resolved_timestamp.go. Whereas a resolved timestamp for a rangefeed refers to a timestamp below which no future updates will be published on the rangefeed, a resolved timestamp for a key span refers to a timestamp below which no future state modifications that could change the result of read requests will be made. Both concepts rely on some notion of immutability, but the former imparts this property on a stream of events while the latter imparts this property on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its use will begin in a follow-up commit that implements the "Server-side negotiation fast-path". See the bounded staleness RFC for details.