
kv: introduce QueryResolvedTimestamp request#67725

Merged
craig[bot] merged 1 commit into cockroachdb:master from nvb:nvanbenschoten/getResolvedTS
Jul 27, 2021

Conversation

@nvb
Contributor

@nvb nvb commented Jul 16, 2021

Closes #67549.
Touches #67562.

This commit introduces a new QueryResolvedTimestampRequest type, which is the first step towards implementing bounded staleness reads. This new request type requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed for a key span contained in a single range by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range that overlaps with the key span of interest. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a given follower replica will similarly produce a nondecreasing sequence of resolved timestamps.

QueryResolvedTimestampRequest returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to the same replicas.
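As a rough illustration of the two minimums described above, here is a toy Go sketch. It is not the PR's implementation: plain int64 nanosecond counts and hypothetical helper names (`rangeResolvedTS`, `spanResolvedTS`) stand in for `hlc.Timestamp`, its `Prev()` method, and the real evaluation code.

```go
package main

import "fmt"

// rangeResolvedTS computes a single range's resolved timestamp as the
// minimum of its closed timestamp and the timestamp immediately
// preceding its earliest overlapping intent.
func rangeResolvedTS(closedTS int64, intentTSs []int64) int64 {
	resolved := closedTS
	for _, ts := range intentTSs {
		if prev := ts - 1; prev < resolved {
			resolved = prev
		}
	}
	return resolved
}

// spanResolvedTS takes the minimum across the per-range results of every
// replica contacted to cover the key span, mirroring how
// QueryResolvedTimestamp combines responses.
func spanResolvedTS(perRange []int64) int64 {
	min := perRange[0]
	for _, ts := range perRange[1:] {
		if ts < min {
			min = ts
		}
	}
	return min
}

func main() {
	// Range 1: closed at 100, earliest intent at 90 -> resolved 89.
	r1 := rangeResolvedTS(100, []int64{95, 90})
	// Range 2: closed at 97, no intents -> resolved 97.
	r2 := rangeResolvedTS(97, nil)
	fmt.Println(spanResolvedTS([]int64{r1, r2})) // 89
}
```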

A CONSISTENT read at or below a key span's resolved timestamp will never block on replication or on conflicting transactions. However, as can be inferred from the previous paragraph, for this to be guaranteed, the read must be issued to the same replica or set of replicas (for multi-range reads) that were consulted when computing the key span's resolved timestamp.

A resolved timestamp for a key span is a sibling concept of a resolved timestamp for a rangefeed, which is defined in pkg/kv/kvserver/rangefeed/resolved_timestamp.go. Whereas a resolved timestamp for a rangefeed refers to a timestamp below which no future updates will be published on the rangefeed, a resolved timestamp for a key span refers to a timestamp below which no future state modifications that could change the result of read requests will be made. Both concepts rely on some notion of immutability, but the former imparts this property on a stream of events while the latter imparts this property on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its use will begin in a follow-up commit that implements the "Server-side negotiation fast-path". See the bounded staleness RFC for details.

@nvb nvb requested review from a team, irfansharif and tbg July 16, 2021 21:09
@nvb nvb requested a review from a team as a code owner July 16, 2021 21:09
@cockroach-teamcity
Member

This change is Reviewable

@nvb nvb removed the request for review from a team July 16, 2021 21:09
@nvb nvb force-pushed the nvanbenschoten/getResolvedTS branch 4 times, most recently from 5161429 to 90cf2fd Compare July 19, 2021 16:08
Member

@tbg tbg left a comment


The "The resolved timestamp of a key span is ..." and "However, within a given range," paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:

A resolved timestamp (for a key span) is a timestamp at or below which all future reads within the span are guaranteed to produce the same results, i.e. at which MVCC history has become immutable. The most up-to-date such bound can be computed, for an individual range, by taking the minimum of the leaseholder's closed timestamp and the timestamp preceding the earliest intent present on the range. This optimum timestamp is nondecreasing over time, since the closed timestamp will not regress and since it also prevents intents at lower timestamps from being created. Follower replicas can also provide a resolved timestamp, though it may not be the most recent one due to replication delay. However, a follower replica will similarly produce a nondecreasing sequence of closed timestamps. The newly introduced QueryResolvedTimestamp operation returns a resolved timestamp for the input key span by returning the minimum of all replicas contacted in order to cover the key span. This means that repeated invocations of this operation will be guaranteed nondecreasing only if routed to leaseholders for all constituent ranges, or when the same replicas were consulted in both invocations.

This would motivate various s/the/a/ replacements in the commit message and commit.

I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).

Ran out of time for the actual review, but will return to that tomorrow.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):

Third, we will introduce a new `QueryResolvedTimestamp` request and response pair.
The intention is for this request type to be sent in `BatchRequests` with an
`INCONSISTENT` consistency level so that it routes to the nearest replica, skips

should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:

// INCONSISTENT reads return the latest available, committed values.
// They are more efficient, but may read stale values as pending
// intents are ignored.
INCONSISTENT ReadConsistencyType = 2

We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.

Contributor Author

@nvb nvb left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

should we s/INCONSISTENT/NEAREST? Independently, we'll also want to update its description:

// INCONSISTENT reads return the latest available, committed values.
// They are more efficient, but may read stale values as pending
// intents are ignored.
INCONSISTENT ReadConsistencyType = 2

We haven't relied on this level much but now that it's getting locked into certain semantics it's a good time to be more precise.

Yes, thanks for bringing this up. I've also been thinking that we are coupling two concepts here that should really be split out into distinct fields. First, we have the read consistency levels. I actually don't mind this concept and I think the semantics are fairly reasonable on the server-side, though they could be spelled out more explicitly. But the bigger problem is that they have an overloaded meaning on the client.

I'm thinking we should remove this overloaded meaning and introduce a new routing policy concept, which dictates the way in which the DistSender decides how to route requests to the replicas in a range. There are currently two policies:

// RoutingPolicy specifies how a request should be routed to the
// replicas of its target range by the DistSender.
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica of its target range.
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica of its target range.
  NEAREST = 1;
}

This policy would dictate this logic.

One change I'm thinking of making is having the TxnCoordSender be responsible for "promoting" a LEASEHOLDER routing policy to a NEAREST routing policy when it determines that a follower read is possible. This would avoid the hoops the DistSender currently goes through to handle this logic.

We can then extend this RoutingPolicy concept further and introduce a new policy that I imagine we'll want for stage 2 of bounded staleness reads - a SINGLE_REPLICA policy. This policy would be accompanied by a ReplicaDescriptor and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor. This is important to ensure that a QueryResolvedTimestampRequest and its follow-up ScanRequest are both sent to the same replica.

This is all a bit hazy in my mind, so let me know if this seems reasonable or if you have other thoughts on the topic.

Member

@tbg tbg left a comment


:lgtm: mod my previous comment above.

Reviewed 23 of 23 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif and @nvanbenschoten)


docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):
This all makes sense to me 👍

and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.

DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.


pkg/kv/db.go, line 718 at r1 (raw file):

	b := &Batch{}
	b.queryResolvedTimestamp(begin, end)
	b.Header.ReadConsistency = roachpb.READ_UNCOMMITTED

I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.

I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.


pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):

		{
			name:          "closed timestamp after non-overlapping intent",
			span:          [2]string{"b", "c"},

Might want to test the endpoint too where the intent is sitting at the right boundary of span.


pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):

		keySet[i] = make(roachpb.Key, len(scratchKey)+1)
		copy(keySet[i], scratchKey)
		keySet[i][len(scratchKey)] = byte(i)

What you have is fine, but if you wanted a more eye-friendly version of this:

n := len(scratchKey)
for i := range keySet {
  keySet[i] = append(scratchKey[:n:n], byte(i))
}
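For readers unfamiliar with the three-index slice in the suggestion above, a small standalone sketch of why `scratchKey[:n:n]` matters: capping the capacity at `n` forces each `append` to allocate a fresh backing array rather than writing the suffix byte into shared storage. Variable names here are illustrative, not from the PR.

```go
package main

import "fmt"

func main() {
	scratch := []byte("scratch")
	n := len(scratch)
	keys := make([][]byte, 3)
	for i := range keys {
		// scratch[:n:n] has len == cap == n, so append cannot reuse the
		// original backing array; each key gets its own copy. A plain
		// append(scratch, ...) could let later iterations overwrite the
		// suffix byte of earlier keys.
		keys[i] = append(scratch[:n:n], byte('a'+i))
	}
	fmt.Printf("%s %s %s\n", keys[0], keys[1], keys[2]) // scratcha scratchb scratchc
}
```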

pkg/kv/kvserver/replica_closedts_test.go, line 850 at r1 (raw file):

	time.Sleep(testTime)
	atomic.StoreInt32(&done, 1)
	require.NoError(t, g.Wait())

Nice test. Could see it flake under heavy stress though, since it "only" has 3s to advance the resolved timestamp.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):

// intents returned for async intent cleanup by a single QueryResolvedTimestamp
// request.
var QueryResolvedTimestampMaxEncounteredIntents = settings.RegisterIntSetting(

Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.

// MaxIntentsPerCleanupBatch is a maximum number of intents that GC will send
// for intent resolution as a single batch.
// Default value is set to half of the maximum lock table size at the time
// of writing.
// This value is subject to tuning in real environment as we have more
// data available.
var MaxIntentsPerCleanupBatch = settings.RegisterIntSetting(
	"kv.gc.intent_cleanup_batch_size",
	"if non zero, gc will split found intents into batches of this size when trying to resolve them",
	5000,
	func(batchSize int64) error {
		if batchSize < 0 {
			return errors.New("gc intent cleanup batch size must be non negative")
		}
		return nil
	},
)

// MaxIntentKeyBytesPerCleanupBatch is maximum number of intent bytes GC will try to
// send as a single batch to intent resolution. This number is approximate and
// only includes size of the intent keys.
// Default value is conservative limit to prevent pending intent key sizes from
// ballooning.
var MaxIntentKeyBytesPerCleanupBatch = settings.RegisterIntSetting(
	"kv.gc.intent_cleanup_batch_byte_size",
	"if non zero, gc will split found intents into batches of this size when trying to resolve them",
	1e6,
	func(batchSize int64) error {
		if batchSize < 0 {
			return errors.New("gc intent cleanup batch size must be non negative")
		}
		return nil
	},
)

Also, any particular reason this request has its own threshold?


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):

= 0


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):

// QueryResolvedTimestamp requests the resolved timestamp of the key span it is
// issued over. The resolved timestamp is defined as the minimum between the
// closed timestamp and the timestamp immediately preceding each intents in the

"preceding the intents"


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):

	// latches (see roachpb.INCONSISTENT) and often also on follower replicas,
	// so latches won't help them to synchronize with writes.
	closedTS := cArgs.EvalCtx.GetClosedTimestampV2(ctx)

This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).

In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):

	intentCleanupThresh := cArgs.EvalCtx.Clock().Now().Add(-intentCleanupAge.Nanoseconds(), 0)
	minIntentTS, encounteredIntents, err := computeMinIntentTimestamp(
		reader, args.Span(), int(maxEncounteredIntents), intentCleanupThresh,

int() cast needs overflow check

Contributor

@irfansharif irfansharif left a comment


LGTM.


var _ combinable = &AdminVerifyProtectedTimestampResponse{}

// Combine implements the combinable interface.
Contributor


[nit] Drop the comment or start it with the lowercase combine.


// TestQueryResolvedTimestampResolvesAbandonedIntents verifies that
// QueryResolvedTimestamp requests attempt to asynchronously resolve intents
// that they encounter once the encountered intents are sufficiently stale.
Contributor


s/the encountered intents/they

}

// TestNonBlockingReadsAtResolvedTimestamp tests that reads served at or below a
// key span's resolved timestamp never block or redirect to the leaseholder. The
Contributor


In this test are we testing that these requests aren't being redirected to the leaseholder?

@nvb nvb force-pushed the nvanbenschoten/getResolvedTS branch from 90cf2fd to a957a67 Compare July 23, 2021 06:23
Contributor Author

@nvb nvb left a comment


paragraphs directly contradict each other, plus the closed timestamp is not a concept that even applies cleanly at the range level. How about this:

This would motivate various s/the/a/ replacements in the commit message and commit.

Done.

I saw in the RFC updates how this request will be used, but it would be helpful if you briefly mentioned this in the commit/PR message, as currently it's not clear that this is an incomplete step (the new consistency level isn't there yet).

Done.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, and @tbg)


docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This all makes sense to me 👍

and would specify that a given request must be sent to that replica and DistSender should throw an error if the replica is not part of its cached range descriptor.

DistSender cache misses become pretty disastrous for the bounded staleness reads. Have we thought about this at all? I understand that we want to "prevent" meta lookups that might incur a WAN hop, but what good is a bounded staleness read if it errors out due to a need to fetch a range desc? I assume we must have the same restriction on the QueryResolvedTimestampRequest or we could pass the range descriptor along from there to the actual reads to bypass the cache problem.

I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.

The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:

batch 1:
    reqs = QueryResolvedTimestampRequest
    read_consistency = INCONSISTENT
    routing_policy = NEAREST

batch 2:
    reqs = ScanRequest
    read_consistency = CONSISTENT
    routing_policy = SINGLE_REPLICA
    routing_target = QueryResolvedTimestampResponse.Replica

if batch 2's routing target does not exist in cache, retry both batches

As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.
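The retry protocol described above can be sketched with toy types. Every name here is a hypothetical stand-in (not the DistSender API): a fake range cache that admits the target replica after one refresh, modeling "if batch 2's routing target does not exist in cache, retry both batches".

```go
package main

import (
	"errors"
	"fmt"
)

type replica int

// rangeCache is a toy stand-in for the DistSender's range descriptor cache.
type rangeCache struct{ replicas map[replica]bool }

func (c *rangeCache) contains(r replica) bool { return c.replicas[r] }
func (c *rangeCache) refresh()                { c.replicas[2] = true } // simulate a cache update on miss

// queryResolvedTimestamp stands in for batch 1: it returns a negotiated
// timestamp and the replica that served the request (here, replica 2).
func queryResolvedTimestamp() (ts int64, served replica) { return 89, 2 }

// negotiateAndRead mirrors the retry loop: negotiate a timestamp, then
// require that the follow-up read be routable to the same replica,
// refreshing the cache and retrying both steps on a miss.
func negotiateAndRead(cache *rangeCache) (int64, error) {
	for attempt := 0; attempt < 3; attempt++ {
		ts, target := queryResolvedTimestamp()
		if !cache.contains(target) {
			cache.refresh() // update the cache and retry both batches
			continue
		}
		return ts, nil // batch 2 (the scan at ts) would be sent to target here
	}
	return 0, errors.New("could not converge on a replica")
}

func main() {
	cache := &rangeCache{replicas: map[replica]bool{}}
	ts, err := negotiateAndRead(cache)
	fmt.Println(ts, err) // 89 <nil>
}
```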


pkg/kv/db.go, line 718 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I just read the doc comment for READ_UNCOMMITTED and think that one could be brushed up a bit. It can only be used with read-only ops (same as READ_INCONSISTENT, which also doesn't mention this), and intents are returned separately (i.e. not mixed into the committed data), and only for operations that support this (i.e. Get, Scan, ReverseScan, but not QueryResolvedTimestamp). Also, only CONSISTENT can be used in a txn. There are probably other rules that I forgot.

I'm also not sure we're really enforcing the above rules well. For example, can you run an AdminXYZRequest with INCONSISTENT? What about a CPut? You would never want to, but also not sure if we check anywhere.

👍 I'll see what I can do here when I start pulling on the routing policy thread.


pkg/kv/kvserver/replica_closedts_internal_test.go, line 266 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Might want to test the endpoint too where the intent is sitting at the right boundary of span.

Done.


pkg/kv/kvserver/replica_closedts_internal_test.go, line 304 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/the encountered intents/they

I think that risks some confusion about "they" referring to "QueryResolvedTimestamp requests", so I'll keep this explicit.


pkg/kv/kvserver/replica_closedts_test.go, line 732 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

In this test are we testing that these requests aren't being redirected to the leaseholder?

Yes, we're sending directly to each replica, so they would return NotLeaseholderErrors if they needed to redirect. I added a comment.


pkg/kv/kvserver/replica_closedts_test.go, line 764 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

What you have is fine, but if you wanted a more eye-friendly version of this:

n := len(scratchKey)
for i := range keySet {
  keySet[i] = append(scratchKey[:n:n], byte(i))
}

Done.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Should validate to >0, shouldn't it?
Also we've been careful in other places to use a bytes limit as well, so we should do so here, too.

// MaxIntentsPerCleanupBatch is a maximum number of intents that GC will send
// for intent resolution as a single batch.
// Default value is set to half of the maximum lock table size at the time
// of writing.
// This value is subject to tuning in real environment as we have more
// data available.
var MaxIntentsPerCleanupBatch = settings.RegisterIntSetting(
	"kv.gc.intent_cleanup_batch_size",
	"if non zero, gc will split found intents into batches of this size when trying to resolve them",
	5000,
	func(batchSize int64) error {
		if batchSize < 0 {
			return errors.New("gc intent cleanup batch size must be non negative")
		}
		return nil
	},
)

// MaxIntentKeyBytesPerCleanupBatch is maximum number of intent bytes GC will try to
// send as a single batch to intent resolution. This number is approximate and
// only includes size of the intent keys.
// Default value is conservative limit to prevent pending intent key sizes from
// ballooning.
var MaxIntentKeyBytesPerCleanupBatch = settings.RegisterIntSetting(
	"kv.gc.intent_cleanup_batch_byte_size",
	"if non zero, gc will split found intents into batches of this size when trying to resolve them",
	1e6,
	func(batchSize int64) error {
		if batchSize < 0 {
			return errors.New("gc intent cleanup batch size must be non negative")
		}
		return nil
	},
)

Also, any particular reason this request has its own threshold?

All good points. How do you feel about me using those two settings directly?


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

= 0

What did you mean by this?


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 48 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"preceding the intents"

Done.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 62 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is subtle. Doesn't it depend on how exactly a write makes the intent and new closed timestamp visible? I think here the assumption is that the intents become visible first, and then the closed timestamp gets bumped (i.e. we first commit the apply batch and then update in-mem state, as usual).

In our desired future where evaluation is purely against an immutable state snapshot, we shouldn't rely on such implementation details, but I don't see how we can clean this particular instance up right now.

Yes, this is subtle. We see variations of this with the GC threshold check as well. Avoiding this would require more than just an immutable state snapshot though. It would also require the in-memory and persistent effects of Raft entry application to be atomic from the perspective of whoever is capturing these snapshots. I haven't thought through that very far, but enforcing that does feel like it would be quite expensive.

In this case, I'm actually more ok with this ordering assumption than in most others because I would consider the inverse order (the closed timestamp being bumped before the intents are visible) to be a violation of the closed timestamp contract. It would allow an observer to see a closed timestamp and then later observe a newly written intent below that closed timestamp.

The case that concerns me more is assumptions about the order in which we update the persistent vs. in-memory GC threshold or the order in which we update the persistent vs. in-memory range descriptor.


pkg/roachpb/api.go, line 458 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] Drop the comment or start it with the lowercase combine.

Done.

@tbg tbg requested a review from irfansharif July 26, 2021 12:35
Member

@tbg tbg left a comment


My bad:

However, a given follower replica will similarly produce a nondecreasing
sequence of closed timestamps.

should be "resolved timestamps" (closed is correct too, but that's not what we're after)

Reviewed 11 of 11 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @nvanbenschoten)


docs/RFCS/20210519_bounded_staleness_reads.md, line 255 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I wasn't clear here. We wouldn't use this to avoid meta lookups for the reasons you listed. The availability model we're using for bounded staleness reads is that meta lookups are fine.

The point of the SINGLE_REPLICA policy would be to ensure that the ScanRequest that follows a QueryResolvedTimestampRequest is always sent to the same replica that served the QueryResolvedTimestampRequest. So the sequence would look like:

batch 1:
    reqs = QueryResolvedTimestampRequest
    read_consistency = INCONSISTENT
    routing_policy = NEAREST

batch 2:
    reqs = ScanRequest
    read_consistency = CONSISTENT
    routing_policy = SINGLE_REPLICA
    routing_target = QueryResolvedTimestampResponse.Replica

if batch 2's routing target does not exist in cache, retry both batches

As long as the QueryResolvedTimestampRequest is using the range cache and willing to update the cache on misses, we'll eventually converge and issue both requests to the same replica.

Ah, thanks. That's what I would've expected, but the reference to "cached range descriptor" threw me off prematurely. This all sounds good.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

All good points. How do you feel about me using those two settings directly?

Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What did you mean by this?

Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

int() cast needs overflow check

Ping

@nvb nvb force-pushed the nvanbenschoten/getResolvedTS branch from a957a67 to 44986dd Compare July 26, 2021 19:36
Contributor Author

@nvb nvb left a comment


should be "resolved timestamps" (closed is correct too, but that's not what we're after)

Done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 30 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good, if we also update the verbiage on these settings to reflect the new generality. Mostly for us, in case we want to make future tweaks.

Done.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 37 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Typo. I meant >= 0 and was really saying, use settings.NonNegativeDuration.

Done.


pkg/kv/kvserver/batcheval/cmd_query_resolved_timestamp.go, line 78 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ping

Done.

Closes cockroachdb#67549.
Touches cockroachdb#67562.

This commit introduces a new QueryResolvedTimestampRequest type, which is the
first step towards implementing bounded staleness reads. This new request type
requests a resolved timestamp for the key span it is issued over.

A resolved timestamp for a key span is a timestamp at or below which all
future reads within the span are guaranteed to produce the same results, i.e.
at which MVCC history has become immutable. The most up-to-date such bound
can be computed for a key span contained in a single range by taking the
minimum of the leaseholder's closed timestamp and the timestamp preceding the
earliest intent present on the range that overlaps with the key span of
interest. This optimum timestamp is nondecreasing over time, since the closed
timestamp will not regress and since it also prevents intents at lower
timestamps from being created. Follower replicas can also provide a resolved
timestamp, though it may not be the most recent one due to replication delay.
However, a given follower replica will similarly produce a nondecreasing
sequence of resolved timestamps.

QueryResolvedTimestampRequest returns a resolved timestamp for the input key
span by returning the minimum of all replicas contacted in order to cover the
key span. This means that repeated invocations of this operation will be
guaranteed nondecreasing only if routed to the same replicas.

A CONSISTENT read at or below a key span's resolved timestamp will never
block on replication or on conflicting transactions. However, as can be
inferred from the previous paragraph, for this to be guaranteed, the read
must be issued to the same replica or set of replicas (for multi-range reads)
that were consulted when computing the key span's resolved timestamp.

A resolved timestamp for a key span is a sibling concept of a resolved timestamp
for a rangefeed, which is defined in:
  pkg/kv/kvserver/rangefeed/resolved_timestamp.go
Whereas a resolved timestamp for a rangefeed refers to a timestamp below
which no future updates will be published on the rangefeed, a resolved
timestamp for a key span refers to a timestamp below which no future state
modifications that could change the result of read requests will be made.
Both concepts rely on some notion of immutability, but the former imparts
this property on a stream of events while the latter imparts this property
on materialized state.

This commit does not begin using the new QueryResolvedTimestampRequest. Its
use will begin in a follow-up commit that implements the "Server-side
negotiation fast-path". See the bounded staleness RFC for details.
@nvb nvb force-pushed the nvanbenschoten/getResolvedTS branch from 44986dd to 024a8d9 Compare July 26, 2021 21:38
@nvb
Contributor Author

nvb commented Jul 26, 2021

bors r=tbg,irfansharif

@craig
Contributor

craig bot commented Jul 27, 2021

Build succeeded:

@craig craig bot merged commit c2db7f8 into cockroachdb:master Jul 27, 2021
@nvb nvb deleted the nvanbenschoten/getResolvedTS branch July 27, 2021 20:49
nvb added a commit to nvb/cockroach that referenced this pull request Jul 28, 2021
Touches cockroachdb#67562.

This commit introduces a new RoutingPolicy configuration that lives on a
BatchRequest header. A request's routing policy specifies how the
request should be routed to the replicas of its target range(s) by the
DistSender. There are initially two routing policies:
```
enum RoutingPolicy {
  // LEASEHOLDER means that the DistSender should route the request to the
  // leaseholder replica(s) of its target range(s).
  LEASEHOLDER = 0;
  // NEAREST means that the DistSender should route the request to the
  // nearest replica(s) of its target range(s).
  NEAREST = 1;
}
```

The default policy is `LEASEHOLDER`.

Routing policies allow us to stop overloading the use of the
ReadConsistency enum to dictate both how the client should route a
request to a server and which kinds of requests should be eligible to be
served by a given replica. Routing policies are a client-side only
concept. They do not dictate which replicas in a range are eligible to
serve the request, only which replicas are considered as targets by the
DistSender, and in which order. A request that is routed to an
ineligible replica (a function of request type, timestamp, and read
consistency) will be rejected by that replica and the DistSender will
target another replica in the range.

As discussed in cockroachdb#67725 (review),
we will likely need to introduce a third routing policy called
`SINGLE_REPLICA` to address cockroachdb#67554. This policy would be accompanied by
a ReplicaDescriptor and would specify that a given request must be sent
to that replica and DistSender should throw an error if the replica is
not part of its cached range descriptor. This is important to ensure
that a QueryResolvedTimestampRequest and its follow-up ScanRequest are
both sent to the same replica.
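The routing policies in the commit above can be illustrated as a replica-ordering function. This is an illustrative sketch with a hypothetical, simplified `Replica` descriptor, not CockroachDB's actual DistSender logic:

```go
package main

import (
	"fmt"
	"sort"
)

// RoutingPolicy mirrors the enum in the commit message.
type RoutingPolicy int

const (
	LEASEHOLDER RoutingPolicy = 0
	NEAREST     RoutingPolicy = 1
)

// Replica is a hypothetical, simplified replica descriptor: its ID,
// whether it holds the lease, and its network latency from the client.
type Replica struct {
	ID            int
	IsLeaseholder bool
	LatencyMS     int
}

// orderReplicas returns the replicas in the order a DistSender-like
// component would try them under the given policy. Routing is a
// client-side preference only: an ineligible replica rejects the
// request and the next replica in the order is tried.
func orderReplicas(replicas []Replica, policy RoutingPolicy) []Replica {
	out := append([]Replica(nil), replicas...)
	sort.SliceStable(out, func(i, j int) bool {
		if policy == LEASEHOLDER && out[i].IsLeaseholder != out[j].IsLeaseholder {
			// Leaseholder first, then by proximity.
			return out[i].IsLeaseholder
		}
		return out[i].LatencyMS < out[j].LatencyMS
	})
	return out
}

func main() {
	replicas := []Replica{
		{ID: 1, LatencyMS: 2},
		{ID: 2, IsLeaseholder: true, LatencyMS: 40},
		{ID: 3, LatencyMS: 15},
	}
	fmt.Println(orderReplicas(replicas, LEASEHOLDER)[0].ID) // leaseholder first
	fmt.Println(orderReplicas(replicas, NEAREST)[0].ID)     // nearest first
}
```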
nvb added a commit to nvb/cockroach that referenced this pull request Jul 29, 2021
otan pushed a commit to otan-cockroach/cockroach that referenced this pull request Jul 29, 2021
nvb added a commit to nvb/cockroach that referenced this pull request Aug 2, 2021
nvb added a commit to nvb/cockroach that referenced this pull request Aug 5, 2021
craig bot pushed a commit that referenced this pull request Aug 5, 2021
68191: kv: introduce request RoutingPolicy configuration r=nvanbenschoten a=nvanbenschoten

Half of #67551.
Touches #67562.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
sajjadrizvi pushed a commit to sajjadrizvi/cockroach that referenced this pull request Aug 10, 2021
Development

Successfully merging this pull request may close these issues.

kv: introduce QueryResolvedTimestamp request

4 participants