kvfollowerreadsccl: plug bounded staleness into the execution engine #68967
craig[bot] merged 3 commits into cockroachdb:master
Conversation
2eb06c8 to a6d53c1
nvb
left a comment
This all looks very good! I imagine #68969 will be the more contentious part of all of this.
Reviewed 2 of 2 files at r5, 10 of 10 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):
func (bse *boundedStalenessEvents) String() string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("events (%d found):\n", len(bse.events)))
fmt.Fprintf(&sb, ... here and below.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 14 at r6 (raw file):
----
# If we try read a timestamp that is impossible to satisfy with a follower read,
"try to read"
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):
# We always do bounded staleness reads from node_idx 2, as node_idx 0 in a
# TestCluster is always the leaseholder.
query idx=2 show-events
Do we ever run `query` without `show-events`? If not, should we just bake in the directive into `query`?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):
----
1

events (1 found):
Is the idea that future improvements to bounded staleness reads will allow for multiple events?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):
# Note we have to wait until a match here, in case a follower read reads an
# older version of the data.
query idx=2 show-events wait-until-follower-read wait-until-match
Do we need wait-until-follower-read if we're also including wait-until-match? Doesn't the latter encompass the role of the former?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
query idx=2 show-events
SELECT * FROM t AS OF SYSTEM TIME with_min_timestamp(now() - '10s') WHERE pk = 1
Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly now()-10s (and no further) in the wait-until-match query and then fall progressively further behind?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
# Set a super high closed bounded staleness target and execute a schema change.
exec
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '1hr';
I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
// Section: Class 54 - Program Limit Exceeded (Cockroach extension)
// UnsatisfiableBoundedStaleness signals that the bounded staleness query
Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw BatchTimestampBeforeGCError with no PG code.
➜ ./cockroach demo
demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1
demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0
I ask because I imagine these two errors should be in the same class.
pkg/sql/row/errors.go, line 93 at r5 (raw file):
// key-value fetch to a user friendly SQL error.
func ConvertFetchError(ctx context.Context, descForKey KeyToDescTranslator, err error) error {
	var wiErr *roachpb.WriteIntentError
This is kind of an interesting pattern that we don't have much precedent for, because of how new `errors.As` is. I wonder if it should look more like:
var wiErr *roachpb.WriteIntentError
var bsErr *roachpb.MinTimestampBoundUnsatisfiableError
switch {
case errors.As(err, &wiErr):
	key := wiErr.Intents[0].Key
	desc, _ := descForKey.KeyToDesc(key)
	return NewLockNotAvailableError(ctx, desc, key, wiErr.Reason)
case errors.As(err, &bsErr):
	return pgerror.WithCandidateCode(
		err,
		pgcode.UnsatisfiableBoundedStaleness,
	)
}
return err

Or even better, to batch the heap allocations:
var errs struct {
	wi *roachpb.WriteIntentError
	bs *roachpb.MinTimestampBoundUnsatisfiableError
}
switch {
case errors.As(err, &errs.wi):
	key := errs.wi.Intents[0].Key
	desc, _ := descForKey.KeyToDesc(key)
	return NewLockNotAvailableError(ctx, desc, key, errs.wi.Reason)
case errors.As(err, &errs.bs):
	return pgerror.WithCandidateCode(
		err,
		pgcode.UnsatisfiableBoundedStaleness,
	)
}
return err

pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
	forceProductionKVBatchSize bool,
) (*KVFetcher, error) {
	sendFn := makeKVBatchFetcherDefaultSendFunc(txn)
Consider only calling makeKVBatchFetcherDefaultSendFunc on the else branch to avoid an unnecessary heap allocation on the bounded staleness path.
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
func(ctx context.Context, txn *kv.Txn, ba roachpb.BatchRequest) (*roachpb.BatchResponse, error)
otan
left a comment
This all looks very good! I imagine #68969 will be the more contentious part of all of this.
Yeah, there's a reason why I made it a separate PR :D
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
fmt.Fprintf(&sb, ... here and below.
have to handle the error or the linter cries (yuck), it's a test so i prefer to leave it as is.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we ever run `query` without `show-events`? If not, should we just bake in the directive into `query`?
Done.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the idea that future improvements to bounded staleness reads will allow for multiple events?
Yep!
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need `wait-until-follower-read` if we're also including `wait-until-match`? Doesn't the latter encompass the role of the former?
We want `--rewrite` to not `wait-until-match`, but still `wait-until-follower-read` so it rewrites with correct output. I've added a comment in the test code.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly `now()-10s` (and no further) in the `wait-until-match` query and then fall progressively further behind?
Clock is not frozen (how do I do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
Unless I'm missing something and we should!
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?
I don't think we do, but the tests look alright? I've seen lots of CLUSTER SETTING sets throughout the code.
Is there a wait-for-cluster-setting-propagate helper in tests I should use?
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw `BatchTimestampBeforeGCError` with no PG code.

➜ ./cockroach demo
demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1
demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0

I ask because I imagine these two errors should be in the same class.
It's probably worth making a new custom error under this class (it looks like `XXC`, where XX = class, is reserved for our own internal usage).
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
That required going down deep into the stack and passing `txn` to the `txnKVFetcher` stuff, so I'm going to do the if right now (this is ~ the same as before).
rytaft
left a comment
Reviewed 1 of 18 files at r1, 3 of 3 files at r4, 1 of 2 files at r5, 1 of 11 files at r7, 8 of 10 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @rhu713, and @stevendanna)
-- commits, line 109 at r8 (raw file):
nit: keep track of?
8963b90 to 629f27d
nvb
left a comment
This should be good to rebase now, as the kv stuff just landed.
Reviewed 1 of 11 files at r7, 9 of 10 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
Clock is not frozen (how do I do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
Unless I'm missing something and we should!
Well they pass a min_timestamp of now()-10s, so if now() keeps advancing with the clock but the closed timestamp freezes, they may be unable to be served locally. But maybe that's such a rare possibility that we're ok with how this is written.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
I don't think we do, but the tests look alright? I've seen lots of CLUSTER SETTING sets throughout the code.
Is there a wait-for-cluster-setting-propagate helper in tests I should use?
I'm not quite sure. I thought we had code to read the cluster setting from each other node in the cluster to make sure it's gone into effect. But now that I think of it, that may actually be part of SET CLUSTER SETTING itself. If this isn't flaky, I wouldn't worry about it.
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, otan (Oliver Tan) wrote…
It's probably worth making a new custom error under this class (it looks like `XXC`, where XX = class, is reserved for our own internal usage).
But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
That required going down deep into the stack and passing `txn` to the `txnKVFetcher` stuff, so I'm going to do the if right now (this is ~ the same as before).
👍 I guess I also missed the fact that we were capturing bsHeader, so it would be even harder to avoid the allocation on the bounded staleness path.
nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here? Or conversely, any reason not to pull out a constructor for the bounded staleness sendFunc? The asymmetry is a bit odd to me, given that there's only one use of each.
otan
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.
as discussed offline, CDB is a new class for these errors.
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here?
It's used in another place :)

pkg/sql/row/fetcher.go
583: makeKVBatchFetcherDefaultSendFunc(txn),

But the bounded staleness one is only used once.
bors r=nvanbenschoten,rytaft tftrs!
This commit introduces a common way of converting rows to data-driven output. This abstracts the logic away from backup, which currently uses it, in the hopes that it can be used in other data-driven tests I plan to introduce.

Release note: None
Release note (sql change): Errors involving MinTimestampBoundUnsatisfiableError during a bounded staleness read now get a custom pgcode (54C01).
This commit plugs bounded staleness reads into the SQL execution layer, bridging the gap between the SQL syntax being parsed and the KV layer's bounded staleness support. When nearest_only=True, we currently give up if the schema has been recently changed.

The `makeKVBatchFetcher` and `makeKVBatchFetcherWithSendFunc` constructors have been simplified to just one `makeKVBatchFetcher` method to avoid the multiple permutations of makeKVBatchFetcher flying around.

A datadriven test of bounded staleness has been added which keeps track of relevant events (e.g. whether the read was a follower or leaseholder read). It is intended to also keep track of e.g. transaction retries in an upcoming commit.

Release note: None
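The schema-change guard mentioned in the commit (and debated later in the thread for the colfetcher) boils down to clamping the read's minimum timestamp bound to the table descriptor's modification time. A minimal sketch, with a hypothetical `timestamp` type standing in for `hlc.Timestamp` and `clampMinTimestampBound` standing in for the real adjustment logic:

```go
package main

import "fmt"

// timestamp is a hypothetical stand-in for hlc.Timestamp.
type timestamp struct{ wallNanos int64 }

// Less reports whether t is strictly earlier than o.
func (t timestamp) Less(o timestamp) bool { return t.wallNanos < o.wallNanos }

// clampMinTimestampBound raises the bounded staleness minimum bound to
// the descriptor's modification time when necessary: reading data from
// before the schema change would not correspond to the schema used to
// decode it.
func clampMinTimestampBound(minBound, modificationTime timestamp) timestamp {
	if minBound.Less(modificationTime) {
		return modificationTime
	}
	return minBound
}

func main() {
	// Schema changed after the requested min bound: bound is raised.
	fmt.Println(clampMinTimestampBound(timestamp{100}, timestamp{200}).wallNanos)
	// Schema older than the requested min bound: bound is unchanged.
	fmt.Println(clampMinTimestampBound(timestamp{300}, timestamp{200}).wallNanos)
}
```

Raising the bound narrows the window of timestamps the server may choose, which is why a recent schema change can make a nearest_only read unsatisfiable.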
Canceled.
Lost a commit during rebase, whoops! bors r=nvanbenschoten,rytaft
Build succeeded:
RaduBerinde
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
// schema.
if aost.Timestamp.Less(table.GetModificationTime()) {
	ts = table.GetModificationTime()
AFAICT, there is no corresponding code in the row execution engine. I don't think we are setting the bounded staleness header at all in the row engine? CC @yuzefovich
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
Also, can't we make this timestamp adjustment at planning time? Feels strange that we do it in the bowels of execution code.
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
Yeah, this wasn't done for the row exec engine and was an assumed limitation.
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, otan (Oliver Tan) wrote…
I might play with it a bit. Is there a test for this case, where we have to adjust the timestamp because of the descriptor modification time?
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
yes! check out
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, otan (Oliver Tan) wrote…
I assume that when using bounded staleness, we still read everything at the same timestamp, correct? So it's not useful to set the MinTimestampBound to different values for different table scans; it should be equivalent to set it to the max value, is this right?
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
I tried to do this instead:

--- a/pkg/sql/distsql_physical_planner.go
+++ b/pkg/sql/distsql_physical_planner.go
@@ -1260,6 +1260,15 @@ func (dsp *DistSQLPlanner) CheckInstanceHealthAndVersion(
 func (dsp *DistSQLPlanner) createTableReaders(
 	planCtx *PlanningCtx, n *scanNode,
 ) (*PhysicalPlan, error) {
+	// If the descriptor's modification time is after the bounded staleness min bound,
+	// we have to increase the min bound.
+	// Otherwise, we would have table data which would not correspond to the correct
+	// schema.
+	if aost := planCtx.EvalContext().AsOfSystemTime; aost != nil && aost.BoundedStaleness {
+		if mTime := n.desc.GetModificationTime(); aost.Timestamp.Less(mTime) {
+			aost.Timestamp = mTime
+		}
+	}

But this test fails to return any events: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row#L173

Any ideas? Shouldn't this be equivalent, given that we're scanning a single table?

The context here is that I'm trying to remove the table descriptor from TableReader (and replace it with a much smaller proto that describes the index encoding). I'd rather not include the modification timestamp in this smaller proto just for this use case. But I don't know much about this feature, so I'm not sure what I'm doing wrong.
I'm on my phone and it's the weekend, but I'd take a look at the kv traces and see what changed. We only log events in the tests if
so did something change? Your assumption looks right to me, but it's been a while and I won't have a chance to double check for a bit. Can take a closer look next week.
If you have a bit of time to look at it, it would be awesome. Otherwise I think I will move forward with my change and include the modification time in the new spec for now (with a TODO to remove later).
See individual commits for details.
Resolves #67556.