Skip to content

kvfollowerreadsccl: plug bounded staleness into the execution engine#68967

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
otan-cockroach:polished-br
Aug 17, 2021
Merged

kvfollowerreadsccl: plug bounded staleness into the execution engine#68967
craig[bot] merged 3 commits intocockroachdb:masterfrom
otan-cockroach:polished-br

Conversation

@otan
Copy link
Copy Markdown
Contributor

@otan otan commented Aug 15, 2021

See individual commits for details.

Resolves #67556.

@otan otan requested review from a team, nvb and rytaft August 15, 2021 22:05
@otan otan requested review from a team as code owners August 15, 2021 22:05
@otan otan requested review from rhu713 and stevendanna and removed request for a team August 15, 2021 22:05
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@otan otan changed the title kvfollowerreadsccl: plug bounded staleness reads end to end kvfollowerreadsccl: plug bounded staleness into the execution engine Aug 15, 2021
@otan otan force-pushed the polished-br branch 3 times, most recently from 2eb06c8 to a6d53c1 Compare August 16, 2021 00:18
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks very good! I imagine #68969 will be the more contention part of all of this.

Reviewed 2 of 2 files at r5, 10 of 10 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @otan, @rhu713, @rytaft, and @stevendanna)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):

func (bse *boundedStalenessEvents) String() string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("events (%d found):\n", len(bse.events)))

fmt.Fprintf(&sb, ... here and below.


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 14 at r6 (raw file):

----

# If we try read a timestamp that is impossible to satisfy with a follower read,

"try to read"


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):

# We always do bounded staleness reads from node_idx 2, as node_idx 0 in a
# TestCluster is always the leaseholder.
query idx=2 show-events

Do we ever run query without show-events? If not, should we just bake in the directive into query?


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):

----
1
events (1 found):

Is the idea that future improvements to bounded staleness reads will allow for multiple events?


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):

# Note we have to wait until a match here, in case a follower read reads an
# older version of the data.
query idx=2 show-events wait-until-follower-read wait-until-match

Do we need wait-until-follower-read if we're also including wait-until-match? Doesn't the latter encompass the role of the former?


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):


query idx=2 show-events
SELECT * FROM t AS OF SYSTEM TIME with_min_timestamp(now() - '10s') WHERE pk = 1

Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly now()-10s (and no further) in the wait-until-match query and then fall progressively further behind?


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):

# Set a super high closed bounded staleness target and execute a schema change.
exec
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '1hr';

I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?


pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):

	// Section: Class 54 - Program Limit Exceeded (Cockroach extension)

	// UnsatisfiableBoundedStaleness signals that the bounded staleness query

Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw BatchTimestampBeforeGCError with no PG code.

➜ ./cockroach demo

demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1

demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0

I ask because I imagine these two errors should be in the same class.


pkg/sql/row/errors.go, line 93 at r5 (raw file):

// key-value fetch to a user friendly SQL error.
func ConvertFetchError(ctx context.Context, descForKey KeyToDescTranslator, err error) error {
	var wiErr *roachpb.WriteIntentError

This is kind of an interesting pattern that we don't have much precedent for because of how new errors.As is. I wonder if it should look more like:

var wiErr *roachpb.WriteIntentError
var bsErr *roachpb.MinTimestampBoundUnsatisfiableError
switch {
case errors.As(err, &wiErr):
    key := wiErr.Intents[0].Key
    desc, _ := descForKey.KeyToDesc(key)
    return NewLockNotAvailableError(ctx, desc, key, wiErr.Reason)
case errors.As(err, &bsErr):
    return pgerror.WithCandidateCode(
        err,
        pgcode.UnsatisfiableBoundedStaleness,
    )
}
return err

Or even better, to batch the heap allocations:

var errs struct {
    wi *roachpb.WriteIntentError
    bs *roachpb.MinTimestampBoundUnsatisfiableError
}
switch {
case errors.As(err, &errs.wi):
    key := errs.wi.Intents[0].Key
    desc, _ := descForKey.KeyToDesc(key)
    return NewLockNotAvailableError(ctx, desc, key, errs.wi.Reason)
case errors.As(err, &errs.bs):
    return pgerror.WithCandidateCode(
        err,
        pgcode.UnsatisfiableBoundedStaleness,
    )
}
return err

pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):

	forceProductionKVBatchSize bool,
) (*KVFetcher, error) {
	sendFn := makeKVBatchFetcherDefaultSendFunc(txn)

Consider only calling makeKVBatchFetcherDefaultSendFunc on the else branch to avoid an unnecessary heap allocation on the bounded staleness path.

Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:

func(ctx context.Context, txn *kv.Txn, ba roachpb.BatchRequest) (*roachpb.BatchResponse, error)

@otan otan requested a review from a team as a code owner August 16, 2021 21:59
Copy link
Copy Markdown
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks very good! I imagine #68969 will be the more contention part of all of this.

Yeah, there's a reason why I made it a separate PR :D

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @rhu713, @rytaft, and @stevendanna)


pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

fmt.Fprintf(&sb, ... here and below.

have to handle the error or the linter cries (yuck), it's a test so i prefer to leave it as is.


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we ever run query without show-events? If not, should we just bake in the directive into query?

Done.


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is the idea that future improvements to bounded staleness reads will allow for multiple events?

Yep!


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need wait-until-follower-read if we're also including wait-until-match? Doesn't the latter encompass the role of the former?

We want --rewrite to not wait-until-match, but still wait-until-follower-read so it rewrites with correct output. i've added a comment in the test codes.


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly now()-10s (and no further) in the wait-until-match query and then fall progressively further behind?

Clock is not frozen (how do i do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
unless i'm missing something and we should!


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?

i don't think we do, looks like the tests are alright though? i've seen lots of CLUSTER SETTING sets throughout code.
is there a wait-for-cluster-setting-propagate in tests i should use?


pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw BatchTimestampBeforeGCError with no PG code.

➜ ./cockroach demo

demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1

demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0

I ask because I imagine these two errors should be in the same class.

it's probably worth making a new custom error under this class (it looks like XXC where XX = class is reserved for our own internal usage)


pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):

Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:

That required going down deep into the stack, and passing txn to the txnKVFetcher stuff, so i'm going to do the if right now (this is ~ the same as before).

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL stuff :lgtm:

Reviewed 1 of 18 files at r1, 3 of 3 files at r4, 1 of 2 files at r5, 1 of 11 files at r7, 8 of 10 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @rhu713, and @stevendanna)


-- commits, line 109 at r8 ([raw file](https://github.com/cockroachdb/cockroach/blob/ab4691aa62dc9c45312a6c0a8e984499d7ada643/-- commits#L109)):
nit: keep track of?

@otan otan force-pushed the polished-br branch 2 times, most recently from 8963b90 to 629f27d Compare August 17, 2021 04:52
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

This should be good to rebase now, as the kv stuff just landed.

Reviewed 1 of 11 files at r7, 9 of 10 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):

Previously, otan (Oliver Tan) wrote…

Clock is not frozen (how do i do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
unless i'm missing something and we should!

Well they pass a min_timestamp of now()-10s, so if now() keeps advancing with the clock but the closed timestamp freezes, they may be unable to be served locally. But maybe that's such a rare possibility that we're ok with how this is written.


pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):

Previously, otan (Oliver Tan) wrote…

i don't think we do, looks like the tests are alright though? i've seen lots of CLUSTER SETTING sets throughout code.
is there a wait-for-cluster-setting-propagate in tests i should use?

I'm not quite sure. I thought we had code to read the cluster setting from each other node in the cluster to make sure it's gone into effect. But now that I think of it, that may actually be part of SET CLUSTER SETTING itself. If this isn't flaky, I wouldn't worry about it.


pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):

Previously, otan (Oliver Tan) wrote…

it's probably worth making a new custom error under this class (it looks like XXC where XX = class is reserved for our own internal usage)

But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.


pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):

Previously, otan (Oliver Tan) wrote…

Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:

That required going down deep into the stack, and passing txn to the txnKVFetcher stuff, so i'm going to do the if right now (this is ~ the same as before).

👍 I guess I also missed the fact that we were capturing bsHeader, so it would be even harder to avoid the allocation on the bounded staleness path.

nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here? Or conversely, any reason not to pull out a constructor for the bounded staleness sendFunc? The asymmetry is a bit odd to me, given that there's only one use of each.

Copy link
Copy Markdown
Contributor Author

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)


pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.

as discussed offline, CDB is a new class for these errors.


pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):

nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here?

it's used in another place :)

pkg/sql/row/fetcher.go
583:		makeKVBatchFetcherDefaultSendFunc(txn),

but bounded staleness is only one.

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Aug 17, 2021

bors r=nvanbenschoten,rytaft

tftrs!

This commit introduces a common way of converting rows to data driven
output. This abstracts this away from backup which is using this in the
hopes that it can be used in other data driven tests I plan to
introduce.

Release note: None
otan added 2 commits August 17, 2021 16:43
Release note (sql change): Errors involving
MinTimestampBoundUnsatisfiableError during a bounded staleness read
now get a custom pgcode (54C01).
This commit plugs in the bounded staleness reads into the SQL Execution
layer, bridging the gap between the SQL syntax being parsed and the KV
layer of bounded staleness. When nearest_only=True, we currently give up
if the schema has been recently changed.

The `makeKVBatchFetcher` and `makeKVBatchFetcherWithSendFunc` have been
simplified to just one `makeKVBatchFetcher` method to avoid the multiple
permutations of makeKVBatchFetcher flying around.

A datadriven test of bounded staleness has been added which keeps tracks
of relevant events (e.g. whether the read was a follower or leaseholder
read). It is intended for it to keep track of e.g. transaction retries in
an upcoming commit.

Release note: None
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 17, 2021

Canceled.

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Aug 17, 2021

lost a commit during rebase, woops!

bors r=nvanbenschoten,rytaft

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 17, 2021

Build succeeded:

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

		// schema.
		if aost.Timestamp.Less(table.GetModificationTime()) {
			ts = table.GetModificationTime()

AFAICT, there is no corresponding code in the row execution engine.. I don't think we are setting the bounded staleness header at all in the row engine? CC @yuzefovich

@RaduBerinde
Copy link
Copy Markdown
Member


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, RaduBerinde wrote…

AFAICT, there is no corresponding code in the row execution engine.. I don't think we are setting the bounded staleness header at all in the row engine? CC @yuzefovich

Also, can't we make this timestamp adjustment at planning time? Feels strange that we do it in the bowels of execution code.

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Feb 3, 2022


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, RaduBerinde wrote…

Also, can't we make this timestamp adjustment at planning time? Feels strange that we do it in the bowels of execution code.

yeah this wasn't done for the row exec engine and was an assumed limitation.
ah didn't think about doing it at planning time! if we could that'd be better, this seemed like the best place at the time as it was the least hoops to jump, but open to it going elsewhere

@RaduBerinde
Copy link
Copy Markdown
Member


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, otan (Oliver Tan) wrote…

yeah this wasn't done for the row exec engine and was an assumed limitation.
ah didn't think about doing it at planning time! if we could that'd be better, this seemed like the best place at the time as it was the least hoops to jump, but open to it going elsewhere

I might play with it a bit. Is there a test for this case, where we have to adjust the timestamp because of the descriptor modification time?

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Feb 4, 2022


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, RaduBerinde wrote…

I might play with it a bit. Is there a test for this case, where we have to adjust the timestamp because of the descriptor modification time?

yes! checkout TestBoundedStalenessDataDriven

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Feb 4, 2022


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, otan (Oliver Tan) wrote…

yes! checkout TestBoundedStalenessDataDriven

# Set a super high closed bounded staleness target and execute a schema change.

@RaduBerinde
Copy link
Copy Markdown
Member


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, otan (Oliver Tan) wrote…

# Set a super high closed bounded staleness target and execute a schema change.

I assume that when using bounded staleness, we still read everything at the same timestamp correct? So it's not useful to set the MinTimestampBound to different values for different table scans, it should be equivalent to set it to the max value, is this right?

@RaduBerinde
Copy link
Copy Markdown
Member


pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):

Previously, RaduBerinde wrote…

I assume that when using bounded staleness, we still read everything at the same timestamp correct? So it's not useful to set the MinTimestampBound to different values for different table scans, it should be equivalent to set it to the max value, is this right?

I tried to do this instead:

--- a/pkg/sql/distsql_physical_planner.go
+++ b/pkg/sql/distsql_physical_planner.go
@@ -1260,6 +1260,15 @@ func (dsp *DistSQLPlanner) CheckInstanceHealthAndVersion(
 func (dsp *DistSQLPlanner) createTableReaders(
        planCtx *PlanningCtx, n *scanNode,
 ) (*PhysicalPlan, error) {
+       // If the descriptor's modification time is after the bounded staleness min bound,
+       // we have to increase the min bound.
+       // Otherwise, we would have table data which would not correspond to the correct
+       // schema.
+       if aost := planCtx.EvalContext().AsOfSystemTime; aost != nil && aost.BoundedStaleness {
+               if mTime := n.desc.GetModificationTime(); aost.Timestamp.Less(mTime) {
+                       aost.Timestamp = mTime
+               }
+       }

But this test fails to return any events: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row#L173

Any ideas? Shouldn't this be equivalent, given that we're scanning a single table?

The context here is that I'm trying to remove the table descriptor from TableReader (and replace it with a much smaller proto that describes the index encoding). I'd rather not include the modification timestamp in this smaller proto just for this usecase. But I don't know much about this feature so I'm not sure what I'm doing wrong.

@otan
Copy link
Copy Markdown
Contributor Author

otan commented Feb 4, 2022

On my phone and the weekend, but I'd take a look at the kv traces and see what changed. We only log events in the tests if

if sp.Operation == "dist sender send" && spans[sp.ParentSpanID].Operation == "colbatchscan".

so did something change?

Your assumption looks right to me but it's been a while and I won't have a chance to double check for a bit.

Can take a closer look next week.

@RaduBerinde
Copy link
Copy Markdown
Member

If you have a bit of time to look at it, it would be awesome. Otherwise I think I will move forward with my change and include the modification time in the new spec for now (with a TODO to remove later).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: plumb min_timestamp_bound to TableReader

5 participants