kvfollowerreadsccl: plug bounded staleness into the execution engine #68967
craig[bot] merged 3 commits into cockroachdb:master
Conversation
2eb06c8 to a6d53c1
nvb
left a comment
This all looks very good! I imagine #68969 will be the more contentious part of all of this.
Reviewed 2 of 2 files at r5, 10 of 10 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @otan, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):
func (bse *boundedStalenessEvents) String() string {
	var sb strings.Builder
	sb.WriteString(fmt.Sprintf("events (%d found):\n", len(bse.events)))
fmt.Fprintf(&sb, ... here and below.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 14 at r6 (raw file):
----
# If we try read a timestamp that is impossible to satisfy with a follower read,
"try to read"
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):
# We always do bounded staleness reads from node_idx 2, as node_idx 0 in a
# TestCluster is always the leaseholder.
query idx=2 show-events
Do we ever run `query` without `show-events`? If not, should we just bake in the directive into `query`?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):
----
1

events (1 found):
Is the idea that future improvements to bounded staleness reads will allow for multiple events?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):
# Note we have to wait until a match here, in case a follower read reads an
# older version of the data.
query idx=2 show-events wait-until-follower-read wait-until-match
Do we need wait-until-follower-read if we're also including wait-until-match? Doesn't the latter encompass the role of the former?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
query idx=2 show-events
SELECT * FROM t AS OF SYSTEM TIME with_min_timestamp(now() - '10s') WHERE pk = 1
Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly now()-10s (and no further) in the wait-until-match query and then fall progressively further behind?
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
# Set a super high closed bounded staleness target and execute a schema change.
exec
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '1hr';
I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
// Section: Class 54 - Program Limit Exceeded (Cockroach extension)
// UnsatisfiableBoundedStaleness signals that the bounded staleness query
Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw BatchTimestampBeforeGCError with no PG code.
➜ ./cockroach demo
demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1
demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0
I ask because I imagine these two errors should be in the same class.
pkg/sql/row/errors.go, line 93 at r5 (raw file):
// key-value fetch to a user friendly SQL error.
func ConvertFetchError(ctx context.Context, descForKey KeyToDescTranslator, err error) error {
	var wiErr *roachpb.WriteIntentError
This is kind of an interesting pattern that we don't have much precedent for, because of how new `errors.As` is. I wonder if it should look more like:
var wiErr *roachpb.WriteIntentError
var bsErr *roachpb.MinTimestampBoundUnsatisfiableError
switch {
case errors.As(err, &wiErr):
	key := wiErr.Intents[0].Key
	desc, _ := descForKey.KeyToDesc(key)
	return NewLockNotAvailableError(ctx, desc, key, wiErr.Reason)
case errors.As(err, &bsErr):
	return pgerror.WithCandidateCode(
		err,
		pgcode.UnsatisfiableBoundedStaleness,
	)
}
return err

Or even better, to batch the heap allocations:
var errs struct {
	wi *roachpb.WriteIntentError
	bs *roachpb.MinTimestampBoundUnsatisfiableError
}
switch {
case errors.As(err, &errs.wi):
	key := errs.wi.Intents[0].Key
	desc, _ := descForKey.KeyToDesc(key)
	return NewLockNotAvailableError(ctx, desc, key, errs.wi.Reason)
case errors.As(err, &errs.bs):
	return pgerror.WithCandidateCode(
		err,
		pgcode.UnsatisfiableBoundedStaleness,
	)
}
return err

pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
	forceProductionKVBatchSize bool,
) (*KVFetcher, error) {
	sendFn := makeKVBatchFetcherDefaultSendFunc(txn)
Consider only calling makeKVBatchFetcherDefaultSendFunc on the else branch to avoid an unnecessary heap allocation on the bounded staleness path.
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
func(ctx context.Context, txn *kv.Txn, ba roachpb.BatchRequest) (*roachpb.BatchResponse, error)
otan
left a comment
This all looks very good! I imagine #68969 will be the more contentious part of all of this.
Yeah, there's a reason why I made it a separate PR :D
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/boundedstaleness_test.go, line 129 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
fmt.Fprintf(&sb, ... here and below.
have to handle the error or the linter cries (yuck), it's a test so i prefer to leave it as is.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 18 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we ever run `query` without `show-events`? If not, should we just bake in the directive into `query`?
Done.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 22 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the idea that future improvements to bounded staleness reads will allow for multiple events?
Yep!
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 64 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Do we need `wait-until-follower-read` if we're also including `wait-until-match`? Doesn't the latter encompass the role of the former?
We want `--rewrite` to not `wait-until-match`, but still `wait-until-follower-read` so it rewrites with correct output. I've added a comment in the test code.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Is the clock frozen in these tests? If not, isn't it possible for the follower to catch up to exactly `now()-10s` (and no further) in the `wait-until-match` query and then fall progressively further behind?
Clock is not frozen (how do I do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
Unless I'm missing something and we should!
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm a bit concerned about the async nature of cluster setting propagation here. Could this cause flakiness? How do we deal with this in standard SQL logic tests?
I don't think we do, but the tests look alright? I've seen lots of CLUSTER SETTING sets throughout the code.
Is there a wait-for-cluster-setting-propagate helper in tests I should use?
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Out of curiosity, which PG code do you think we should use for when a query runs below the MVCC GC timestamp? Right now, it returns a raw `BatchTimestampBeforeGCError` with no PG code.

➜ ./cockroach demo
demo@127.0.0.1:26257/movr> alter table rides configure zone using gc.ttlseconds = 1;
CONFIGURE ZONE 1
demo@127.0.0.1:26257/movr> select * from rides as of system time '-2s';
ERROR: batch timestamp 1629140312.708334000,0 must be after replica GC threshold 1629140313.709146000,0

I ask because I imagine these two errors should be in the same class.
It's probably worth making a new custom error under this class (it looks like `XXC`, where XX = class, is reserved for our own internal usage).
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
That required going down deep into the stack and passing `txn` to the `txnKVFetcher` stuff, so I'm going to do the if right now (this is ~ the same as before).
rytaft
left a comment
Reviewed 1 of 18 files at r1, 3 of 3 files at r4, 1 of 2 files at r5, 1 of 11 files at r7, 8 of 10 files at r8, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten, @otan, @rhu713, and @stevendanna)
-- commits, line 109 at r8 (raw file):
nit: keep track of?
8963b90 to 629f27d
nvb
left a comment
This should be good to rebase now, as the kv stuff just landed.
Reviewed 1 of 11 files at r7, 9 of 10 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 72 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
Clock is not frozen (how do I do that?), but the rest of these tests don't really care if we don't catch up past now()-10s.
Unless I'm missing something and we should!
Well they pass a min_timestamp of now()-10s, so if now() keeps advancing with the clock but the closed timestamp freezes, they may be unable to be served locally. But maybe that's such a rare possibility that we're ok with how this is written.
pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row, line 108 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
I don't think we do, but the tests look alright? I've seen lots of CLUSTER SETTING sets throughout the code.
Is there a wait-for-cluster-setting-propagate helper in tests I should use?
I'm not quite sure. I thought we had code to read the cluster setting from each other node in the cluster to make sure it's gone into effect. But now that I think of it, that may actually be part of SET CLUSTER SETTING itself. If this isn't flaky, I wouldn't worry about it.
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, otan (Oliver Tan) wrote…
It's probably worth making a new custom error under this class (it looks like `XXC`, where XX = class, is reserved for our own internal usage).
But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
Previously, otan (Oliver Tan) wrote…
Or better, continue to also accept the txn in makeKVBatchFetcher so that you can pass static function references with the following signature instead of having to create a closure here:
That required going down deep into the stack and passing `txn` to the `txnKVFetcher` stuff, so I'm going to do the if right now (this is ~ the same as before).
👍 I guess I also missed the fact that we were capturing bsHeader, so it would be even harder to avoid the allocation on the bounded staleness path.
nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here? Or conversely, any reason not to pull out a constructor for the bounded staleness sendFunc? The asymmetry is a bit odd to me, given that there's only one use of each.
otan
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @nvanbenschoten, @otan, @rhu713, @rytaft, and @stevendanna)
pkg/sql/pgwire/pgcode/codes.go, line 362 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
But I guess the question I'm asking is whether you think this is the right class for that error code as well. The reason why I ask is that I'm not entirely sure that Program Limit Exceeded is the right class for UnsatisfiableBoundedStaleness (though I don't see a better one), but I am pretty sure that UnsatisfiableBoundedStaleness and BatchTimestampBeforeGCError's eventual code should be in the same class. So does Program Limit Exceeded also sound right to you for the GC timestamp error code? If so, I'm happy with this.
as discussed offline, CDB is a new class for these errors.
pkg/sql/row/kv_fetcher.go, line 60 at r6 (raw file):
nit: any reason to pull makeKVBatchFetcherDefaultSendFunc out instead of inlining it here?
It's used in another place :)

pkg/sql/row/fetcher.go
583: makeKVBatchFetcherDefaultSendFunc(txn),

But the bounded staleness one is only used once.
bors r=nvanbenschoten,rytaft tftrs!
This commit introduces a common way of converting rows to data-driven output. This abstracts the logic away from backup, which currently uses it, in the hopes that it can be used in other data-driven tests I plan to introduce.

Release note: None
Release note (sql change): Errors involving MinTimestampBoundUnsatisfiableError during a bounded staleness read now get a custom pgcode (54C01).
This commit plugs bounded staleness reads into the SQL execution layer, bridging the gap between the SQL syntax being parsed and the KV layer's bounded staleness support. When nearest_only=True, we currently give up if the schema has been recently changed.

The `makeKVBatchFetcher` and `makeKVBatchFetcherWithSendFunc` constructors have been simplified to just one `makeKVBatchFetcher` method to avoid the multiple permutations of makeKVBatchFetcher flying around.

A datadriven test of bounded staleness has been added which keeps track of relevant events (e.g. whether the read was a follower or leaseholder read). It is intended to also keep track of e.g. transaction retries in an upcoming commit.

Release note: None
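The schema-change guard mentioned in the commit (and debated later in the thread for the colfetcher) boils down to clamping the read's minimum timestamp bound to the table descriptor's modification time. A minimal sketch, with a hypothetical `timestamp` type standing in for `hlc.Timestamp` and `clampMinTimestampBound` standing in for the real adjustment logic:

```go
package main

import "fmt"

// timestamp is a hypothetical stand-in for hlc.Timestamp.
type timestamp struct{ wallNanos int64 }

// Less reports whether t is strictly earlier than o.
func (t timestamp) Less(o timestamp) bool { return t.wallNanos < o.wallNanos }

// clampMinTimestampBound raises the bounded staleness minimum bound to
// the descriptor's modification time when necessary: reading data from
// before the schema change would not correspond to the schema used to
// decode it.
func clampMinTimestampBound(minBound, modificationTime timestamp) timestamp {
	if minBound.Less(modificationTime) {
		return modificationTime
	}
	return minBound
}

func main() {
	// Schema changed after the requested min bound: bound is raised.
	fmt.Println(clampMinTimestampBound(timestamp{100}, timestamp{200}).wallNanos)
	// Schema older than the requested min bound: bound is unchanged.
	fmt.Println(clampMinTimestampBound(timestamp{300}, timestamp{200}).wallNanos)
}
```

Raising the bound narrows the window of timestamps the server may choose, which is why a recent schema change can make a nearest_only read unsatisfiable.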
Canceled.
Lost a commit during rebase, whoops! bors r=nvanbenschoten,rytaft
Build succeeded:
RaduBerinde
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
// schema.
if aost.Timestamp.Less(table.GetModificationTime()) {
	ts = table.GetModificationTime()
AFAICT, there is no corresponding code in the row execution engine. I don't think we are setting the bounded staleness header at all in the row engine? CC @yuzefovich
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
Also, can't we make this timestamp adjustment at planning time? Feels strange that we do it in the bowels of execution code.
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
Yeah, this wasn't done for the row exec engine and was an assumed limitation.
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, otan (Oliver Tan) wrote…
I might play with it a bit. Is there a test for this case, where we have to adjust the timestamp because of the descriptor modification time?
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
yes! check out
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, otan (Oliver Tan) wrote…
I assume that when using bounded staleness, we still read everything at the same timestamp, correct? So it's not useful to set the MinTimestampBound to different values for different table scans; it should be equivalent to set it to the max value, is this right?
pkg/sql/colfetcher/colbatch_scan.go, line 237 at r17 (raw file):
Previously, RaduBerinde wrote…
I tried to do this instead:

--- a/pkg/sql/distsql_physical_planner.go
+++ b/pkg/sql/distsql_physical_planner.go
@@ -1260,6 +1260,15 @@ func (dsp *DistSQLPlanner) CheckInstanceHealthAndVersion(
 func (dsp *DistSQLPlanner) createTableReaders(
 	planCtx *PlanningCtx, n *scanNode,
 ) (*PhysicalPlan, error) {
+	// If the descriptor's modification time is after the bounded staleness min bound,
+	// we have to increase the min bound.
+	// Otherwise, we would have table data which would not correspond to the correct
+	// schema.
+	if aost := planCtx.EvalContext().AsOfSystemTime; aost != nil && aost.BoundedStaleness {
+		if mTime := n.desc.GetModificationTime(); aost.Timestamp.Less(mTime) {
+			aost.Timestamp = mTime
+		}
+	}

But this test fails to return any events: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/kvccl/kvfollowerreadsccl/testdata/boundedstaleness/single_row#L173

Any ideas? Shouldn't this be equivalent, given that we're scanning a single table?

The context here is that I'm trying to remove the table descriptor from TableReader (and replace it with a much smaller proto that describes the index encoding). I'd rather not include the modification timestamp in this smaller proto just for this use case. But I don't know much about this feature, so I'm not sure what I'm doing wrong.
I'm on my phone and it's the weekend, but I'd take a look at the kv traces and see what changed. We only log events in the tests if
so did something change? Your assumption looks right to me, but it's been a while and I won't have a chance to double check for a bit. Can take a closer look next week.
If you have a bit of time to look at it, it would be awesome. Otherwise I think I will move forward with my change and include the modification time in the new spec for now (with a TODO to remove later).
See individual commits for details.
Resolves #67556.