sql: dynamically detect a query's home region with enforce_home_region#97827
sql: dynamically detect a query's home region with enforce_home_region#97827craig[bot] merged 3 commits intocockroachdb:masterfrom
Conversation
yuzefovich
left a comment
There was a problem hiding this comment.
Nice stuff! I have some questions and nits.
Reviewed 21 of 21 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)
-- commits line 13 at r1:
Just to make sure I understand (let's consider an example of LOS where we want to read a single row that actually lives in us-west but we connected to us-east):
- first, we use the actual locality of the gateway node (
us-east) and perform a scan - that scan doesn't find the row, then we know we'll error out for the query as a whole; however, we want to extend the error message with the region where that single row lives
- to do that we will modify the eval context to make the optimizer think that
us-westis the local region, so it'll construct a scan forus-west - in order to not do cross-region read, we will use AOST with follower read timestamp.
Now, I'm wondering are we guaranteed to always have a follower replica of all remote regions in the local region? I.e. I'm wondering whether it's possible that we still will do a cross-region request to find out the home region for the query.
-- commits line 28 at r1:
nit: this probably deserves a release note.
pkg/sql/conn_executor.go line 2753 at r1 (raw file):
func errIsRetriable(err error) bool { isDynamicQueryHasNoHomeRegionError := row.IsDynamicQueryHasNoHomeRegionError(err)
nit: this variable declaration seems redundant given it's only used once and the logical ORs below inline the function calls for other error types.
pkg/sql/conn_executor.go line 3062 at r1 (raw file):
// data which may be after the schema change when we retry. var minTSErr *kvpb.MinTimestampBoundUnsatisfiableError p.EvalContext().Locality = p.EvalContext().OriginalLocality
nit: perhaps add a quick comment for why we do this restore of the locality here.
pkg/sql/conn_executor.go line 3068 at r1 (raw file):
} else if row.IsDynamicQueryHasNoHomeRegionError(autoRetryReason) { if int(ex.state.mu.autoRetryCounter) <= len(p.EvalContext().RemoteRegions) { p.EvalContext().Locality =
Some commentary here would be helpful. In particular, it'd be good to mention that this Locality field will only influence which region is considered "local" by the optimizer, but the query will not actually run on a node from that remote region as its gateway.
pkg/sql/conn_executor_exec.go line 752 at r1 (raw file):
var rollbackSP *tree.RollbackToSavepoint var r *tree.ReleaseSavepoint enforceHomeRegion := p.IsANSIDML() && p.EvalContext().SessionData().EnforceHomeRegion
nit: I think we have a similar conditional in several places now, perhaps we could extract a small helper?
pkg/sql/conn_executor_exec.go line 768 at r1 (raw file):
} r = &tree.ReleaseSavepoint{Savepoint: "enforce_home_region_sp"}
Do we need to care about a user also creating a savepoint with this name?
pkg/sql/conn_executor_exec.go line 807 at r1 (raw file):
// A retryable "query has no home region" error has occurred. // Roll back to the internal savepoint in preparation for the next // planning and execution of this query with a different gateway region.
nit: perhaps add something like with a different gateway region (as considered by the optimizer).
pkg/sql/conn_executor_exec.go line 816 at r1 (raw file):
if etr == nil || eventPayload != nil { err = errors.AssertionFailedf( "unable to roll back internal savepoint for enforce_home_region",
nit: s/roll back/roll back to/ here and below.
pkg/sql/conn_executor_exec.go line 824 at r1 (raw file):
} } else { internalError := errors.AssertionFailedf(
Why didn't we override err here?
pkg/sql/conn_executor_exec.go line 836 at r1 (raw file):
// non-retryable form. errorMessage := err.Error() if !strings.HasPrefix(errorMessage, "Query is not running in its home region") {
nit: perhaps this constant string should be extracted somewhere to a place where we're constructing the error?
pkg/sql/conn_executor_exec.go line 1412 at r1 (raw file):
if res.Err() == nil && err == nil { autoRetryReason := ex.state.mu.autoRetryReason if row.IsDynamicQueryHasNoHomeRegionError(ex.state.mu.autoRetryReason) {
nit: we could use autoRetryReason variable here.
pkg/sql/conn_executor_exec.go line 1423 at r1 (raw file):
) res.SetError(err) return nil
Why are we returning nil here and below?
pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):
return nil } // If for some reason we're not running the same query as before, report
How can we run a different query than the one we're retrying with a different gateway region?
pkg/sql/colexec/BUILD.bazel line 67 at r1 (raw file):
"//pkg/sql/execinfrapb", "//pkg/sql/memsize", "//pkg/sql/row",
nit: this dependency of colexec on row seems a bit off. Perhaps we should move the new code from row/errors.go into a different package, say execinfra - that package seems more suitable for that.
pkg/sql/colexecerror/error.go line 164 at r1 (raw file):
// NotInternalError will be returned to the client not as an // "internal error" and without the stack trace. type NotInternalError struct {
Let's not change the name here (since this error is sent across the wire, we'd need to register a type migration if we did, otherwise sentry reports can be broken). Instead, let's introduce some unwrapper exported methods that return the cause in colexecerror and use those in row package.
pkg/sql/opt/optbuilder/select.go line 582 at r1 (raw file):
// Populate the remote regions touched by the multiregion database used in // this query. If a query dynamically errors out as having no home region, // these
nit: incomplete comment.
pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):
if regionsNames, ok := tabMeta.GetRegionsInDatabase(b.ctx, b.evalCtx.Planner); ok { if gatewayRegion, ok := b.evalCtx.Locality.Find("region"); ok { b.evalCtx.RemoteRegions = make(catpb.RegionNames, 0, len(regionsNames))
nit: len(regionNames)-1 since we don't include the gateway.
pkg/sql/sem/eval/context.go line 99 at r1 (raw file):
Locality roachpb.Locality OriginalLocality roachpb.Locality
nit: add a quick comment for why we need this and how it might differ from Locality field. Perhaps also adjust the comment on that field too.
pkg/sql/sem/eval/context.go line 258 at r1 (raw file):
// RemoteRegions contains the slice of remote regions in a multiregion // database which owns a table accessed by the current SQL request.
nit: mention that this slice is only populated during the optbuilding stage.
pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):
retry statement error pq: Query is not running in its home region\. Try running the query from region 'us-east-1'\. SELECT * FROM t1 WHERE a=1;
Should we also add some tests that use tracing so that we could ensure that the internal behavior is as expected?
d078e02 to
e05fed9
Compare
msirek
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Just to make sure I understand (let's consider an example of LOS where we want to read a single row that actually lives in
us-westbut we connected tous-east):
- first, we use the actual locality of the gateway node (
us-east) and perform a scan- that scan doesn't find the row, then we know we'll error out for the query as a whole; however, we want to extend the error message with the region where that single row lives
- to do that we will modify the eval context to make the optimizer think that
us-westis the local region, so it'll construct a scan forus-west- in order to not do cross-region read, we will use AOST with follower read timestamp.
Now, I'm wondering are we guaranteed to always have a follower replica of all remote regions in the local region? I.e. I'm wondering whether it's possible that we still will do a cross-region request to find out the home region for the query.
Yes, that summary is correct. Whether we'd always have a follower replica in the local region, I don't know, maybe not. It's a best effort attempt to keep latency during an error condition low. Is there something that works better than follower reads?
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this probably deserves a release note.
added:
Release note (sql change): The enforce_home_region session setting was
extended to dynamically detect and report a query's home region, based on
the locality of the queried rows, if different from the gateway region.
pkg/sql/conn_executor.go line 2753 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this variable declaration seems redundant given it's only used once and the logical ORs below inline the function calls for other error types.
Removed the variable
Code quote:
isDynamicQueryHasNoHomeRegionError := row.IsDynamicQueryHasNoHomeRegionError(err)pkg/sql/conn_executor.go line 3062 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps add a quick comment for why we do this restore of the locality here.
Added:
// Make sure the default locality specifies the actual gateway region at the
// start of query compilation. It could have been overridden to a remote
// region when the enforce_home_region session setting is true.
I've gone ahead and added some code to also restore the locality right after we determine which non-retryable error to use (see setErrorAndRestoreLocality).
pkg/sql/conn_executor.go line 3068 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Some commentary here would be helpful. In particular, it'd be good to mention that this Locality field will only influence which region is considered "local" by the optimizer, but the query will not actually run on a node from that remote region as its gateway.
Added:
// Set a fake gateway region for use by the optimizer to inform its
// decision on which region to access first in locality-optimized scan
// and join operations. This setting does not affect the distsql planner,
// and local plans will continue to be run from the actual gateway region.
pkg/sql/conn_executor_exec.go line 752 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think we have a similar conditional in several places now, perhaps we could extract a small helper?
Added planner.EnforceHomeRegion.
pkg/sql/conn_executor_exec.go line 768 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Do we need to care about a user also creating a savepoint with this name?
Probably so, to be safe, even though a collision is unlikely. Added some unprintable characters to it:
var b strings.Builder
b.WriteString("enforce_home_region_sp")
// Add some unprintable ASCII characters to the name of the savepoint to
// decrease the likelihood of collision with a user-created savepoint.
b.WriteRune(rune(0x11))
b.WriteRune(rune(0x12))
b.WriteRune(rune(0x13))
enforceHomeRegionSavepointName := tree.Name(b.String())
pkg/sql/conn_executor_exec.go line 807 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps add something like
with a different gateway region (as considered by the optimizer).
Done
pkg/sql/conn_executor_exec.go line 816 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
s/roll back/roll back to/here and below.
Done
Code quote:
unable to roll back internal savepointpkg/sql/conn_executor_exec.go line 824 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why didn't we override
errhere?
Yeah, no need for a new variable. Fixed it.
Code quote:
internalError := errors.AssertionFailedf(pkg/sql/conn_executor_exec.go line 836 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps this constant string should be extracted somewhere to a place where we're constructing the error?
Added execinfra.QueryNotRunningInHomeRegionMessagePrefix.
Code quote:
Query is not running in its home regionpkg/sql/conn_executor_exec.go line 1412 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: we could use
autoRetryReasonvariable here.
Fixed.
pkg/sql/conn_executor_exec.go line 1423 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why are we returning
nilhere and below?
So that when we return it'll trigger a call to makeErrEvent.
pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
How can we run a different query than the one we're retrying with a different gateway region?
Maybe with more than one statement on a line, e.g. 'USE db; SELECT * FROM t1;', it could rewind to the USE statement? For example, if I try this test:
query III nodeidx=8
SET enforce_home_region = true;
USE multi_region_test_db;
SELECT * FROM t1 WHERE a=2;
----
2 1 1
It rewinds to the USE statement (maybe the query command concatenates and sends all SQLs in a single request?), which we shouldn't retry, so we get the generic error message:
Query has no home region. Try using a lower LIMIT value or running the query from a different region.
I'm not sure how important it is to handle this case, since I imagine typically queries are sent one-at-a-time. But we still need to return the proper generic error message here instead of trying to run the "USE" statement with AOST.
pkg/sql/opt/optbuilder/select.go line 582 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: incomplete comment.
Updated it to:
// Populate the remote regions touched by the multiregion database used in
// this query. If a query dynamically errors out as having no home region,
// the query will be replanned with each of the remote regions,
// one-at-a-time, with AOST follower_read_timestamp(). If one of these
// retries doesn't error out, that region will be reported to the user as
// the query's home region.
pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
len(regionNames)-1since we don't include the gateway.
The gateway region may not be one of the regions in the multiregion db.
pkg/sql/sem/eval/context.go line 99 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: add a quick comment for why we need this and how it might differ from
Localityfield. Perhaps also adjust the comment on that field too.
Done
Code quote:
OriginalLocality roachpb.Localitypkg/sql/sem/eval/context.go line 258 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: mention that this slice is only populated during the optbuilding stage.
Done
pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Should we also add some tests that use tracing so that we could ensure that the internal behavior is as expected?
OK, I've added a couple of tracing tests to check all KV batches are sent to the local node.
pkg/sql/colexec/BUILD.bazel line 67 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this dependency of
colexeconrowseems a bit off. Perhaps we should move the new code fromrow/errors.gointo a different package, sayexecinfra- that package seems more suitable for that.
Done
pkg/sql/colexecerror/error.go line 164 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Let's not change the name here (since this error is sent across the wire, we'd need to register a type migration if we did, otherwise sentry reports can be broken). Instead, let's introduce some unwrapper exported methods that return the cause in
colexecerrorand use those inrowpackage.
I reverted the name and just fixed IsDynamicQueryHasNoHomeRegionError and MaybeGetNonRetryableDynamicQueryHasNoHomeRegionError to unwrap anything that's not a dynamicQueryHasNoHomeRegionError. Not sure if you had something better in mind.
|
@taroface Could you perhaps look at the form of the error message in this PR from a docs perspective and see if there are any issues or suggestions for change? : Is there anyone else from the docs team you think might want to look at this? |
Thanks for asking -- only thought is to include the exact syntax for specifying the region (I'm not sure what it is?). EDIT: Also, in this context, is it worth reminding that |
msirek
left a comment
There was a problem hiding this comment.
only thought is to include the exact syntax for specifying the region (I'm not sure what it is?).
The only way I know to do it is for the application to disconnect and then reconnect with a connection string specifying a node in a different region. The application writer would have to know which nodes are in which regions. There is info about enforce_home_region on this page: https://www.cockroachlabs.com/docs/stable/set-vars.html, but it doesn't mention how to handle the Query is not running in its home region error. If it did, I could just add a link to that page in the error message. Or do you have other thoughts on rewording this? I guess the question is, do we want the error message to be really verbose, or do we want to redirect the user to docs for additional info.r
Also, in this context, is it worth reminding that enforce_home_region is on, and explaining why, or offering the command to disable?
Not sure. The user would have had to manually turn on this setting to hit the error message, and probably knows what it's used for before using it. If the error message included a line, "Please see enforce_home_region at https://www.cockroachlabs.com/docs/stable/set-vars.html" that could provide more info. What do you think?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, @taroface, and @yuzefovich)
|
Per our convo offline, opened cockroachdb/docs#16404 to enhance the doc to which this message can be linked. Is there a variable that can specify the correct doc version in the URL? |
msirek
left a comment
There was a problem hiding this comment.
Is there a variable that can specify the correct doc version in the URL?
Do you mean, should the error message print a version-specific URL? I don't know, it may be good to just report the stable URL in case we want to include small tweaks to the wording. The new text you just added is applicable to v22.2 as well. What do you think?
I'm planning to suffix each error message with the following?
See https://www.cockroachlabs.com/docs/stable/cost-based-optimizer.html#control-whether-queries-are-limited-to-a-single-region.
Any recommended tweaks, or is this good?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, @taroface, and @yuzefovich)
rytaft
left a comment
There was a problem hiding this comment.
Very impressive turnaround! Sorry I still haven't had a chance to look at this in detail, but I added one high-level comment below. I'm OOO tomorrow so no need to wait for my review if you get sign-off before I come back.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)
Previously, msirek (Mark Sirek) wrote…
Yes, that summary is correct. Whether we'd always have a follower replica in the local region, I don't know, maybe not. It's a best effort attempt to keep latency during an error condition low. Is there something that works better than follower reads?
Follower reads seems like a reasonable option, but it would be better if we could prevent it from visiting a remote region if there is no local replica (or the local replica is not sufficiently up-to-date). We don't yet support the option to error instead of visiting a remote region with follower reads, but there is an issue open for it: #97512.
In a discussion with @andy-kimball today, he mentioned that whatever solution we choose, we should make sure that it's not worse than what a user could do on their own using GLOBAL tables. This approach seems slightly worse to me, since a PLACEMENT RESTRICTED table would not have a local replica in every region and therefore would cause a remote read every time a user tried to access it from outside the home region.
Unless we can come up with a better solution, I would suggest that we put this logic behind a cluster setting that is defaulted to false in 23.1, and then set it to true once we address #97512.
|
I'd also be worried about re-running the query for every region, if there's any chance the follower reads might exit the current region (e.g. the closed timestamp is delayed) and cause N calls to other regions. As Becca said, we only want to enhance the error message if it's as good or better as what the user can do themself (e.g. querying a global |
msirek
left a comment
There was a problem hiding this comment.
OK. I can't think of any other solutions, so I guess I'll add a cluster setting as suggested for this to be disabled by default until #97512 is fixed.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
Added cluster setting: sql.multiregion.enforce_home_region_follower_reads.enabled
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)
msirek
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)
Unless we can come up with a better solution, I would suggest that we put this logic behind a cluster setting that is defaulted to false in 23.1, and then set it to true once we address #97512.
Added cluster setting: sql.multiregion.enforce_home_region_follower_reads.enabled
|
@msirek Linking to stable sounds good. My suggestion for the link copy: I'm not sure if having the period after the link will affect the link. If you think it won't, then feel free to add it! |
cbb8a69 to
53b219b
Compare
msirek
left a comment
There was a problem hiding this comment.
For more information, see https://www.cockroachlabs.com/docs/stable/cost-based-optimizer.html#control-whether-queries-are-limited-to-a-single-region
@taroface I've updated the sentence and link as suggested, with no trailing period.
In the multiregion meeting today it was decided to make this a preview feature, so I've marked up the Release notes to indicate so. Let me know if you think more is needed to calls this out for the docs team. Instead of a cluster setting, this feature will now be controlled via a session setting, enforce_home_region_follower_reads_enabled, which defaults to false.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, and @yuzefovich)
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 2 of 6 files at r2, 11 of 25 files at r3, 1 of 1 files at r4, 22 of 22 files at r10, 2 of 14 files at r12, 10 of 10 files at r14, 12 of 12 files at r15, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @msirek)
-- commits line 53 at r15:
nit: "setting setting".
pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
Maybe with more than one statement on a line, e.g. 'USE db; SELECT * FROM t1;', it could rewind to the
USEstatement? For example, if I try this test:query III nodeidx=8 SET enforce_home_region = true; USE multi_region_test_db; SELECT * FROM t1 WHERE a=2; ---- 2 1 1It rewinds to the USE statement (maybe the query command concatenates and sends all SQLs in a single request?), which we shouldn't retry, so we get the generic error message:
Query has no home region. Try using a lower LIMIT value or running the query from a different region.I'm not sure how important it is to handle this case, since I imagine typically queries are sent one-at-a-time. But we still need to return the proper generic error message here instead of trying to run the "USE" statement with AOST.
I see, thanks for the context.
pkg/sql/plan_opt.go line 371 at r10 (raw file):
// caching should not be done. opc.allowMemoReuse = !p.Descriptors().HasUncommittedTables() && len(p.EvalContext().RemoteRegions) == 0 opc.useCache = opc.allowMemoReuse && queryCacheEnabled.Get(&p.execCfg.Settings.SV) && len(p.EvalContext().RemoteRegions) == 0
nit: the last part of the conditional is already included in allowMemoReuse computation, so it is redundant here. However, it might be safer to explicitly add it, so up to you.
pkg/sql/colflow/vectorized_flow.go line 972 at r15 (raw file):
err = input.EnforceHomeRegionError.ErrorDetail(ctx) if flowCtx.EvalCtx.SessionData().EnforceHomeRegionFollowerReadsEnabled { err = execinfra.NewDynamicQueryHasNoHomeRegionError(err)
nit: it might be cleaner and safer to pass the session data object into the NewDynamicQueryHasNoHomeRegionError directly, and that method would only return retryable error if the corresponding boolean is true. Up to you though.
pkg/sql/execinfra/errors.go line 52 at r10 (raw file):
return false } // If the error is wrapped in a notInternalError, skip to the root cause.
nit: notInternalError is a leftover.
pkg/sql/execinfra/errors.go line 75 at r10 (raw file):
return nil } // If the error is wrapped in a notInternalError, skip to the root cause.
nit: ditto.
pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
The gateway region may not be one of the regions in the multiregion db.
Interesting, nvm then.
pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
OK, I've added a couple of tracing tests to check all KV batches are sent to the local node.
Nice!
pkg/sql/colexec/serial_unordered_synchronizer.go line 101 at r15 (raw file):
s.curSerialInputIdx++ if s.serialInputIdxExclusiveUpperBound > 0 && s.curSerialInputIdx >= int(s.serialInputIdxExclusiveUpperBound) { colexecerror.ExpectedError(s.exceedsInputIdxExclusiveUpperBoundError)
Was this a bug that we previously created a retryable error here? If so, it should probably be fixed in the first commit.
pkg/sql/colexecerror/error.go line 164 at r1 (raw file):
Previously, msirek (Mark Sirek) wrote…
I reverted the name and just fixed
IsDynamicQueryHasNoHomeRegionErrorandMaybeGetNonRetryableDynamicQueryHasNoHomeRegionErrorto unwrap anything that's not adynamicQueryHasNoHomeRegionError. Not sure if you had something better in mind.
I was suggesting a different approach, but what you did seems even better, thanks!
enforce_home_region phase 3 ---- This extends the `enforce_home_region` session setting to not error out right away when executing a locality-optimized search or join operation, and the operation must read rows from a remote region. Instead, it will retry the transaction, but with the local region, as marked in `evalCtx`, updated to indicate one of the remote regions in the cluster. This causes the locality-optimized Scan or Join operation to read from that remote region first. A follower_read_timestamp() is used when re-running the query with AOST, so no remote reads actually occur. If the query succeeds, the current fake local region is reported back to the user in an error, for example: ``` Query is not running in its home region. Try running the query from region 'us-east-1'. ``` All of the remote regions in the cluster are tried until the home region is found, or all remote regions are exhausted, in which case the original "Query has no home region" error is returned. This preview feature is only supported for SELECT queries. Epic: CRDB-18645 Fixes: cockroachdb#83819 Release note (sql change): The enforce_home_region session setting was extended with a new optional preview feature and session setting, which is disabled by default, to dynamically detect and report the home region for SELECT queries based on the locality of the queried rows, if different from the gateway region.
This commit adds a web URL to the text of "Query has no home region" and "Query is not running in its home region" errors, with more information about the errors. Epic: CRDB-18645 Fixes: cockroachdb#86178 Release note (sql change): This adds a URL to errors related to the enforce_home_region session setting, providing additional information about the error.
This commit adds session setting `enforce_home_region_follower_reads_enabled` which defaults to false. If enabled, it allows the `enforce_home_region` setting to perform AOST follower reads to determine a query's home region, if any. Epic: CRDB-18645 Informs: cockroachdb#83819 Release note (sql change): A new session setting was added as a preview feature to allow errors triggered by session setting enforce_home_region to perform AS OF SYSTEM TIME follower_read_timestamp() reads in order to find and report a query's home region. The session setting name is enforce_home_region_follower_reads_enabled.
msirek
left a comment
There was a problem hiding this comment.
TFTRs!
bors r+
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: "setting setting".
Thanks, fixed.
pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I see, thanks for the context.
Sure
pkg/sql/plan_opt.go line 371 at r10 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: the last part of the conditional is already included in
allowMemoReusecomputation, so it is redundant here. However, it might be safer to explicitly add it, so up to you.
Removed as suggested.
Code quote:
opc.useCache = opc.allowMemoReuse && queryCacheEnabled.Get(&p.execCfg.Settings.SV) && len(p.EvalContext(pkg/sql/colflow/vectorized_flow.go line 972 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: it might be cleaner and safer to pass the session data object into the
NewDynamicQueryHasNoHomeRegionErrordirectly, and that method would only return retryable error if the corresponding boolean istrue. Up to you though.
Hmm, I'm not sure. It's only a single conditional and is easy to read inline with the rest of the logic this way. Also, the function name would have to change to something unwieldly like MaybeWrapErrorWithDynamicQueryHasNoHomeRegionError. Also, the session data is not always available, like in performLookup():
if jr.allowEnforceHomeRegionFollowerReads {
err = execinfra.NewDynamicQueryHasNoHomeRegionError(err)
}I might just keep this as-is.
pkg/sql/execinfra/errors.go line 52 at r10 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit:
notInternalErroris a leftover.
Thanks. Changed to:
// If the error is not a `dynamicQueryHasNoHomeRegionError`, unwrap the
// root cause.
Code quote:
If the error is wrapped in a notInternalError, skip to the root cause.pkg/sql/execinfra/errors.go line 75 at r10 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: ditto.
Fixed
pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Interesting, nvm then.
sure, thanks
pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Nice!
thanks
pkg/sql/colexec/serial_unordered_synchronizer.go line 101 at r15 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Was this a bug that we previously created a retryable error here? If so, it should probably be fixed in the first commit.
Yeah, an extra retryable error was made here. Fixed the commits.
pkg/sql/colexecerror/error.go line 164 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I was suggesting a different approach, but what you did seems even better, thanks!
OK, thanks
|
Build succeeded: |
In cockroachdb#97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability. Fixes: cockroachdb#113765 Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.)
In cockroachdb#97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability. Fixes: cockroachdb#113765 Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.)
In cockroachdb#97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability. Fixes: cockroachdb#113765 Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.)
In cockroachdb#97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability. Fixes: cockroachdb#113765 Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.)
148314: sql: remove enforce_home_region_follower_reads_enabled r=DrewKimball a=michae2 In #97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability. Fixes: #113765 Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.) Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
enforce_home_region phase 3
This extends the
enforce_home_regionsession setting to not error outright away when executing a locality-optimized search or join operation,
and the operation must read rows from a remote region. Instead, it will
retry the transaction, but with the local region, as marked in
evalCtx,updated to indicate one of the remote regions in the cluster. This causes
the locality-optimized Scan or Join operation to read from that remote
region first. A follower_read_timestamp() is used when re-running the
query with AOST, so no remote reads actually occur. If the query succeeds,
the current fake local region is reported back to the user in an error,
for example:
All of the remote regions in the cluster are tried until the home region
is found, or all remote regions are exhausted, in which case the original
"Query has no home region" error is returned.
This preview feature is only supported for SELECT queries.
Epic: CRDB-18645
Fixes: #83819
Release note (sql change): The enforce_home_region session setting was
extended with a new optional preview feature and session setting, which is
disabled by default, to dynamically detect and report the home region for
SELECT queries based on the locality of the queried rows, if different
from the gateway region.
execbuilder: add docs link to enforce_home_region errors
This commit adds a web URL to the text of "Query has no home region"
and "Query is not running in its home region" errors, with more
information about the errors.
Epic: CRDB-18645
Fixes: #86178
Release note (sql change): This adds a URL to errors related to the
enforce_home_region session setting, providing additional information
about the error.
sql: session setting enforce_home_region_follower_reads_enabled
This commit adds session setting
enforce_home_region_follower_reads_enabledwhich defaults to false. If enabled, it allows the
enforce_home_regionsetting to performAOST follower reads to determine a query's home region, if any.
Epic: CRDB-18645
Informs: #83819
Epic: CRDB-18645
Informs: #83819
Release note (sql change): A new session setting was added as a preview feature to allow
errors triggered by session setting enforce_home_region to perform AS OF SYSTEM TIME
follower_read_timestamp() reads in order to find and report a query's home region.
The session setting name is enforce_home_region_follower_reads_enabled.