Skip to content

sql: dynamically detect a query's home region with enforce_home_region#97827

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
msirek:phase3
Mar 7, 2023
Merged

sql: dynamically detect a query's home region with enforce_home_region#97827
craig[bot] merged 3 commits intocockroachdb:masterfrom
msirek:phase3

Conversation

@msirek
Copy link
Copy Markdown
Contributor

@msirek msirek commented Mar 1, 2023

enforce_home_region phase 3

This extends the enforce_home_region session setting to not error out
right away when executing a locality-optimized search or join operation,
and the operation must read rows from a remote region. Instead, it will
retry the transaction, but with the local region, as marked in evalCtx,
updated to indicate one of the remote regions in the cluster. This causes
the locality-optimized Scan or Join operation to read from that remote
region first. A follower_read_timestamp() is used when re-running the
query with AOST, so no remote reads actually occur. If the query succeeds,
the current fake local region is reported back to the user in an error,
for example:

Query is not running in its home region.
Try running the query from region 'us-east-1'.

All of the remote regions in the cluster are tried until the home region
is found, or all remote regions are exhausted, in which case the original
"Query has no home region" error is returned.

This preview feature is only supported for SELECT queries.

Epic: CRDB-18645
Fixes: #83819

Release note (sql change): The enforce_home_region session setting was
extended with a new optional preview feature and session setting, which is
disabled by default, to dynamically detect and report the home region for
SELECT queries based on the locality of the queried rows, if different
from the gateway region.


execbuilder: add docs link to enforce_home_region errors

This commit adds a web URL to the text of "Query has no home region"
and "Query is not running in its home region" errors, with more
information about the errors.

Epic: CRDB-18645
Fixes: #86178

Release note (sql change): This adds a URL to errors related to the
enforce_home_region session setting, providing additional information
about the error.


sql: session setting enforce_home_region_follower_reads_enabled

This commit adds session setting enforce_home_region_follower_reads_enabled
which defaults to false. If enabled, it allows the enforce_home_region setting to perform
AOST follower reads to determine a query's home region, if any.

Epic: CRDB-18645
Informs: #83819

Epic: CRDB-18645
Informs: #83819

Release note (sql change): A new session setting was added as a preview feature to allow
errors triggered by session setting enforce_home_region to perform AS OF SYSTEM TIME
follower_read_timestamp() reads in order to find and report a query's home region.
The session setting name is enforce_home_region_follower_reads_enabled.

@msirek msirek requested review from michae2, rytaft and yuzefovich March 1, 2023 01:42
@msirek msirek requested a review from a team as a code owner March 1, 2023 01:42
@msirek msirek requested a review from a team March 1, 2023 01:42
@msirek msirek requested a review from a team as a code owner March 1, 2023 01:42
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice stuff! I have some questions and nits.

Reviewed 21 of 21 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)


-- commits line 13 at r1:
Just to make sure I understand (let's consider an example of LOS where we want to read a single row that actually lives in us-west but we connected to us-east):

  • first, we use the actual locality of the gateway node (us-east) and perform a scan
  • that scan doesn't find the row, then we know we'll error out for the query as a whole; however, we want to extend the error message with the region where that single row lives
  • to do that we will modify the eval context to make the optimizer think that us-west is the local region, so it'll construct a scan for us-west
  • in order to not do cross-region read, we will use AOST with follower read timestamp.

Now, I'm wondering are we guaranteed to always have a follower replica of all remote regions in the local region? I.e. I'm wondering whether it's possible that we still will do a cross-region request to find out the home region for the query.


-- commits line 28 at r1:
nit: this probably deserves a release note.


pkg/sql/conn_executor.go line 2753 at r1 (raw file):

func errIsRetriable(err error) bool {
	isDynamicQueryHasNoHomeRegionError := row.IsDynamicQueryHasNoHomeRegionError(err)

nit: this variable declaration seems redundant given it's only used once and the logical ORs below inline the function calls for other error types.


pkg/sql/conn_executor.go line 3062 at r1 (raw file):

	// data which may be after the schema change when we retry.
	var minTSErr *kvpb.MinTimestampBoundUnsatisfiableError
	p.EvalContext().Locality = p.EvalContext().OriginalLocality

nit: perhaps add a quick comment for why we do this restore of the locality here.


pkg/sql/conn_executor.go line 3068 at r1 (raw file):

	} else if row.IsDynamicQueryHasNoHomeRegionError(autoRetryReason) {
		if int(ex.state.mu.autoRetryCounter) <= len(p.EvalContext().RemoteRegions) {
			p.EvalContext().Locality =

Some commentary here would be helpful. In particular, it'd be good to mention that this Locality field will only influence which region is considered "local" by the optimizer, but the query will not actually run on a node from that remote region as its gateway.


pkg/sql/conn_executor_exec.go line 752 at r1 (raw file):

	var rollbackSP *tree.RollbackToSavepoint
	var r *tree.ReleaseSavepoint
	enforceHomeRegion := p.IsANSIDML() && p.EvalContext().SessionData().EnforceHomeRegion

nit: I think we have a similar conditional in several places now, perhaps we could extract a small helper?


pkg/sql/conn_executor_exec.go line 768 at r1 (raw file):

		}

		r = &tree.ReleaseSavepoint{Savepoint: "enforce_home_region_sp"}

Do we need to care about a user also creating a savepoint with this name?


pkg/sql/conn_executor_exec.go line 807 at r1 (raw file):

				// A retryable "query has no home region" error has occurred.
				// Roll back to the internal savepoint in preparation for the next
				// planning and execution of this query with a different gateway region.

nit: perhaps add something like with a different gateway region (as considered by the optimizer).


pkg/sql/conn_executor_exec.go line 816 at r1 (raw file):

				if etr == nil || eventPayload != nil {
					err = errors.AssertionFailedf(
						"unable to roll back internal savepoint for enforce_home_region",

nit: s/roll back/roll back to/ here and below.


pkg/sql/conn_executor_exec.go line 824 at r1 (raw file):

				}
			} else {
				internalError := errors.AssertionFailedf(

Why didn't we override err here?


pkg/sql/conn_executor_exec.go line 836 at r1 (raw file):

			// non-retryable form.
			errorMessage := err.Error()
			if !strings.HasPrefix(errorMessage, "Query is not running in its home region") {

nit: perhaps this constant string should be extracted somewhere to a place where we're constructing the error?


pkg/sql/conn_executor_exec.go line 1412 at r1 (raw file):

	if res.Err() == nil && err == nil {
		autoRetryReason := ex.state.mu.autoRetryReason
		if row.IsDynamicQueryHasNoHomeRegionError(ex.state.mu.autoRetryReason) {

nit: we could use autoRetryReason variable here.


pkg/sql/conn_executor_exec.go line 1423 at r1 (raw file):

				)
				res.SetError(err)
				return nil

Why are we returning nil here and below?


pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):

				return nil
			}
			// If for some reason we're not running the same query as before, report

How can we run a different query than the one we're retrying with a different gateway region?


pkg/sql/colexec/BUILD.bazel line 67 at r1 (raw file):

        "//pkg/sql/execinfrapb",
        "//pkg/sql/memsize",
        "//pkg/sql/row",

nit: this dependency of colexec on row seems a bit off. Perhaps we should move the new code from row/errors.go into a different package, say execinfra - that package seems more suitable for that.


pkg/sql/colexecerror/error.go line 164 at r1 (raw file):

// NotInternalError will be returned to the client not as an
// "internal error" and without the stack trace.
type NotInternalError struct {

Let's not change the name here (since this error is sent across the wire, we'd need to register a type migration if we did, otherwise sentry reports can be broken). Instead, let's introduce some unwrapper exported methods that return the cause in colexecerror and use those in row package.


pkg/sql/opt/optbuilder/select.go line 582 at r1 (raw file):

		// Populate the remote regions touched by the multiregion database used in
		// this query. If a query dynamically errors out as having no home region,
		// these

nit: incomplete comment.


pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):

			if regionsNames, ok := tabMeta.GetRegionsInDatabase(b.ctx, b.evalCtx.Planner); ok {
				if gatewayRegion, ok := b.evalCtx.Locality.Find("region"); ok {
					b.evalCtx.RemoteRegions = make(catpb.RegionNames, 0, len(regionsNames))

nit: len(regionNames)-1 since we don't include the gateway.


pkg/sql/sem/eval/context.go line 99 at r1 (raw file):

	Locality roachpb.Locality

	OriginalLocality roachpb.Locality

nit: add a quick comment for why we need this and how it might differ from Locality field. Perhaps also adjust the comment on that field too.


pkg/sql/sem/eval/context.go line 258 at r1 (raw file):

	// RemoteRegions contains the slice of remote regions in a multiregion
	// database which owns a table accessed by the current SQL request.

nit: mention that this slice is only populated during the optbuilding stage.


pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):

retry
statement error pq: Query is not running in its home region\. Try running the query from region 'us-east-1'\.
SELECT * FROM t1 WHERE a=1;

Should we also add some tests that use tracing so that we could ensure that the internal behavior is as expected?

@msirek msirek requested a review from a team March 2, 2023 02:35
@msirek msirek force-pushed the phase3 branch 2 times, most recently from d078e02 to e05fed9 Compare March 2, 2023 09:20
Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


-- commits line 13 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Just to make sure I understand (let's consider an example of LOS where we want to read a single row that actually lives in us-west but we connected to us-east):

  • first, we use the actual locality of the gateway node (us-east) and perform a scan
  • that scan doesn't find the row, then we know we'll error out for the query as a whole; however, we want to extend the error message with the region where that single row lives
  • to do that we will modify the eval context to make the optimizer think that us-west is the local region, so it'll construct a scan for us-west
  • in order to not do cross-region read, we will use AOST with follower read timestamp.

Now, I'm wondering are we guaranteed to always have a follower replica of all remote regions in the local region? I.e. I'm wondering whether it's possible that we still will do a cross-region request to find out the home region for the query.

Yes, that summary is correct. Whether we'd always have a follower replica in the local region, I don't know, maybe not. It's a best effort attempt to keep latency during an error condition low. Is there something that works better than follower reads?


-- commits line 28 at r1:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this probably deserves a release note.

added:

Release note (sql change): The enforce_home_region session setting was
extended to dynamically detect and report a query's home region, based on
the locality of the queried rows, if different from the gateway region.

pkg/sql/conn_executor.go line 2753 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this variable declaration seems redundant given it's only used once and the logical ORs below inline the function calls for other error types.

Removed the variable

Code quote:

isDynamicQueryHasNoHomeRegionError := row.IsDynamicQueryHasNoHomeRegionError(err)

pkg/sql/conn_executor.go line 3062 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps add a quick comment for why we do this restore of the locality here.

Added:

	// Make sure the default locality specifies the actual gateway region at the
	// start of query compilation. It could have been overridden to a remote
	// region when the enforce_home_region session setting is true.

I've gone ahead and added some code to also restore the locality right after we determine which non-retryable error to use (see setErrorAndRestoreLocality).


pkg/sql/conn_executor.go line 3068 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Some commentary here would be helpful. In particular, it'd be good to mention that this Locality field will only influence which region is considered "local" by the optimizer, but the query will not actually run on a node from that remote region as its gateway.

Added:

			// Set a fake gateway region for use by the optimizer to inform its
			// decision on which region to access first in locality-optimized scan
			// and join operations. This setting does not affect the distsql planner,
			// and local plans will continue to be run from the actual gateway region.

pkg/sql/conn_executor_exec.go line 752 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think we have a similar conditional in several places now, perhaps we could extract a small helper?

Added planner.EnforceHomeRegion.


pkg/sql/conn_executor_exec.go line 768 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Do we need to care about a user also creating a savepoint with this name?

Probably so, to be safe, even though a collision is unlikely. Added some unprintable characters to it:

		var b strings.Builder
		b.WriteString("enforce_home_region_sp")
		// Add some unprintable ASCII characters to the name of the savepoint to
		// decrease the likelihood of collision with a user-created savepoint.
		b.WriteRune(rune(0x11))
		b.WriteRune(rune(0x12))
		b.WriteRune(rune(0x13))
		enforceHomeRegionSavepointName := tree.Name(b.String())

pkg/sql/conn_executor_exec.go line 807 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps add something like with a different gateway region (as considered by the optimizer).

Done


pkg/sql/conn_executor_exec.go line 816 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: s/roll back/roll back to/ here and below.

Done

Code quote:

unable to roll back internal savepoint

pkg/sql/conn_executor_exec.go line 824 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why didn't we override err here?

Yeah, no need for a new variable. Fixed it.

Code quote:

internalError := errors.AssertionFailedf(

pkg/sql/conn_executor_exec.go line 836 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps this constant string should be extracted somewhere to a place where we're constructing the error?

Added execinfra.QueryNotRunningInHomeRegionMessagePrefix.

Code quote:

Query is not running in its home region

pkg/sql/conn_executor_exec.go line 1412 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we could use autoRetryReason variable here.

Fixed.


pkg/sql/conn_executor_exec.go line 1423 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why are we returning nil here and below?

So that when we return it'll trigger a call to makeErrEvent.


pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

How can we run a different query than the one we're retrying with a different gateway region?

Maybe with more than one statement on a line, e.g. 'USE db; SELECT * FROM t1;', it could rewind to the USE statement? For example, if I try this test:

query III nodeidx=8
SET enforce_home_region = true;
USE multi_region_test_db;
SELECT * FROM t1 WHERE a=2;
----
2  1  1

It rewinds to the USE statement (maybe the query command concatenates and sends all SQLs in a single request?), which we shouldn't retry, so we get the generic error message:

Query has no home region. Try using a lower LIMIT value or running the query from a different region.

I'm not sure how important it is to handle this case, since I imagine typically queries are sent one-at-a-time. But we still need to return the proper generic error message here instead of trying to run the "USE" statement with AOST.


pkg/sql/opt/optbuilder/select.go line 582 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: incomplete comment.

Updated it to:

		// Populate the remote regions touched by the multiregion database used in
		// this query. If a query dynamically errors out as having no home region,
		// the query will be replanned with each of the remote regions,
		// one-at-a-time, with AOST follower_read_timestamp(). If one of these
		// retries doesn't error out, that region will be reported to the user as
		// the query's home region.

pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: len(regionNames)-1 since we don't include the gateway.

The gateway region may not be one of the regions in the multiregion db.


pkg/sql/sem/eval/context.go line 99 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: add a quick comment for why we need this and how it might differ from Locality field. Perhaps also adjust the comment on that field too.

Done

Code quote:

OriginalLocality roachpb.Locality

pkg/sql/sem/eval/context.go line 258 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: mention that this slice is only populated during the optbuilding stage.

Done


pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Should we also add some tests that use tracing so that we could ensure that the internal behavior is as expected?

OK, I've added a couple of tracing tests to check all KV batches are sent to the local node.


pkg/sql/colexec/BUILD.bazel line 67 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this dependency of colexec on row seems a bit off. Perhaps we should move the new code from row/errors.go into a different package, say execinfra - that package seems more suitable for that.

Done


pkg/sql/colexecerror/error.go line 164 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Let's not change the name here (since this error is sent across the wire, we'd need to register a type migration if we did, otherwise sentry reports can be broken). Instead, let's introduce some unwrapper exported methods that return the cause in colexecerror and use those in row package.

I reverted the name and just fixed IsDynamicQueryHasNoHomeRegionError and MaybeGetNonRetryableDynamicQueryHasNoHomeRegionError to unwrap anything that's not a dynamicQueryHasNoHomeRegionError. Not sure if you had something better in mind.

@dhartunian dhartunian removed the request for review from a team March 2, 2023 15:24
@msirek
Copy link
Copy Markdown
Contributor Author

msirek commented Mar 2, 2023

@taroface Could you perhaps look at the form of the error message in this PR from a docs perspective and see if there are any issues or suggestions for change? :

Query is not running in its home region. Try running the query from region 'us-east-1'.

Is there anyone else from the docs team you think might want to look at this?

@msirek msirek requested a review from taroface March 2, 2023 17:40
@taroface
Copy link
Copy Markdown
Collaborator

taroface commented Mar 2, 2023

@taroface Could you perhaps look at the form of the error message in this PR from a docs perspective and see if there are any issues or suggestions for change? :

Query is not running in its home region. Try running the query from region 'us-east-1'.

Is there anyone else from the docs team you think might want to look at this?

Thanks for asking -- only thought is to include the exact syntax for specifying the region (I'm not sure what it is?).

Query is not running in its home region. Try running the query with '<flag> us-east-1'.

EDIT: Also, in this context, is it worth reminding that enforce_home_region is on, and explaining why, or offering the command to disable?

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only thought is to include the exact syntax for specifying the region (I'm not sure what it is?).

The only way I know to do it is for the application to disconnect and then reconnect with a connection string specifying a node in a different region. The application writer would have to know which nodes are in which regions. There is info about enforce_home_region on this page: https://www.cockroachlabs.com/docs/stable/set-vars.html, but it doesn't mention how to handle the Query is not running in its home region error. If it did, I could just add a link to that page in the error message. Or do you have other thoughts on rewording this? I guess the question is, do we want the error message to be really verbose, or do we want to redirect the user to docs for additional info.r

Also, in this context, is it worth reminding that enforce_home_region is on, and explaining why, or offering the command to disable?

Not sure. The user would have had to manually turn on this setting to hit the error message, and probably knows what it's used for before using it. If the error message included a line, "Please see enforce_home_region at https://www.cockroachlabs.com/docs/stable/set-vars.html" that could provide more info. What do you think?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, @taroface, and @yuzefovich)

@taroface
Copy link
Copy Markdown
Collaborator

taroface commented Mar 2, 2023

Per our convo offline, opened cockroachdb/docs#16404 to enhance the doc to which this message can be linked.

Is there a variable that can specify the correct doc version in the URL?

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a variable that can specify the correct doc version in the URL?

Do you mean, should the error message print a version-specific URL? I don't know, it may be good to just report the stable URL in case we want to include small tweaks to the wording. The new text you just added is applicable to v22.2 as well. What do you think?

I'm planning to suffix each error message with the following?

See https://www.cockroachlabs.com/docs/stable/cost-based-optimizer.html#control-whether-queries-are-limited-to-a-single-region.

Any recommended tweaks, or is this good?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, @taroface, and @yuzefovich)

Copy link
Copy Markdown
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very impressive turnaround! Sorry I still haven't had a chance to look at this in detail, but I added one high-level comment below. I'm OOO tomorrow so no need to wait for my review if you get sign-off before I come back.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)


-- commits line 13 at r1:

Previously, msirek (Mark Sirek) wrote…

Yes, that summary is correct. Whether we'd always have a follower replica in the local region, I don't know, maybe not. It's a best effort attempt to keep latency during an error condition low. Is there something that works better than follower reads?

Follower reads seems like a reasonable option, but it would be better if we could prevent it from visiting a remote region if there is no local replica (or the local replica is not sufficiently up-to-date). We don't yet support the option to error instead of visiting a remote region with follower reads, but there is an issue open for it: #97512.

In a discussion with @andy-kimball today, he mentioned that whatever solution we choose, we should make sure that it's not worse than what a user could do on their own using GLOBAL tables. This approach seems slightly worse to me, since a PLACEMENT RESTRICTED table would not have a local replica in every region and therefore would cause a remote read every time a user tried to access it from outside the home region.

Unless we can come up with a better solution, I would suggest that we put this logic behind a cluster setting that is defaulted to false in 23.1, and then set it to true once we address #97512.

@andy-kimball
Copy link
Copy Markdown
Contributor

I'd also be worried about re-running the query for every region, if there's any chance the follower reads might exit the current region (e.g. the closed timestamp is delayed) and cause N calls to other regions. As Becca said, we only want to enhance the error message if it's as good or better as what the user can do themself (e.g. querying a global users table to find the home region).

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I can't think of any other solutions, so I guess I'll add a cluster setting as suggested for this to be disabled by default until #97512 is fixed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added cluster setting: sql.multiregion.enforce_home_region_follower_reads.enabled

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)

Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, @taroface, and @yuzefovich)


-- commits line 13 at r1:

Unless we can come up with a better solution, I would suggest that we put this logic behind a cluster setting that is defaulted to false in 23.1, and then set it to true once we address #97512.

Added cluster setting: sql.multiregion.enforce_home_region_follower_reads.enabled

@taroface
Copy link
Copy Markdown
Collaborator

taroface commented Mar 6, 2023

@msirek Linking to stable sounds good. My suggestion for the link copy:

For more information, see https://www.cockroachlabs.com/docs/stable/cost-based-optimizer.html#control-whether-queries-are-limited-to-a-single-region

I'm not sure if having the period after the link will affect the link. If you think it won't, then feel free to add it!

@msirek msirek force-pushed the phase3 branch 2 times, most recently from cbb8a69 to 53b219b Compare March 6, 2023 23:15
Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For more information, see https://www.cockroachlabs.com/docs/stable/cost-based-optimizer.html#control-whether-queries-are-limited-to-a-single-region

@taroface I've updated the sentence and link as suggested, with no trailing period.

In the multiregion meeting today it was decided to make this a preview feature, so I've marked up the Release notes to indicate so. Let me know if you think more is needed to calls this out for the docs team. Instead of a cluster setting, this feature will now be controlled via a session setting, enforce_home_region_follower_reads_enabled, which defaults to false.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @michae2, and @yuzefovich)

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r2, 11 of 25 files at r3, 1 of 1 files at r4, 22 of 22 files at r10, 2 of 14 files at r12, 10 of 10 files at r14, 12 of 12 files at r15, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2 and @msirek)


-- commits line 53 at r15:
nit: "setting setting".


pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

Maybe with more than one statement on a line, e.g. 'USE db; SELECT * FROM t1;', it could rewind to the USE statement? For example, if I try this test:

query III nodeidx=8
SET enforce_home_region = true;
USE multi_region_test_db;
SELECT * FROM t1 WHERE a=2;
----
2  1  1

It rewinds to the USE statement (maybe the query command concatenates and sends all SQLs in a single request?), which we shouldn't retry, so we get the generic error message:

Query has no home region. Try using a lower LIMIT value or running the query from a different region.

I'm not sure how important it is to handle this case, since I imagine typically queries are sent one-at-a-time. But we still need to return the proper generic error message here instead of trying to run the "USE" statement with AOST.

I see, thanks for the context.


pkg/sql/plan_opt.go line 371 at r10 (raw file):

		// caching should not be done.
		opc.allowMemoReuse = !p.Descriptors().HasUncommittedTables() && len(p.EvalContext().RemoteRegions) == 0
		opc.useCache = opc.allowMemoReuse && queryCacheEnabled.Get(&p.execCfg.Settings.SV) && len(p.EvalContext().RemoteRegions) == 0

nit: the last part of the conditional is already included in allowMemoReuse computation, so it is redundant here. However, it might be safer to explicitly add it, so up to you.


pkg/sql/colflow/vectorized_flow.go line 972 at r15 (raw file):

				err = input.EnforceHomeRegionError.ErrorDetail(ctx)
				if flowCtx.EvalCtx.SessionData().EnforceHomeRegionFollowerReadsEnabled {
					err = execinfra.NewDynamicQueryHasNoHomeRegionError(err)

nit: it might be cleaner and safer to pass the session data object into the NewDynamicQueryHasNoHomeRegionError directly, and that method would only return retryable error if the corresponding boolean is true. Up to you though.


pkg/sql/execinfra/errors.go line 52 at r10 (raw file):

		return false
	}
	// If the error is wrapped in a notInternalError, skip to the root cause.

nit: notInternalError is a leftover.


pkg/sql/execinfra/errors.go line 75 at r10 (raw file):

		return nil
	}
	// If the error is wrapped in a notInternalError, skip to the root cause.

nit: ditto.


pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

The gateway region may not be one of the regions in the multiregion db.

Interesting, nvm then.


pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

OK, I've added a couple of tracing tests to check all KV batches are sent to the local node.

Nice!


pkg/sql/colexec/serial_unordered_synchronizer.go line 101 at r15 (raw file):

			s.curSerialInputIdx++
			if s.serialInputIdxExclusiveUpperBound > 0 && s.curSerialInputIdx >= int(s.serialInputIdxExclusiveUpperBound) {
				colexecerror.ExpectedError(s.exceedsInputIdxExclusiveUpperBoundError)

Was this a bug that we previously created a retryable error here? If so, it should probably be fixed in the first commit.


pkg/sql/colexecerror/error.go line 164 at r1 (raw file):

Previously, msirek (Mark Sirek) wrote…

I reverted the name and just fixed IsDynamicQueryHasNoHomeRegionError and MaybeGetNonRetryableDynamicQueryHasNoHomeRegionError to unwrap anything that's not a dynamicQueryHasNoHomeRegionError. Not sure if you had something better in mind.

I was suggesting a different approach, but what you did seems even better, thanks!

Mark Sirek added 3 commits March 6, 2023 16:49
enforce_home_region phase 3
----

This extends the `enforce_home_region` session setting to not error out
right away when executing a locality-optimized search or join operation,
and the operation must read rows from a remote region. Instead, it will
retry the transaction, but with the local region, as marked in `evalCtx`,
updated to indicate one of the remote regions in the cluster. This causes
the locality-optimized Scan or Join operation to read from that remote
region first. A follower_read_timestamp() is used when re-running the
query with AOST, so no remote reads actually occur. If the query succeeds,
the current fake local region is reported back to the user in an error,
for example:
```
Query is not running in its home region.
Try running the query from region 'us-east-1'.
```
All of the remote regions in the cluster are tried until the home region
is found, or all remote regions are exhausted, in which case the original
"Query has no home region" error is returned.

This preview feature is only supported for SELECT queries.

Epic: CRDB-18645
Fixes: cockroachdb#83819

Release note (sql change): The enforce_home_region session setting was
extended with a new optional preview feature and session setting, which is
disabled by default, to dynamically detect and report the home region for
SELECT queries based on the locality of the queried rows, if different
from the gateway region.
This commit adds a web URL to the text of "Query has no home region"
and "Query is not running in its home region" errors, with more
information about the errors.

Epic: CRDB-18645
Fixes: cockroachdb#86178

Release note (sql change): This adds a URL to errors related to the
enforce_home_region session setting, providing additional information
about the error.
This commit adds session setting `enforce_home_region_follower_reads_enabled`
which defaults to false. If enabled, it allows the `enforce_home_region` setting to perform
AOST follower reads to determine a query's home region, if any.

Epic: CRDB-18645
Informs: cockroachdb#83819

Release note (sql change): A new session setting was added as a preview feature to allow
errors triggered by session setting enforce_home_region to perform AS OF SYSTEM TIME
follower_read_timestamp() reads in order to find and report a query's home region.
The session setting name is enforce_home_region_follower_reads_enabled.
Copy link
Copy Markdown
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!
bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @michae2 and @yuzefovich)


-- commits line 53 at r15:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: "setting setting".

Thanks, fixed.


pkg/sql/conn_executor_exec.go line 1425 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I see, thanks for the context.

Sure


pkg/sql/plan_opt.go line 371 at r10 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: the last part of the conditional is already included in allowMemoReuse computation, so it is redundant here. However, it might be safer to explicitly add it, so up to you.

Removed as suggested.

Code quote:

opc.useCache = opc.allowMemoReuse && queryCacheEnabled.Get(&p.execCfg.Settings.SV) && len(p.EvalContext(

pkg/sql/colflow/vectorized_flow.go line 972 at r15 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: it might be cleaner and safer to pass the session data object into the NewDynamicQueryHasNoHomeRegionError directly, and that method would only return retryable error if the corresponding boolean is true. Up to you though.

Hmm, I'm not sure. It's only a single conditional and is easy to read inline with the rest of the logic this way. Also, the function name would have to change to something unwieldly like MaybeWrapErrorWithDynamicQueryHasNoHomeRegionError. Also, the session data is not always available, like in performLookup():

		if jr.allowEnforceHomeRegionFollowerReads {
			err = execinfra.NewDynamicQueryHasNoHomeRegionError(err)
		}

I might just keep this as-is.


pkg/sql/execinfra/errors.go line 52 at r10 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: notInternalError is a leftover.

Thanks. Changed to:

		// If the error is not a `dynamicQueryHasNoHomeRegionError`, unwrap the
		// root cause.	

Code quote:

If the error is wrapped in a notInternalError, skip to the root cause.

pkg/sql/execinfra/errors.go line 75 at r10 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: ditto.

Fixed


pkg/sql/opt/optbuilder/select.go line 586 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Interesting, nvm then.

sure, thanks


pkg/ccl/logictestccl/testdata/logic_test/multi_region_remote_access_error line 876 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Nice!

thanks


pkg/sql/colexec/serial_unordered_synchronizer.go line 101 at r15 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Was this a bug that we previously created a retryable error here? If so, it should probably be fixed in the first commit.

Yeah, an extra retryable error was made here. Fixed the commits.


pkg/sql/colexecerror/error.go line 164 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I was suggesting a different approach, but what you did seems even better, thanks!

OK, thanks

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 7, 2023

Build succeeded:

@craig craig bot merged commit 781e541 into cockroachdb:master Mar 7, 2023
@msirek msirek deleted the phase3 branch March 14, 2023 17:48
michae2 added a commit to michae2/cockroach that referenced this pull request Jun 13, 2025
In cockroachdb#97827 we added a new ability to dynamically determine the home
region of a row when a query failed due to `enforce_home_region`.
Unfortunately, the strategy of switching to follower reads after a
savepoint rollback is no longer permitted by the KV layer. This commit
removes the dynamic home region lookup ability.

Fixes: cockroachdb#113765

Release note (sql change): The functionality provided by session
variable `enforce_home_region_follower_reads_enabled` was deprecated in
v24.2.4 and is now removed. (The variable itself remains for backward
compatibility but has no effect.) (Note that related session variable
`enforce_home_region` is _not_ deprecated and still functions normally.)
michae2 added a commit to michae2/cockroach that referenced this pull request Jun 25, 2025
In cockroachdb#97827 we added a new ability to dynamically determine the home
region of a row when a query failed due to `enforce_home_region`.
Unfortunately, the strategy of switching to follower reads after a
savepoint rollback is no longer permitted by the KV layer. This commit
removes the dynamic home region lookup ability.

Fixes: cockroachdb#113765

Release note (sql change): The functionality provided by session
variable `enforce_home_region_follower_reads_enabled` was deprecated in
v24.2.4 and is now removed. (The variable itself remains for backward
compatibility but has no effect.) (Note that related session variable
`enforce_home_region` is _not_ deprecated and still functions normally.)
michae2 added a commit to michae2/cockroach that referenced this pull request Aug 22, 2025
In cockroachdb#97827 we added a new ability to dynamically determine the home
region of a row when a query failed due to `enforce_home_region`.
Unfortunately, the strategy of switching to follower reads after a
savepoint rollback is no longer permitted by the KV layer. This commit
removes the dynamic home region lookup ability.

Fixes: cockroachdb#113765

Release note (sql change): The functionality provided by session
variable `enforce_home_region_follower_reads_enabled` was deprecated in
v24.2.4 and is now removed. (The variable itself remains for backward
compatibility but has no effect.) (Note that related session variable
`enforce_home_region` is _not_ deprecated and still functions normally.)
michae2 added a commit to michae2/cockroach that referenced this pull request Aug 25, 2025
In cockroachdb#97827 we added a new ability to dynamically determine the home
region of a row when a query failed due to `enforce_home_region`.
Unfortunately, the strategy of switching to follower reads after a
savepoint rollback is no longer permitted by the KV layer. This commit
removes the dynamic home region lookup ability.

Fixes: cockroachdb#113765

Release note (sql change): The functionality provided by session
variable `enforce_home_region_follower_reads_enabled` was deprecated in
v24.2.4 and is now removed. (The variable itself remains for backward
compatibility but has no effect.) (Note that related session variable
`enforce_home_region` is _not_ deprecated and still functions normally.)
craig bot pushed a commit that referenced this pull request Aug 25, 2025
148314: sql: remove enforce_home_region_follower_reads_enabled r=DrewKimball a=michae2

In #97827 we added a new ability to dynamically determine the home region of a row when a query failed due to `enforce_home_region`. Unfortunately, the strategy of switching to follower reads after a savepoint rollback is no longer permitted by the KV layer. This commit removes the dynamic home region lookup ability.

Fixes: #113765

Release note (sql change): The functionality provided by session variable `enforce_home_region_follower_reads_enabled` was deprecated in v24.2.4 and is now removed. (The variable itself remains for backward compatibility but has no effect.) (Note that related session variable `enforce_home_region` is _not_ deprecated and still functions normally.)

Co-authored-by: Michael Erickson <michae2@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

6 participants