server: honor and validate the service mode for SQL pods by knz · Pull Request #96144 · cockroachdb/cockroach

knz · 2023-01-28T18:38:29Z

Rebased on top of #105441.
Fixes #93145.
Fixes #83650.

TLDR: this makes SQL servers only accept to start if their data state
is "ready" and their deployment type (e.g. separate process) matches
the configured service mode in the tenant record.
Additionally, the SQL server spontaneously terminates if
their service mode or data state is not valid any more (e.g
as a result of DROP TENANT or ALTER TENANT STOP SERVICE).

Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):

if a SQL server was started against a non-existent tenant,
an obscure error would be raised (database "[1]" does not exist)
if a SQL server was started while a tenant was being added
(keyspace not yet valid), no check would be performed and
data corruption could ensue.
if the tenant record/keyspace was dropped while the SQL server was
running, SQL clients would start encountering obscure errors.

This commit fixes the situation by checking the tenant metadata:

once, during SQL server startup, at which point server startup
is prevented if the service check fails;
then, asynchronously, whenever the metadata is updated, such that
any service check failure results in a graceful shutdown of the SQL
service.

The check proper validates:

that the tenant record exists;
that the data state is "ready";
that the configured service mode matches that requested by the SQL
server.

Examples output upon error:

non-existent tenant:

tenant service check failed: missing record

attempting to start separate-process server while
tenant is running as shared-process:

tenant service check failed: service mode check failed: expected external, record says shared

after ALTER TENANT STOP SERVICE:

tenant service check failed: service mode check failed: expected external, record says none

after DROP TENANT:

tenant service check failed: service mode check failed: expected external, record says dropping

Release note: None

cockroach-teamcity · 2023-01-28T18:38:37Z

This change is

knz · 2023-01-28T18:41:44Z

@stevendanna @arulajmani could you have a look at this PR before we invest in tests.

Currently it uses a unary TenantServiceCheck which is called periodically. I think this would be better served eventually by a streaming RPC that hooks into a rangefeed over the tenant record. Do we want to do this right away? Should we go the way of this PR now, and go for the streaming RPC later?

For the rangefeed approach, I think I will need some help. I feel that this should be integrated with the capability watched somehow but I'm not sure about the APIs.

stevendanna · 2023-01-30T11:05:26Z

I think we'll probably need to rewrite some streaming tests for this. I can help with that this week.

stevendanna · 2023-01-30T11:15:40Z

I think this would be better served eventually by a streaming RPC that hooks into a rangefeed over the tenant record. Do we want to do this right away? Should we go the way of this PR now, and go for the streaming RPC later?

I'm OK with this simple approach for now. Mostly because I would like to see us think about the tenant process lifecycle a bit more before dedicating ourselves to a design. I have questions such as whether service mode is the right abstraction in the long run. If so, do we need some system that avoids a situation in which a change in service mode doesn't allow for external and shared process sql servers running concurrently? Etc. These feel like longer run questions that we don't need to answer today to make some progress.

knz · 2023-07-11T17:50:03Z

RFAL

do we need some system that avoids a situation in which a change in service mode doesn't allow for external and shared process sql servers running concurrently?

The new version of the PR answers this question: now that the service mode is pushed by KV to the SQL service, if we ever desire to support mixed deployments we can have KV servers push different values to different clients.

106609: cli,server: log the initiation of shutdown more clearly r=yuzefovich a=knz Epic: CRDB-26691 Helped debug issues with #96144. Prior to this patch, the error object that triggers spontaneous shutdown and flows through the `stopTrigger` was only kept in RAM and printed at the very tail end of server shutdown (when the CLI code was exiting). This patch ensures it is logged as soon as the shutdown is triggered. It makes debugging slightly easier. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

knz · 2023-07-23T15:33:04Z

@stevendanna @yuzefovich RFAL.

(Before this can merge, we will need to do something about the 3s startup overhead. In this PR, this overhead is now incurred for all uses of the test tenant, not just those that use nodelocal.)

stevendanna

Overall this seems like a reasonable step forward to me.

We should double check we are prepared to have this running in Serverless.

stevendanna · 2023-07-24T11:01:08Z

pkg/server/server_sql.go

+	if s.tenantConnect == nil {
+		return errors.AssertionFailedf("programming error: can only check service with a tenant connector")
+	}


Here and elsewhere could be an opportunity to adopt the new must library.

stevendanna · 2023-07-24T11:05:33Z

pkg/server/server_sql.go

+		if !entry.Ready() {
+			// At the version this check was introduced, the server was
+			// already modified to provide metadata during the initial
+			// handshake.
+			//
+			// If we don't have the metadata here, this means that the
+			// connector is talking to an older-version server.
+			return errors.AssertionFailedf("storage layer did not communicate metadata, is it running an older binary version than the client?")
+		}
+
+		if expected, actual := s.serviceMode, entry.ServiceMode; expected != actual {
+			return errors.Newf("service check failed: expected service mode %v, record says %v", expected, actual)
+		}
+		// Extra sanity check. This should never happen (we should enforce
+		// a valid data state when the service mode is not NONE) but it's
+		// cheap to check.
+		if expected, actual := mtinfopb.DataStateReady, entry.DataState; expected != actual {
+			return errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual)
+		}
+		return nil


Minor, but I could see moving this to a s.checkServerModeMatchesEntry(entry)

stevendanna · 2023-07-24T11:09:26Z

pkg/server/testserver.go

+		if _, err := ts.InternalExecutor().(*sql.InternalExecutor).Exec(
+			ctx, "testserver-set-tenant-service-mode", nil, /* txn */
+			"ALTER TENANT [$1] START SERVICE EXTERNAL",
+			params.TenantID.ToUint64(),
+		); err != nil {


Should we add a check for the service state into SkipTenantCheck below?

yuzefovich

LGTM, but I'll defer to Steven for approval.

Reviewed 1 of 5 files at r9, 46 of 46 files at r13, 41 of 41 files at r14, 41 of 41 files at r15, 45 of 45 files at r16, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @smg260, and @stevendanna)

-- commits line 32 at r16:
nit: something is off in "only accept to start".

pkg/server/server_sql.go line 1702 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Here and elsewhere could be an opportunity to adopt the new must library.

I'm guessing that we'll want to backport this change (right?), and must library won't be available on 23.1 (it's not available on master yet either).

knz

Reviewed 1 of 14 files at r1, 1 of 46 files at r13.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @smg260, @stevendanna, and @yuzefovich)

-- commits line 32 at r16:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: something is off in "only accept to start".

Adjusted.

pkg/server/server_sql.go line 1702 at r16 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm guessing that we'll want to backport this change (right?), and must library won't be available on 23.1 (it's not available on master yet either).

I'm ok with changing the assertion code during an eventual backport, but yahor is right and must hasn't merged yet.

pkg/server/server_sql.go line 1728 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Minor, but I could see moving this to a s.checkServerModeMatchesEntry(entry)

Done.

pkg/server/testserver.go line 1086 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Should we add a check for the service state into SkipTenantCheck below?

It seems unnecessary, since the error you get if it's not right is pretty clear about what's going on.

But maybe you'd like to suggest, instead of a check, that StartTenant should proactively run ALTER TENANT START SERVICE EXTERNAL, perhaps unless a testing knob is set?

knz · 2023-07-26T00:58:12Z

We should double check we are prepared to have this running in Serverless.

Internal conversation here.

@stevendanna it seems to me we are in the clear. WDYT?

TLDR: SQL servers now start successfully only if their data state is "ready" and their deployment type (e.g. separate process) matches the configured service mode in the tenant record. Additionally, the SQL server spontaneously terminates if their service mode or data state is not valid any more (e.g as a result of DROP TENANT or ALTER TENANT STOP SERVICE). ---- Prior to this patch, there wasn't a good story about the lifecycle of separate-process SQL servers ("SQL pods"): - if a SQL server was started against a non-existent tenant, an obscure error would be raised (`database "[1]" does not exist`) - if a SQL server was started while a tenant was being added (keyspace not yet valid), no check would be performed and data corruption could ensue. - if the tenant record/keyspace was dropped while the SQL server was running, SQL clients would start encountering obscure errors. This commit fixes the situation by checking the tenant metadata: - once, during SQL server startup, at which point server startup is prevented if the service check fails; - then, asynchronously, whenever the metadata is updated, such that any service check failure results in a graceful shutdown of the SQL service. The check proper validates: - that the tenant record exists; - that the data state is "ready"; - that the configured service mode matches that requested by the SQL server. Examples output upon error: - non-existent tenant: ``` tenant service check failed: missing record ``` - attempting to start separate-process server while tenant is running as shared-process: ``` tenant service check failed: service mode check failed: expected external, record says shared ``` - after ALTER TENANT STOP SERVICE: ``` tenant service check failed: service mode check failed: expected external, record says none ``` - after DROP TENANT: ``` tenant service check failed: service mode check failed: expected external, record says dropping ``` Release note: None

knz · 2023-07-26T11:20:47Z

TFYR

bors r=yuzefovich,stevendanna

craig · 2023-07-26T13:00:22Z

Build succeeded:

Bazel Essential CI (Cockroach)

TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by cockroachdb#96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None

108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz First commit from #108612. Thanks to `@stevendanna` for brainstorming solutions. TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by #96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None Fixes #107856 Epic: CRDB-26691 108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts. Release note: None 108723: serverctl: fix a docstring r=yuzefovich a=knz Requested by `@yuzefovich` in #108612 (review) Release note: None Epic: CRDB-26691 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>

knz requested a review from stevendanna January 28, 2023 18:38

knz force-pushed the 20230128-service-mode branch 3 times, most recently from 44b6704 to 353fa43 Compare January 28, 2023 18:47

knz added the A-multitenancy Related to multi-tenancy label Jan 28, 2023

knz force-pushed the 20230128-service-mode branch from 353fa43 to 0118b30 Compare January 31, 2023 17:49

knz mentioned this pull request Mar 11, 2023

server: meta-issue to organize part of the work in CRDB-26691 #98431

Open

knz force-pushed the 20230128-service-mode branch from 0118b30 to e742558 Compare July 11, 2023 17:07

knz marked this pull request as ready for review July 11, 2023 17:14

knz requested review from a team as code owners July 11, 2023 17:14

knz requested review from a team, nkodali, renatolabs and smg260 and removed request for a team July 11, 2023 17:14

knz force-pushed the 20230128-service-mode branch from e742558 to 47bbebf Compare July 11, 2023 17:40

knz requested a review from yuzefovich July 11, 2023 17:50

knz mentioned this pull request Jul 11, 2023

cli,server: log the initiation of shutdown more clearly #106609

Merged

blathers-crl bot mentioned this pull request Jul 12, 2023

release-23.1: cli,server: log the initiation of shutdown more clearly #106665

Merged

knz marked this pull request as draft July 12, 2023 08:04

knz requested a review from a team as a code owner July 23, 2023 15:33

knz force-pushed the 20230128-service-mode branch from 0b0c5d3 to 069e1ba Compare July 23, 2023 15:34

stevendanna reviewed Jul 24, 2023

View reviewed changes

yuzefovich reviewed Jul 24, 2023

View reviewed changes

knz force-pushed the 20230128-service-mode branch from 069e1ba to a94c060 Compare July 25, 2023 11:27

knz requested a review from stevendanna July 25, 2023 11:27

knz commented Jul 25, 2023

View reviewed changes

knz force-pushed the 20230128-service-mode branch 3 times, most recently from 2042382 to ad17aca Compare July 26, 2023 00:55

knz requested a review from a team as a code owner July 26, 2023 00:55

stevendanna approved these changes Jul 26, 2023

View reviewed changes

knz force-pushed the 20230128-service-mode branch from ad17aca to ecc50b7 Compare July 26, 2023 09:58

craig bot merged commit 8c52fea into cockroachdb:master Jul 26, 2023

knz deleted the 20230128-service-mode branch July 26, 2023 13:28

knz mentioned this pull request Aug 11, 2023

server,cli/start: gracefully drain tenant servers upon service disable #108613

Merged

Conversation

knz commented Jan 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Jan 28, 2023

Uh oh!

knz commented Jan 28, 2023

Uh oh!

stevendanna commented Jan 30, 2023

Uh oh!

stevendanna commented Jan 30, 2023

Uh oh!

knz commented Jul 11, 2023

Uh oh!

knz commented Jul 23, 2023

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

stevendanna Jul 24, 2023

Choose a reason for hiding this comment

Uh oh!

stevendanna Jul 24, 2023

Choose a reason for hiding this comment

Uh oh!

stevendanna Jul 24, 2023

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Jul 26, 2023

Uh oh!

knz commented Jul 26, 2023

Uh oh!

craig bot commented Jul 26, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

knz commented Jan 28, 2023 •

edited

Loading