Skip to content

server: honor and validate the service mode for SQL pods#96144

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230128-service-mode
Jul 26, 2023
Merged

server: honor and validate the service mode for SQL pods#96144
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230128-service-mode

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Jan 28, 2023

Rebased on top of #105441.
Fixes #93145.
Fixes #83650.

Epic: CRDB-26691

TLDR: this makes SQL servers only accept to start if their data state
is "ready" and their deployment type (e.g. separate process) matches
the configured service mode in the tenant record.
Additionally, the SQL server spontaneously terminates if
their service mode or data state is not valid any more (e.g
as a result of DROP TENANT or ALTER TENANT STOP SERVICE).


Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):

  • if a SQL server was started against a non-existent tenant,
    an obscure error would be raised (database "[1]" does not exist)
  • if a SQL server was started while a tenant was being added
    (keyspace not yet valid), no check would be performed and
    data corruption could ensue.
  • if the tenant record/keyspace was dropped while the SQL server was
    running, SQL clients would start encountering obscure errors.

This commit fixes the situation by checking the tenant metadata:

  • once, during SQL server startup, at which point server startup
    is prevented if the service check fails;
  • then, asynchronously, whenever the metadata is updated, such that
    any service check failure results in a graceful shutdown of the SQL
    service.

The check proper validates:

  • that the tenant record exists;
  • that the data state is "ready";
  • that the configured service mode matches that requested by the SQL
    server.

Examples output upon error:

  • non-existent tenant:

    tenant service check failed: missing record
    
  • attempting to start separate-process server while
    tenant is running as shared-process:

    tenant service check failed: service mode check failed: expected external, record says shared
    
  • after ALTER TENANT STOP SERVICE:

    tenant service check failed: service mode check failed: expected external, record says none
    
  • after DROP TENANT:

    tenant service check failed: service mode check failed: expected external, record says dropping
    

Release note: None

@knz knz requested a review from stevendanna January 28, 2023 18:38
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jan 28, 2023

@stevendanna @arulajmani could you have a look at this PR before we invest in tests.

Currently it uses a unary TenantServiceCheck which is called periodically. I think this would be better served eventually by a streaming RPC that hooks into a rangefeed over the tenant record. Do we want to do this right away? Should we go the way of this PR now, and go for the streaming RPC later?

For the rangefeed approach, I think I will need some help. I feel that this should be integrated with the capability watched somehow but I'm not sure about the APIs.

@knz knz force-pushed the 20230128-service-mode branch 3 times, most recently from 44b6704 to 353fa43 Compare January 28, 2023 18:47
@knz knz added the A-multitenancy Related to multi-tenancy label Jan 28, 2023
@stevendanna
Copy link
Copy Markdown
Collaborator

I think we'll probably need to rewrite some streaming tests for this. I can help with that this week.

@stevendanna
Copy link
Copy Markdown
Collaborator

I think this would be better served eventually by a streaming RPC that hooks into a rangefeed over the tenant record. Do we want to do this right away? Should we go the way of this PR now, and go for the streaming RPC later?

I'm OK with this simple approach for now. Mostly because I would like to see us think about the tenant process lifecycle a bit more before dedicating ourselves to a design. I have questions such as whether service mode is the right abstraction in the long run. If so, do we need some system that avoids a situation in which a change in service mode doesn't allow for external and shared process sql servers running concurrently? Etc. These feel like longer run questions that we don't need to answer today to make some progress.

@knz knz force-pushed the 20230128-service-mode branch from 353fa43 to 0118b30 Compare January 31, 2023 17:49
@knz knz force-pushed the 20230128-service-mode branch from 0118b30 to e742558 Compare July 11, 2023 17:07
@knz knz marked this pull request as ready for review July 11, 2023 17:14
@knz knz requested review from a team as code owners July 11, 2023 17:14
@knz knz requested review from a team, nkodali, renatolabs and smg260 and removed request for a team July 11, 2023 17:14
@knz knz force-pushed the 20230128-service-mode branch from e742558 to 47bbebf Compare July 11, 2023 17:40
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 11, 2023

RFAL

do we need some system that avoids a situation in which a change in service mode doesn't allow for external and shared process sql servers running concurrently?

The new version of the PR answers this question: now that the service mode is pushed by KV to the SQL service, if we ever desire to support mixed deployments we can have KV servers push different values to different clients.

@knz knz requested a review from yuzefovich July 11, 2023 17:50
craig bot pushed a commit that referenced this pull request Jul 12, 2023
106609: cli,server: log the initiation of shutdown more clearly r=yuzefovich a=knz

Epic: CRDB-26691
Helped debug issues with #96144.

Prior to this patch, the error object that triggers spontaneous shutdown and flows through the `stopTrigger` was only kept in RAM and printed at the very tail end of server shutdown (when the CLI code was exiting).

This patch ensures it is logged as soon as the shutdown is triggered. It makes debugging slightly easier.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
@knz knz marked this pull request as draft July 12, 2023 08:04
@knz knz requested a review from a team as a code owner July 23, 2023 15:33
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 23, 2023

@stevendanna @yuzefovich RFAL.

(Before this can merge, we will need to do something about the 3s startup overhead. In this PR, this overhead is now incurred for all uses of the test tenant, not just those that use nodelocal.)

@knz knz force-pushed the 20230128-service-mode branch from 0b0c5d3 to 069e1ba Compare July 23, 2023 15:34
Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this seems like a reasonable step forward to me.

We should double check we are prepared to have this running in Serverless.

Comment on lines +1700 to +1702
if s.tenantConnect == nil {
return errors.AssertionFailedf("programming error: can only check service with a tenant connector")
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere could be an opportunity to adopt the new must library.

Comment on lines +1709 to +1728
if !entry.Ready() {
// At the version this check was introduced, the server was
// already modified to provide metadata during the initial
// handshake.
//
// If we don't have the metadata here, this means that the
// connector is talking to an older-version server.
return errors.AssertionFailedf("storage layer did not communicate metadata, is it running an older binary version than the client?")
}

if expected, actual := s.serviceMode, entry.ServiceMode; expected != actual {
return errors.Newf("service check failed: expected service mode %v, record says %v", expected, actual)
}
// Extra sanity check. This should never happen (we should enforce
// a valid data state when the service mode is not NONE) but it's
// cheap to check.
if expected, actual := mtinfopb.DataStateReady, entry.DataState; expected != actual {
return errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual)
}
return nil
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, but I could see moving this to a s.checkServerModeMatchesEntry(entry)

Comment on lines +1082 to +1097
if _, err := ts.InternalExecutor().(*sql.InternalExecutor).Exec(
ctx, "testserver-set-tenant-service-mode", nil, /* txn */
"ALTER TENANT [$1] START SERVICE EXTERNAL",
params.TenantID.ToUint64(),
); err != nil {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a check for the service state into SkipTenantCheck below?

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'll defer to Steven for approval.

Reviewed 1 of 5 files at r9, 46 of 46 files at r13, 41 of 41 files at r14, 41 of 41 files at r15, 45 of 45 files at r16, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @smg260, and @stevendanna)


-- commits line 32 at r16:
nit: something is off in "only accept to start".


pkg/server/server_sql.go line 1702 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Here and elsewhere could be an opportunity to adopt the new must library.

I'm guessing that we'll want to backport this change (right?), and must library won't be available on 23.1 (it's not available on master yet either).

@knz knz force-pushed the 20230128-service-mode branch from 069e1ba to a94c060 Compare July 25, 2023 11:27
@knz knz requested a review from stevendanna July 25, 2023 11:27
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 14 files at r1, 1 of 46 files at r13.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @smg260, @stevendanna, and @yuzefovich)


-- commits line 32 at r16:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: something is off in "only accept to start".

Adjusted.


pkg/server/server_sql.go line 1702 at r16 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm guessing that we'll want to backport this change (right?), and must library won't be available on 23.1 (it's not available on master yet either).

I'm ok with changing the assertion code during an eventual backport, but yahor is right and must hasn't merged yet.


pkg/server/server_sql.go line 1728 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Minor, but I could see moving this to a s.checkServerModeMatchesEntry(entry)

Done.


pkg/server/testserver.go line 1086 at r16 (raw file):

Previously, stevendanna (Steven Danna) wrote…

Should we add a check for the service state into SkipTenantCheck below?

It seems unnecessary, since the error you get if it's not right is pretty clear about what's going on.

But maybe you'd like to suggest, instead of a check, that StartTenant should proactively run ALTER TENANT START SERVICE EXTERNAL, perhaps unless a testing knob is set?

@knz knz force-pushed the 20230128-service-mode branch 3 times, most recently from 2042382 to ad17aca Compare July 26, 2023 00:55
@knz knz requested a review from a team as a code owner July 26, 2023 00:55
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 26, 2023

We should double check we are prepared to have this running in Serverless.

Internal conversation here.

@stevendanna it seems to me we are in the clear. WDYT?

TLDR: SQL servers now start successfully only if their data state is
"ready" and their deployment type (e.g. separate process) matches the
configured service mode in the tenant record. Additionally, the SQL
server spontaneously terminates if their service mode or data state is
not valid any more (e.g as a result of DROP TENANT or ALTER TENANT
STOP SERVICE).

----

Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):

- if a SQL server was started against a non-existent tenant,
  an obscure error would be raised (`database "[1]" does not exist`)
- if a SQL server was started while a tenant was being added
  (keyspace not yet valid), no check would be performed and
  data corruption could ensue.
- if the tenant record/keyspace was dropped while the SQL server was
  running, SQL clients would start encountering obscure errors.

This commit fixes the situation by checking the tenant metadata:

- once, during SQL server startup, at which point server startup
  is prevented if the service check fails;
- then, asynchronously, whenever the metadata is updated, such that
  any service check failure results in a graceful shutdown of the SQL
  service.

The check proper validates:
- that the tenant record exists;
- that the data state is "ready";
- that the configured service mode matches that requested by the SQL
  server.

Examples output upon error:

- non-existent tenant:
  ```
  tenant service check failed: missing record
  ```

- attempting to start separate-process server while
  tenant is running as shared-process:
  ```
  tenant service check failed: service mode check failed: expected external, record says shared
  ```

- after ALTER TENANT STOP SERVICE:
  ```
  tenant service check failed: service mode check failed: expected external, record says none
  ```

- after DROP TENANT:
  ```
  tenant service check failed: service mode check failed: expected external, record says dropping
  ```

Release note: None
@knz knz force-pushed the 20230128-service-mode branch from ad17aca to ecc50b7 Compare July 26, 2023 09:58
@knz
Copy link
Copy Markdown
Contributor Author

knz commented Jul 26, 2023

TFYR

bors r=yuzefovich,stevendanna

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 26, 2023

Build succeeded:

@craig craig bot merged commit 8c52fea into cockroachdb:master Jul 26, 2023
@knz knz deleted the 20230128-service-mode branch July 26, 2023 13:28
knz added a commit to knz/cockroach that referenced this pull request Aug 11, 2023
TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via `cockroach mt start-sql`) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

- it created a race condition between the server controller (for
  shared-process servers) and the service mode checker task inside the
  SQL server (`startCheckService`). When the service mode was flipped to
  'NONE', the server controller initiated a graceful drain but then that
  was caught up by the service mode checker and converted to a
  hard shutdown. This particular bug was a regression caused by cockroachdb#96144
  and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`.

- it was generally causing shutdowns upon changes in service mode to
  always be run as hard shutdowns (both for external-process and shared-process
  ervers). This is generally undesirable when the data state is still
  READY, as it prevents cleaning up the SQL liveness record and can
  cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
knz added a commit to knz/cockroach that referenced this pull request Aug 14, 2023
TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via `cockroach mt start-sql`) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

- it created a race condition between the server controller (for
  shared-process servers) and the service mode checker task inside the
  SQL server (`startCheckService`). When the service mode was flipped to
  'NONE', the server controller initiated a graceful drain but then that
  was caught up by the service mode checker and converted to a
  hard shutdown. This particular bug was a regression caused by cockroachdb#96144
  and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`.

- it was generally causing shutdowns upon changes in service mode to
  always be run as hard shutdowns (both for external-process and shared-process
  ervers). This is generally undesirable when the data state is still
  READY, as it prevents cleaning up the SQL liveness record and can
  cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 14, 2023
108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz

First commit from #108612.
Thanks to `@stevendanna` for brainstorming solutions.

TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via `cockroach mt start-sql`) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

- it created a race condition between the server controller (for
  shared-process servers) and the service mode checker task inside the
  SQL server (`startCheckService`). When the service mode was flipped to
  'NONE', the server controller initiated a graceful drain but then that
  was caught up by the service mode checker and converted to a
  hard shutdown. This particular bug was a regression caused by #96144
  and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`.

- it was generally causing shutdowns upon changes in service mode to
  always be run as hard shutdowns (both for external-process and shared-process
  ervers). This is generally undesirable when the data state is still
  READY, as it prevents cleaning up the SQL liveness record and can
  cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
Fixes #107856
Epic: CRDB-26691

108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi

Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this
the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts.

Release note: None

108723: serverctl: fix a docstring r=yuzefovich a=knz

Requested by `@yuzefovich` in #108612 (review)

Release note: None
Epic: CRDB-26691

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-multitenancy Related to multi-tenancy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

server: blocked mixed-style deployments [CRDB-14537 followup] sql: connecting to an inactive tenant should return an error

4 participants