server: honor and validate the service mode for SQL pods#96144
server: honor and validate the service mode for SQL pods#96144craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
@stevendanna @arulajmani could you have a look at this PR before we invest in tests. Currently it uses a unary TenantServiceCheck which is called periodically. I think this would be better served eventually by a streaming RPC that hooks into a rangefeed over the tenant record. Do we want to do this right away? Should we go the way of this PR now, and go for the streaming RPC later? For the rangefeed approach, I think I will need some help. I feel that this should be integrated with the capability watched somehow but I'm not sure about the APIs. |
44b6704 to
353fa43
Compare
|
I think we'll probably need to rewrite some streaming tests for this. I can help with that this week. |
I'm OK with this simple approach for now. Mostly because I would like to see us think about the tenant process lifecycle a bit more before dedicating ourselves to a design. I have questions such as whether service mode is the right abstraction in the long run. If so, do we need some system that avoids a situation in which a change in service mode doesn't allow for external and shared process sql servers running concurrently? Etc. These feel like longer run questions that we don't need to answer today to make some progress. |
353fa43 to
0118b30
Compare
0118b30 to
e742558
Compare
e742558 to
47bbebf
Compare
|
RFAL
The new version of the PR answers this question: now that the service mode is pushed by KV to the SQL service, if we ever desire to support mixed deployments we can have KV servers push different values to different clients. |
106609: cli,server: log the initiation of shutdown more clearly r=yuzefovich a=knz Epic: CRDB-26691 Helped debug issues with #96144. Prior to this patch, the error object that triggers spontaneous shutdown and flows through the `stopTrigger` was only kept in RAM and printed at the very tail end of server shutdown (when the CLI code was exiting). This patch ensures it is logged as soon as the shutdown is triggered. It makes debugging slightly easier. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
|
@stevendanna @yuzefovich RFAL. (Before this can merge, we will need to do something about the 3s startup overhead. In this PR, this overhead is now incurred for all uses of the test tenant, not just those that use nodelocal.) |
0b0c5d3 to
069e1ba
Compare
stevendanna
left a comment
There was a problem hiding this comment.
Overall this seems like a reasonable step forward to me.
We should double check we are prepared to have this running in Serverless.
| if s.tenantConnect == nil { | ||
| return errors.AssertionFailedf("programming error: can only check service with a tenant connector") | ||
| } |
There was a problem hiding this comment.
Here and elsewhere could be an opportunity to adopt the new must library.
pkg/server/server_sql.go
Outdated
| if !entry.Ready() { | ||
| // At the version this check was introduced, the server was | ||
| // already modified to provide metadata during the initial | ||
| // handshake. | ||
| // | ||
| // If we don't have the metadata here, this means that the | ||
| // connector is talking to an older-version server. | ||
| return errors.AssertionFailedf("storage layer did not communicate metadata, is it running an older binary version than the client?") | ||
| } | ||
|
|
||
| if expected, actual := s.serviceMode, entry.ServiceMode; expected != actual { | ||
| return errors.Newf("service check failed: expected service mode %v, record says %v", expected, actual) | ||
| } | ||
| // Extra sanity check. This should never happen (we should enforce | ||
| // a valid data state when the service mode is not NONE) but it's | ||
| // cheap to check. | ||
| if expected, actual := mtinfopb.DataStateReady, entry.DataState; expected != actual { | ||
| return errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual) | ||
| } | ||
| return nil |
There was a problem hiding this comment.
Minor, but I could see moving this to a s.checkServerModeMatchesEntry(entry)
| if _, err := ts.InternalExecutor().(*sql.InternalExecutor).Exec( | ||
| ctx, "testserver-set-tenant-service-mode", nil, /* txn */ | ||
| "ALTER TENANT [$1] START SERVICE EXTERNAL", | ||
| params.TenantID.ToUint64(), | ||
| ); err != nil { |
There was a problem hiding this comment.
Should we add a check for the service state into SkipTenantCheck below?
yuzefovich
left a comment
There was a problem hiding this comment.
LGTM, but I'll defer to Steven for approval.
Reviewed 1 of 5 files at r9, 46 of 46 files at r13, 41 of 41 files at r14, 41 of 41 files at r15, 45 of 45 files at r16, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @knz, @nkodali, @renatolabs, @smg260, and @stevendanna)
-- commits line 32 at r16:
nit: something is off in "only accept to start".
pkg/server/server_sql.go line 1702 at r16 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Here and elsewhere could be an opportunity to adopt the new
mustlibrary.
I'm guessing that we'll want to backport this change (right?), and must library won't be available on 23.1 (it's not available on master yet either).
069e1ba to
a94c060
Compare
knz
left a comment
There was a problem hiding this comment.
Reviewed 1 of 14 files at r1, 1 of 46 files at r13.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @nkodali, @renatolabs, @smg260, @stevendanna, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: something is off in "only accept to start".
Adjusted.
pkg/server/server_sql.go line 1702 at r16 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm guessing that we'll want to backport this change (right?), and
mustlibrary won't be available on 23.1 (it's not available on master yet either).
I'm ok with changing the assertion code during an eventual backport, but yahor is right and must hasn't merged yet.
pkg/server/server_sql.go line 1728 at r16 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Minor, but I could see moving this to a
s.checkServerModeMatchesEntry(entry)
Done.
pkg/server/testserver.go line 1086 at r16 (raw file):
Previously, stevendanna (Steven Danna) wrote…
Should we add a check for the service state into SkipTenantCheck below?
It seems unnecessary, since the error you get if it's not right is pretty clear about what's going on.
But maybe you'd like to suggest, instead of a check, that StartTenant should proactively run ALTER TENANT START SERVICE EXTERNAL, perhaps unless a testing knob is set?
2042382 to
ad17aca
Compare
Internal conversation here. @stevendanna it seems to me we are in the clear. WDYT? |
TLDR: SQL servers now start successfully only if their data state is
"ready" and their deployment type (e.g. separate process) matches the
configured service mode in the tenant record. Additionally, the SQL
server spontaneously terminates if their service mode or data state is
not valid any more (e.g as a result of DROP TENANT or ALTER TENANT
STOP SERVICE).
----
Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):
- if a SQL server was started against a non-existent tenant,
an obscure error would be raised (`database "[1]" does not exist`)
- if a SQL server was started while a tenant was being added
(keyspace not yet valid), no check would be performed and
data corruption could ensue.
- if the tenant record/keyspace was dropped while the SQL server was
running, SQL clients would start encountering obscure errors.
This commit fixes the situation by checking the tenant metadata:
- once, during SQL server startup, at which point server startup
is prevented if the service check fails;
- then, asynchronously, whenever the metadata is updated, such that
any service check failure results in a graceful shutdown of the SQL
service.
The check proper validates:
- that the tenant record exists;
- that the data state is "ready";
- that the configured service mode matches that requested by the SQL
server.
Examples output upon error:
- non-existent tenant:
```
tenant service check failed: missing record
```
- attempting to start separate-process server while
tenant is running as shared-process:
```
tenant service check failed: service mode check failed: expected external, record says shared
```
- after ALTER TENANT STOP SERVICE:
```
tenant service check failed: service mode check failed: expected external, record says none
```
- after DROP TENANT:
```
tenant service check failed: service mode check failed: expected external, record says dropping
```
Release note: None
ad17aca to
ecc50b7
Compare
|
TFYR bors r=yuzefovich,stevendanna |
|
Build succeeded: |
TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by cockroachdb#96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None
TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by cockroachdb#96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None
108613: server,cli/start: gracefully drain tenant servers upon service disable r=yuzefovich,stevendanna a=knz First commit from #108612. Thanks to `@stevendanna` for brainstorming solutions. TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by #96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None Fixes #107856 Epic: CRDB-26691 108681: sql/resolver: rename AvoidLeased to AssertNotLeased r=fqazi a=fqazi Previously, we had a flag to avoid leased descriptors, which did not do anything within look-up contexts. This was because this flag was never really used since skipping leased descriptors happens at a higher level with flags on the resolver. To address this, this the patch will make this flag more useful by making it an assertion that no leased descriptor is being used in these contexts. Release note: None 108723: serverctl: fix a docstring r=yuzefovich a=knz Requested by `@yuzefovich` in #108612 (review) Release note: None Epic: CRDB-26691 Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Rebased on top of #105441.
Fixes #93145.
Fixes #83650.
Epic: CRDB-26691
TLDR: this makes SQL servers only accept to start if their data state
is "ready" and their deployment type (e.g. separate process) matches
the configured service mode in the tenant record.
Additionally, the SQL server spontaneously terminates if
their service mode or data state is not valid any more (e.g
as a result of DROP TENANT or ALTER TENANT STOP SERVICE).
Prior to this patch, there wasn't a good story about the lifecycle
of separate-process SQL servers ("SQL pods"):
an obscure error would be raised (
database "[1]" does not exist)(keyspace not yet valid), no check would be performed and
data corruption could ensue.
running, SQL clients would start encountering obscure errors.
This commit fixes the situation by checking the tenant metadata:
is prevented if the service check fails;
any service check failure results in a graceful shutdown of the SQL
service.
The check proper validates:
server.
Examples output upon error:
non-existent tenant:
attempting to start separate-process server while
tenant is running as shared-process:
after ALTER TENANT STOP SERVICE:
after DROP TENANT:
Release note: None