server,cli/start: gracefully drain tenant servers upon service disable#108613
server,cli/start: gracefully drain tenant servers upon service disable#108613craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
bc1d994 to
ff8b7dc
Compare
stevendanna
left a comment
There was a problem hiding this comment.
I like where this landed and it ended up being simpler than the approach I took.
Do we think we need any sort of timeout on the graceful drain here such that after some amount of time we give up draining gracefully and do a hard shutdown?
I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR. |
ff8b7dc to
6585818
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
LGTM, but I'll defer to Steven.
Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @knz, @rachitgsrivastava, and @stevendanna)
pkg/server/server_sql.go line 1768 at r2 (raw file):
return false, errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual) } return true, nil
nit: I was a bit confused by this, consider leaving a comment here that given err return argument is nil, useGracefulDrain return argument doesn't matter.
stevendanna
left a comment
There was a problem hiding this comment.
I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR.
That's fine by me. I'm not sure it is really orthogonal since this PR effectively changes what was previously an immediate shutdown to a graceful shutdown and thus introduces the possibility of the shutdown taking a non-trivial amount of time. But, since this is new functionality that hasn't been exposed to users, there is no existing user expectation.
I believe that is incorrect. In the v23.1 code (the original version I implemented) the shutdown was graceful. |
|
TFYR! bors r=yuzefovich,stevendanna |
|
bors r- i'll address yahor's nit first |
|
Canceled. |
TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by cockroachdb#96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None
6585818 to
103c0a7
Compare
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong and @rachitgsrivastava)
pkg/server/server_sql.go line 1768 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I was a bit confused by this, consider leaving a comment here that given
errreturn argument isnil,useGracefulDrainreturn argument doesn't matter.
Done.
|
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
|
bors r=yuzefovich,stevendanna |
|
This PR was included in a batch that was canceled, it will be automatically retried |
|
This PR was included in a batch that timed out, it will be automatically retried |
|
Build succeeded: |
First commit from #108612.
Thanks to @stevendanna for brainstorming solutions.
TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via
cockroach mt start-sql) and those started via theinternal server controller.
More details follow.
For context, the reader is invited to first refer to the commit
immediately prior.
Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.
This caused two separate bugs:
it created a race condition between the server controller (for
shared-process servers) and the service mode checker task inside the
SQL server (
startCheckService). When the service mode was flipped to'NONE', the server controller initiated a graceful drain but then that
was caught up by the service mode checker and converted to a
hard shutdown. This particular bug was a regression caused by server: honor and validate the service mode for SQL pods #96144
and had been identified through flakes in
TestServiceShutdownUsesGracefulDrain.it was generally causing shutdowns upon changes in service mode to
always be run as hard shutdowns (both for external-process and shared-process
ervers). This is generally undesirable when the data state is still
READY, as it prevents cleaning up the SQL liveness record and can
cause delays during subsequent server startups.
This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.
Release note: None
Fixes #107856
Epic: CRDB-26691