server,cli/start: gracefully drain tenant servers upon service disable by knz · Pull Request #108613 · cockroachdb/cockroach

knz · 2023-08-11T16:48:06Z

First commit from #108612.
Thanks to @stevendanna for brainstorming solutions.

TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via cockroach mt start-sql) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

it created a race condition between the server controller (for
shared-process servers) and the service mode checker task inside the
SQL server (startCheckService). When the service mode was flipped to
'NONE', the server controller initiated a graceful drain but then that
was caught up by the service mode checker and converted to a
hard shutdown. This particular bug was a regression caused by server: honor and validate the service mode for SQL pods #96144
and had been identified through flakes in TestServiceShutdownUsesGracefulDrain.
it was generally causing shutdowns upon changes in service mode to
always be run as hard shutdowns (both for external-process and shared-process
ervers). This is generally undesirable when the data state is still
READY, as it prevents cleaning up the SQL liveness record and can
cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
Fixes #107856
Epic: CRDB-26691

cockroach-teamcity · 2023-08-11T16:48:15Z

This change is

stevendanna

I like where this landed and it ended up being simpler than the approach I took.

Do we think we need any sort of timeout on the graceful drain here such that after some amount of time we give up draining gracefully and do a hard shutdown?

knz · 2023-08-11T20:13:31Z

Do we think we need any sort of timeout on the graceful drain here such that after some amount of time we give up draining gracefully and do a hard shutdown?

I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR.

yuzefovich

LGTM, but I'll defer to Steven.

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @knz, @rachitgsrivastava, and @stevendanna)

pkg/server/server_sql.go line 1768 at r2 (raw file):

		return false, errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual)
	}
	return true, nil

nit: I was a bit confused by this, consider leaving a comment here that given err return argument is nil, useGracefulDrain return argument doesn't matter.

stevendanna

I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR.

That's fine by me. I'm not sure it is really orthogonal since this PR effectively changes what was previously an immediate shutdown to a graceful shutdown and thus introduces the possibility of the shutdown taking a non-trivial amount of time. But, since this is new functionality that hasn't been exposed to users, there is no existing user expectation.

knz · 2023-08-14T11:58:38Z

I'm not sure it is really orthogonal since this PR effectively changes what was previously an immediate shutdown to a graceful shutdown

I believe that is incorrect. In the v23.1 code (the original version I implemented) the shutdown was graceful.
It only became immediate when I merged #96144, which we are considering a regression (only in the master branch for now). This patch recovers the v23.1 behavior.

knz · 2023-08-14T12:30:09Z

TFYR!

bors r=yuzefovich,stevendanna

knz · 2023-08-14T12:32:06Z

bors r-

i'll address yahor's nit first

craig · 2023-08-14T12:32:08Z

Canceled.

TLDR: when the service mode of a secondary tenant is changed to NONE, but the data state is still READY, the tenant servers will now shut down gracefully. This applies both to tenant servers run as separate processes (via `cockroach mt start-sql`) and those started via the internal server controller. More details follow. For context, the reader is invited to first refer to the commit immediately prior. Prior to this patch, all internal shutdown requests were considered to be non-graceful. This included the request that was triggered as a result of a change in the service mode. This caused two separate bugs: - it created a race condition between the server controller (for shared-process servers) and the service mode checker task inside the SQL server (`startCheckService`). When the service mode was flipped to 'NONE', the server controller initiated a graceful drain but then that was caught up by the service mode checker and converted to a hard shutdown. This particular bug was a regression caused by cockroachdb#96144 and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`. - it was generally causing shutdowns upon changes in service mode to always be run as hard shutdowns (both for external-process and shared-process ervers). This is generally undesirable when the data state is still READY, as it prevents cleaning up the SQL liveness record and can cause delays during subsequent server startups. This patch fixes it by using a graceful shutdown request whenever the service mode changes and the data state is still READY. Since the tenant server now handles graceful drains, the patch also removes this responsibility from the server controller. Release note: None

knz

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong and @rachitgsrivastava)

pkg/server/server_sql.go line 1768 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I was a bit confused by this, consider leaving a comment here that given err return argument is nil, useGracefulDrain return argument doesn't matter.

Done.

blathers-crl · 2023-08-14T16:58:55Z

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.}

knz · 2023-08-14T16:59:02Z

bors r=yuzefovich,stevendanna

craig · 2023-08-14T17:32:23Z

This PR was included in a batch that was canceled, it will be automatically retried

craig · 2023-08-14T20:31:33Z

This PR was included in a batch that timed out, it will be automatically retried

craig · 2023-08-14T22:02:25Z

Build succeeded:

Bazel Essential CI (Cockroach)

knz requested review from a team as code owners August 11, 2023 16:48

knz requested review from DarrylWong and rachitgsrivastava and removed request for a team August 11, 2023 16:48

knz requested review from stevendanna and yuzefovich August 11, 2023 16:48

knz force-pushed the 20230811-tenant-drain branch 3 times, most recently from bc1d994 to ff8b7dc Compare August 11, 2023 17:05

stevendanna reviewed Aug 11, 2023

View reviewed changes

knz requested a review from stevendanna August 11, 2023 20:13

knz added the db-cy-23 label Aug 11, 2023

knz force-pushed the 20230811-tenant-drain branch from ff8b7dc to 6585818 Compare August 11, 2023 20:15

yuzefovich reviewed Aug 12, 2023

View reviewed changes

stevendanna approved these changes Aug 14, 2023

View reviewed changes

knz force-pushed the 20230811-tenant-drain branch from 6585818 to 103c0a7 Compare August 14, 2023 16:58

knz commented Aug 14, 2023

View reviewed changes

craig bot merged commit 96b4a8e into cockroachdb:master Aug 14, 2023

knz mentioned this pull request Sep 20, 2023

tenant startup hang #108209

Closed

Conversation

knz commented Aug 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Aug 11, 2023

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Aug 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Aug 14, 2023

Uh oh!

knz commented Aug 14, 2023

Uh oh!

knz commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

knz left a comment

Choose a reason for hiding this comment

Uh oh!

blathers-crl bot commented Aug 14, 2023

Uh oh!

knz commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

knz commented Aug 11, 2023 •

edited

Loading

knz commented Aug 11, 2023 •

edited

Loading