Skip to content

server,cli/start: gracefully drain tenant servers upon service disable#108613

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230811-tenant-drain
Aug 14, 2023
Merged

server,cli/start: gracefully drain tenant servers upon service disable#108613
craig[bot] merged 1 commit intocockroachdb:masterfrom
knz:20230811-tenant-drain

Conversation

@knz
Copy link
Copy Markdown
Contributor

@knz knz commented Aug 11, 2023

First commit from #108612.
Thanks to @stevendanna for brainstorming solutions.

TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via cockroach mt start-sql) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

  • it created a race condition between the server controller (for
    shared-process servers) and the service mode checker task inside the
    SQL server (startCheckService). When the service mode was flipped to
    'NONE', the server controller initiated a graceful drain but then that
    was caught up by the service mode checker and converted to a
    hard shutdown. This particular bug was a regression caused by server: honor and validate the service mode for SQL pods #96144
    and had been identified through flakes in TestServiceShutdownUsesGracefulDrain.

  • it was generally causing shutdowns upon changes in service mode to
    always be run as hard shutdowns (both for external-process and shared-process
    ervers). This is generally undesirable when the data state is still
    READY, as it prevents cleaning up the SQL liveness record and can
    cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
Fixes #107856
Epic: CRDB-26691

@knz knz requested review from a team as code owners August 11, 2023 16:48
@knz knz requested review from DarrylWong and rachitgsrivastava and removed request for a team August 11, 2023 16:48
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@knz knz requested review from stevendanna and yuzefovich August 11, 2023 16:48
@knz knz force-pushed the 20230811-tenant-drain branch 3 times, most recently from bc1d994 to ff8b7dc Compare August 11, 2023 17:05
Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like where this landed and it ended up being simpler than the approach I took.

Do we think we need any sort of timeout on the graceful drain here such that after some amount of time we give up draining gracefully and do a hard shutdown?

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 11, 2023

Do we think we need any sort of timeout on the graceful drain here such that after some amount of time we give up draining gracefully and do a hard shutdown?

I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR.

@knz knz requested a review from stevendanna August 11, 2023 20:13
@knz knz added the db-cy-23 label Aug 11, 2023
@knz knz force-pushed the 20230811-tenant-drain branch from ff8b7dc to 6585818 Compare August 11, 2023 20:15
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'll defer to Steven.

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @knz, @rachitgsrivastava, and @stevendanna)


pkg/server/server_sql.go line 1768 at r2 (raw file):

		return false, errors.AssertionFailedf("service check failed: expected data state %v, record says %v", expected, actual)
	}
	return true, nil

nit: I was a bit confused by this, consider leaving a comment here that given err return argument is nil, useGracefulDrain return argument doesn't matter.

Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ambivalent about it. It could go either way, but I'd like to emphasize that this is orthogonal to the issue that this PR is fixing. If we want to add such a timeout, I'd rather do that in a different PR.

That's fine by me. I'm not sure it is really orthogonal since this PR effectively changes what was previously an immediate shutdown to a graceful shutdown and thus introduces the possibility of the shutdown taking a non-trivial amount of time. But, since this is new functionality that hasn't been exposed to users, there is no existing user expectation.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 14, 2023

I'm not sure it is really orthogonal since this PR effectively changes what was previously an immediate shutdown to a graceful shutdown

I believe that is incorrect. In the v23.1 code (the original version I implemented) the shutdown was graceful.
It only became immediate when I merged #96144, which we are considering a regression (only in the master branch for now). This patch recovers the v23.1 behavior.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 14, 2023

TFYR!

bors r=yuzefovich,stevendanna

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 14, 2023

bors r-

i'll address yahor's nit first

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2023

Canceled.

TLDR: when the service mode of a secondary tenant is changed to NONE,
but the data state is still READY, the tenant servers will now shut
down gracefully. This applies both to tenant servers run as separate
processes (via `cockroach mt start-sql`) and those started via the
internal server controller.

More details follow.

For context, the reader is invited to first refer to the commit
immediately prior.

Prior to this patch, all internal shutdown requests were considered to
be non-graceful. This included the request that was triggered as a
result of a change in the service mode.

This caused two separate bugs:

- it created a race condition between the server controller (for
  shared-process servers) and the service mode checker task inside the
  SQL server (`startCheckService`). When the service mode was flipped to
  'NONE', the server controller initiated a graceful drain but then that
  was caught up by the service mode checker and converted to a
  hard shutdown. This particular bug was a regression caused by cockroachdb#96144
  and had been identified through flakes in `TestServiceShutdownUsesGracefulDrain`.

- it was generally causing shutdowns upon changes in service mode to
  always be run as hard shutdowns (both for external-process and shared-process
  ervers). This is generally undesirable when the data state is still
  READY, as it prevents cleaning up the SQL liveness record and can
  cause delays during subsequent server startups.

This patch fixes it by using a graceful shutdown request whenever
the service mode changes and the data state is still READY.
Since the tenant server now handles graceful drains, the patch
also removes this responsibility from the server controller.

Release note: None
@knz knz force-pushed the 20230811-tenant-drain branch from 6585818 to 103c0a7 Compare August 14, 2023 16:58
Copy link
Copy Markdown
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong and @rachitgsrivastava)


pkg/server/server_sql.go line 1768 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I was a bit confused by this, consider leaving a comment here that given err return argument is nil, useGracefulDrain return argument doesn't matter.

Done.

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Aug 14, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@knz
Copy link
Copy Markdown
Contributor Author

knz commented Aug 14, 2023

bors r=yuzefovich,stevendanna

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2023

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2023

This PR was included in a batch that timed out, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2023

Build succeeded:

@craig craig bot merged commit 96b4a8e into cockroachdb:master Aug 14, 2023
@knz knz mentioned this pull request Sep 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ccl/serverccl: TestServiceShutdownUsesGracefulDrain failed

4 participants