
server: avoid missing service mode changes #112295

Merged
craig[bot] merged 2 commits into cockroachdb:master from stevendanna:fix-112077
Oct 16, 2023

Conversation

@stevendanna
Collaborator

In #112001 we introduced a bug and an unintended behaviour change.

The bug is that if we receive a notification of a state change from
none to shared when the server is still shutting down, that state
change will be ignored. Namely, the following can happen:

  1. ALTER VIRTUAL CLUSTER a STOP SERVICE
  2. Watcher gets notification of shutdown and notifies virtual
    cluster's SQL server.
  3. Tenant "a" starts shutdown but does not fully complete it.
  4. ALTER VIRTUAL CLUSTER a START SERVICE SHARED
  5. Watcher notifies the server orchestrator; but, since the SQL server has
    not finished stopping from the previous stop request, it appears as if
    it is already started.
  6. Tenant "a" finishes shutdown.
  7. Server orchestrator never again tries to start the virtual cluster.

The newly added test reveals this under stress.

The behaviour change is that, previously, if a SQL server for a virtual
cluster failed to start up, it would be restarted.

Here, we fix both of these by re-introducing a periodic polling of the
service state. Unlike the previous polling, we poll the watcher state
so we are not generating a SQL query every second.

Further, since we are now calling the tenantcapabilities watcher's
GetAllTenants method every second, in addition to on every update, I've
moved the allocation of the list of all tenants into our handle-update
call.

An alternative here would be to revert #112001 completely. I think
there are still advantages to using the watcher: not generating a SQL
query on every node once per second, and more responsive server startup
after the integration of #112094.

Fixes #112077

Release note: None

@stevendanna stevendanna requested review from a team as code owners October 13, 2023 12:05
@stevendanna stevendanna requested review from adityamaru and removed request for a team October 13, 2023 12:05
@cockroach-teamcity
Member

This change is Reviewable

@stevendanna stevendanna added the backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only label Oct 13, 2023
@stevendanna
Collaborator Author

First commit is #112166

Contributor

@adityamaru adityamaru left a comment


This LGTM, but this is my first time reading this code so let's wait for Yahor too to give this a thumb

Member

@yuzefovich yuzefovich left a comment


Seems reasonable to me too, :lgtm:

Reviewed 2 of 2 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


-- commits line 26 at r2:
nit: "tenant 1" and VC "a" is the same virtual cluster, right?


-- commits line 50 at r2:
nit: incomplete sentence.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 53 at r2 (raw file):

		allTenants []tenantcapabilities.Entry

		// anyChangeChs is closed on any change to the set of

nit: s/anyChangeChs/anyChangeCh/.


pkg/server/server_controller_test.go line 75 at r2 (raw file):

}

// TestServerControllStopStart is, when run under stress, a regression

nit: s/TestServerControllStopStart/TestServerControllerStopStart/.

Currently, tenant servers are slow to shut down after a STOP SERVICE.
This can cause flakes because (1) it is unsafe to revert a tenant that
is still writing and (2) the test assumes it can connect anew after
bringing the tenant back online, but the tenant might still be
draining.

Epic: none

Release note: None
@stevendanna
Collaborator Author

bors r=adityamaru,yuzefovich

@craig
Contributor

craig bot commented Oct 16, 2023

Build succeeded:

@craig craig bot merged commit c8df48c into cockroachdb:master Oct 16, 2023
@blathers-crl

blathers-crl bot commented Oct 16, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from a5e6b80 to blathers/backport-release-23.2-112295: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.



Development

Successfully merging this pull request may close these issues.

server: server orchestrator can fail to start tenant

4 participants