
server: avoid missing service mode changes #112295

Merged
craig[bot] merged 2 commits into cockroachdb:master from stevendanna:fix-112077
Oct 16, 2023

Conversation

@stevendanna
Collaborator

In #112001 we introduced a bug and an unintended behaviour change.

The bug is that if we receive a notification of a state change from
none to shared when the server is still shutting down, that state
change will be ignored. Namely, the following can happen:

  1. ALTER VIRTUAL CLUSTER a STOP SERVICE
  2. Watcher gets notification of shutdown and notifies virtual
    cluster's SQL server.
  3. Tenant "a" starts shutdown but does not fully complete it.
  4. ALTER VIRTUAL CLUSTER a START SERVICE SHARED
  5. Watcher notifies the server orchestrator; but, since the SQL server has
    not finished stopping from the previous stop request, it appears as if
    it is already started.
  6. Tenant "a" finishes shutdown.
  7. Server orchestrator never again tries to start the virtual cluster.

The newly added test reveals this under stress.

The behaviour change is that, previously, if a SQL server for a virtual
cluster failed to start up, it would be restarted.

Here, we fix both of these by re-introducing a periodic polling of the
service state. Unlike the previous polling, we poll the watcher state
so we are not generating a SQL query every second.

Further, since we are now calling the tenantcapabilities watcher's
GetAllTenants method every second, in addition to on every update, I've
moved the allocation of the list of all tenants into our handle-update
call.

An alternative here would be to revert #112001 completely. I think
there are still advantages to using the watcher: not generating a SQL
query on every node once per second, and more responsive server startup
after the integration of #112094.

Fixes #112077

Release note: None

@stevendanna stevendanna requested review from a team as code owners October 13, 2023 12:05
@stevendanna stevendanna requested review from adityamaru and removed request for a team October 13, 2023 12:05
@cockroach-teamcity
Member

This change is Reviewable

@stevendanna stevendanna added the backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only label Oct 13, 2023
@stevendanna
Collaborator Author

First commit is #112166

Contributor

@adityamaru adityamaru left a comment


This LGTM, but this is my first time reading this code so let's wait for Yahor too to give this a thumb

Member

@yuzefovich yuzefovich left a comment


Seems reasonable to me too, :lgtm:

Reviewed 2 of 2 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)


-- commits line 26 at r2:
nit: "tenant 1" and VC "a" is the same virtual cluster, right?


-- commits line 50 at r2:
nit: incomplete sentence.


pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 53 at r2 (raw file):

		allTenants []tenantcapabilities.Entry

		// anyChangeChs is closed on any change to the set of

nit: s/anyChangeChs/anyChangeCh/.


pkg/server/server_controller_test.go line 75 at r2 (raw file):

}

// TestServerControllStopStart is, when run under stress, a regression

nit: s/TestServerControllStopStart/TestServerControllerStopStart/.

Currently, tenant servers are slow to shut down after a STOP SERVICE.
This can cause flakes because (1) it is unsafe to revert a tenant that
is still writing and (2) the test assumes it can connect anew after
bringing the tenant back online, but the tenant might still be
draining.

Epic: none

Release note: None
@stevendanna
Collaborator Author

bors r=adityamaru,yuzefovich

@craig
Contributor

craig bot commented Oct 16, 2023

Build succeeded:

@craig craig bot merged commit c8df48c into cockroachdb:master Oct 16, 2023
@blathers-crl

blathers-crl bot commented Oct 16, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from a5e6b80 to blathers/backport-release-23.2-112295: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.



Development

Successfully merging this pull request may close these issues.

server: server orchestrator can fail to start tenant

4 participants