server: avoid missing service mode changes #112295
server: avoid missing service mode changes #112295craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
|
First commit is #112166 |
adityamaru
left a comment
There was a problem hiding this comment.
This LGTM, but this is my first time reading this code so let's wait for Yahor too to give this a thumb
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r1, 4 of 4 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @stevendanna)
-- commits line 26 at r2:
nit: "tenant 1" and VC "a" is the same virtual cluster, right?
-- commits line 50 at r2:
nit: incomplete sentence.
pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/watcher.go line 53 at r2 (raw file):
allTenants []tenantcapabilities.Entry // anyChangeChs is closed on any change to the set of
nit: s/anyChangeChs/anyChangeCh/.
pkg/server/server_controller_test.go line 75 at r2 (raw file):
} // TestServerControllStopStart is, when run under stress, a regression
nit: s/TestServerControllStopStart/TestServerControllerStopStart/.
Currently, tenant servers are slow to shut down after a STOP SERVICE. This can cause flakes because (1) it is unsafe to revert a tenant that is still writing and (2) the test assumes it can connect anew after bringing the tenant back online, but the tenant might still be draining. Epic: none Release note: None
In cockroachdb#112001 we introduced a bug and an unintended behaviour change. The bug is that if we receive a notification of a state change from none to shared when the server is still shutting down, that state change will be ignored. Namely, the following can happen: 1. ALTER VIRTUAL CLUSTER a STOP SERVICE 2. Watcher gets notification of shutdown and notifies virtual cluster's SQL server. 3. Tenant "a" starts shutdown but does not fully complete it 4. ALTER VIRTUAL CLUSTER a START SERVICE SHARED 5. Watcher notifies the server orchestrator; but, since the SQL server has not finished stopping from the previous stop request, it appears as if it is already started. 6. Tenant "a" finishes shutdown. 7. Server orchestrator never again tries to start the virtual cluster. The newly added test reveals this under stress. The behaviour change is that previously if a SQL server for a virtual cluster failed to start up, it would previously be restarted. Here, we fix both of these by re-introducing a periodic polling of the service state. Unlike the previous polling, we poll the watcher state so we are not generating a SQL query every second. Further, since we are now calling the tenantcapabailities watcher GetAllTenants method every second in addition to on every update, I've moved where we allocate the list of all tenants to our handle update call. An alternative here would be to revert cockroachdb#112001 completely. I think there are still advantage to using the watcher: not generating a SQL query on every node once per second and more responsive server startup after the integration of cockroachdb#112094. Fixes cockroachdb#112077 Release note: None
9ca95ae to
13ddea3
Compare
|
bors r=adityamaru,yuzefovich |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from a5e6b80 to blathers/backport-release-23.2-112295: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
In #112001 we introduced a bug and an unintended behaviour change.
The bug is that if we receive a notification of a state change from
none to shared when the server is still shutting down, that state
change will be ignored. Namely, the following can happen:
cluster's SQL server.
not finished stopping from the previous stop request, it appears as if
it is already started.
The newly added test reveals this under stress.
The behaviour change is that previously if a SQL server for a virtual
cluster failed to start up, it would previously be restarted.
Here, we fix both of these by re-introducing a periodic polling of the
service state. Unlike the previous polling, we poll the watcher state
so we are not generating a SQL query every second.
Further, since we are now calling the tenantcapabailities watcher
GetAllTenants method every second in addition to on every update, I've
moved where we allocate the list of all tenants to our handle update
call.
An alternative here would be to revert #112001 completely. I think
there are still advantage to using the watcher: not generating a SQL
query on every node once per second and after the integration of
Fixes #112077
Release note: None