tenantcapabilitieswatcher: make the watcher react faster by knz · Pull Request #112094 · cockroachdb/cockroach

knz · 2023-10-10T14:12:54Z

Prior to this patch, the tenant info watcher would only react to
changes to system.tenants upon rangefeed cache flushes, which could
be (in default config) up to 3 seconds after the change is committed.

This commit accelerates the behavior by processing updates as soon as
the change is observed in the rangefeed.

This new behavior is similar to the way that cluster settings changes
are effected immediately in the settings watcher (pkg/settingswatcher).

Release note: None

cockroach-teamcity · 2023-10-10T14:13:09Z

This change is

stevendanna

I need to double-check, but I believe that the semantics of the translateEvent callback is that you are subject to the redelivery of duplicates. So I think we need to store the event's timestamp in our cache and skip updates from any older events to prevent regressing the state of a given tenant.

knz · 2023-10-10T14:31:21Z

good point. let me pick that up.

Release note: None

Prior to this patch, the tenant info watcher would only react to changes to `system.tenants` upon rangefeed cache flushes, which could be (in default config) up to 3 seconds after the change is committed. This commit accelerates the behavior by processing updates as soon as the change is observed in the rangefeed. This new behavior is similar to the way that cluster settings changes are effected immediately in the settings watcher (pkg/settingswatcher). Release note: None

stevendanna

Overall this looks good to me. One consequence of doing this is that changefeed restarts will cause >1 update for those watching for updates such as the server orchestrator. But as we discussed, those systems should be able to deal with many such updates per second.

A potential follow-up is that we added some code that restarts the capability watcher to avoid waiting out the 3 seconds (see TestingRestart in this file). With this change, I think that should largely be unnecessary.

Reviewed 1 of 1 files at r1, 7 of 7 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)

pkg/multitenant/tenantcapabilities/tenantcapabilitieswatcher/testdata/rangefeed_errors line 106 at r5 (raw file):

get-capabilities ten=10
----
ten=10 name=foo state=add service=shared cap={can_admin_unsplit:true}

Is this test telling us that we won't subsequently update our cache on the tenant deletion event if it happens during an error restart? Is that something we need to handle?

In cockroachdb#112001 we introduced a bug and an unintended behaviour change. The bug is that if we receive a notification of a state change from none to shared when the server is still shutting down, that state change will be ignored. Namely, the following can happen: 1. ALTER VIRTUAL CLUSTER a STOP SERVICE 2. Watcher gets notification of shutdown and notifies virtual cluster's SQL server. 3. Tenant "a" starts shutdown but does not fully complete it 4. ALTER VIRTUAL CLUSTER a START SERVICE SHARED 5. Watcher notifies the server orchestrator; but, since the SQL server has not finished stopping from the previous stop request, it appears as if it is already started. 6. Tenant "a" finishes shutdown. 7. Server orchestrator never again tries to start the virtual cluster. The newly added test reveals this under stress. The behaviour change is that previously if a SQL server for a virtual cluster failed to start up, it would previously be restarted. Here, we fix both of these by re-introducing a periodic polling of the service state. Unlike the previous polling, we poll the watcher state so we are not generating a SQL query every second. Further, since we are now calling the tenantcapabailities watcher GetAllTenants method every second in addition to on every update, I've moved where we allocate the list of all tenants to our handle update call. An alternative here would be to revert cockroachdb#112001 completely. I think there are still advantage to using the watcher: not generating a SQL query on every node once per second and more responsive server startup after the integration of cockroachdb#112094. Fixes cockroachdb#112077 Release note: None

114022: obsservice: export statement insights to obs service r=maryliag a=maryliag When an Insights is detected, we store it in memory and also export it to Observability Service. This behaviour is controlled by a new cluster setting `sql.insights.export.enabled`. If the value is disabled, no data is exported. Epic: CC-25978 Running YCSB for 10min (with Export Enable and Disabled): | Export | Type | ops(total) | ops/sec(cum) | avg(ms) | p50(ms) | p95(ms) | p99(ms) | pMax(ms) | |--|--|--|--|--|--|--|--|--| | Disabled | Read | 11779980 | 19633.3 | 0.9 | 0.7 | 2.2 | 3.7 | 192.9 | | Enabled | Read | 12470462 | 20784.1 | 0.9 | 0.7 | 2.0 | 3.4 | 201.3 | | Disabled | Update | 619139 | 1031.9 | 1.8 | 1.6 | 3.8 | 6.3 | 159.4 | | Enabled | Update | 656124 | 1093.5 | 1.7 | 1.4 | 3.5 | 5.8 | 121.6 | | Disabled | Result | 12399119 | 20665.2 | 1.0 | 0.8 | 2.4 | 3.9 | 192.9 | | Enabled | Total | 13126586 | 21877.6 | 0.9 | 0.7 | 2.2 | 3.7 | 201.3 | Running the Insights Workload (with Export Enable and Disabled) which is mostly statements that will be detected as having an Insight | Export | ops(total) | ops/sec(cum) | avg(ms) | p50(ms) | p95(ms) | p99(ms) | pMax(ms) | |--|--|--|--|--|--|--|--| |Disabled| 5948 | 9.9 | 2013.6 | 1811.9 | 5100.3 | 6442.5 | 10200.5 | |Enabled| 5906 | 9.8 | 2025.1 | 1744.8 | 5100.3 | 6174.0 | 9663.7 | Release note: None 114719: tenantcapabilitieswatcher: make the watcher react faster r=yuzefovich a=stevendanna Needed for #111637. Epic: CRDB-26691 Superceeds #112094 Prior to this patch, the tenant info watcher would only react to changes to `system.tenants` upon rangefeed cache flushes, which could be (in default config) up to 3 seconds after the change is committed. This commit accelerates the behavior by processing updates as soon as the rangefeed observes the change. This new behavior is similar to the way that cluster settings changes are processed immediately in the settings watcher (pkg/settingswatcher). In order to handle deletions that occur during errors that aren't automatically retried inside the rangefeed library (and are instead retried by the watcher resulting in a new full scan), we emit any scan-generated rangefeed events at their scan timestamp, allowing us a means of clearing any stale data from the cache. Release note: None 114743: server, ui: show internal stats with custer setting enabled r=maryliag a=maryliag Previously, when the cluster setting `sql.stats.response.show_internal.enabled` was set to true, we were still trying to use the activity tables to return data to the UI, but that tables doesn't have data from internal stats. This commit does that proper check to not use the Activity tables if that value is enabled. Also, since enabling that setting means the user wants to see the internal stats, this commit also changes the UI so that it returns the internal stats, oterhwise it might be confusing selecting top 100 but the UI only showing 40 for example. Fixes #114460 https://www.loom.com/share/8f3279ae4fd441999c24c68fb0fa8b14 Release note (bug fix): SQL Activity properly showing internal statements when cluster setting `sql.stats.response.show_internal.enabled` is set to true. 114746: storage: correct terminology in log line for disk stalls r=jbowens a=itsbilal Previously, we'd use the term "file write stall" in the log line for a disk stall, and not use the term "disk stall" which is what metrics and documentation refers to this event as. Furthermore, "write stalls" are a different, unrelated kind of stall in Pebble. This change addresses this by changing the wording to refer to this event as a "disk stall" in the log. Epic: none Release note (ops change): Updates the error message in case of stalled disks to use the appropriate term for that event ("disk stall"), which is also what metrics/dashboards refer to that event as. Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Steven Danna <danna@cockroachlabs.com> Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>

stevendanna · 2023-11-21T16:52:44Z

Merged as part of #114719

tenantcapabilitieswatcher: add a missing log scope

69c6c40

Release note: None

knz requested a review from stevendanna October 10, 2023 14:12

knz requested a review from a team as a code owner October 10, 2023 14:12

stevendanna reviewed Oct 10, 2023

View reviewed changes

tenantcapabilitieswatcher: make TestDataDriven faster

582c998

Release note: None

knz force-pushed the 20231010-faster-tenant-watcher branch from 601d6cc to a5022fe Compare October 10, 2023 15:06

knz force-pushed the 20231010-faster-tenant-watcher branch from a5022fe to 76b1930 Compare October 10, 2023 17:12

knz requested a review from stevendanna October 10, 2023 17:12

stevendanna reviewed Oct 10, 2023

View reviewed changes

stevendanna self-assigned this Oct 17, 2023

yuzefovich mentioned this pull request Oct 25, 2023

serverutils: probabilistic shared process virtual cluster #112822

Merged

stevendanna mentioned this pull request Nov 20, 2023

tenantcapabilitieswatcher: make the watcher react faster #114719

Merged

blathers-crl bot mentioned this pull request Nov 21, 2023

release-23.2: tenantcapabilitieswatcher: make the watcher react faster #114818

Merged

stevendanna closed this Nov 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tenantcapabilitieswatcher: make the watcher react faster#112094

tenantcapabilitieswatcher: make the watcher react faster#112094
knz wants to merge 3 commits intocockroachdb:masterfrom
knz:20231010-faster-tenant-watcher

knz commented Oct 10, 2023 •

edited

Loading

Uh oh!

cockroach-teamcity commented Oct 10, 2023

Uh oh!

stevendanna left a comment

Uh oh!

knz commented Oct 10, 2023

Uh oh!

stevendanna left a comment

Uh oh!

stevendanna commented Nov 21, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

knz commented Oct 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Oct 10, 2023

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

knz commented Oct 10, 2023

Uh oh!

stevendanna left a comment

Choose a reason for hiding this comment

Uh oh!

stevendanna commented Nov 21, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

knz commented Oct 10, 2023 •

edited

Loading