server/drain: shut down SQL subsystems gracefully before releasing table leases#100476
Conversation
| } | ||
|
|
||
| // Inform the job system that the node is draining. | ||
| s.sqlServer.jobRegistry.SetDraining() |
There was a problem hiding this comment.
Do we need this to happen after the above step?
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)
pkg/server/drain.go line 359 at r4 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Do we need this to happen after the above step?
How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?
|
|
|
huh why did you close it? |
Just curious. The whole shutdown is a bit of a "dance"... I was working on some other code, and wanted to get as much
I have NO idea how that happen. Apologies. I guess I was typing reply, and hit tab or something to change focus... |
knz
left a comment
There was a problem hiding this comment.
I was working on some other code, and wanted to get as much
"notification" as possible when node starts draining, so moving this removes that time.
If we find there is a benefit to notifying various subsystems when a drain is starting (as opposed to the hard deadline of it completing), we could also build that. It's out of scope of this PR but I think it'd be an interesting project.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson, @miretskiy, and @rytaft)
1006874 to
920ac9b
Compare
|
@jeffswenson i have added the extra logging and test knob in a last commit as we discussed. |
rytaft
left a comment
There was a problem hiding this comment.
SQL Queries changes LGTM
Reviewed 3 of 3 files at r2, 3 of 3 files at r6.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @knz and @miretskiy)
-- commits line 20 at r2:
These changes look good. Do you have any explicit tests that this is working?
pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):
// helps avoiding log spam. err = errors.New("server is shutting down") }
If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @rytaft)
Previously, rytaft (Rebecca Taft) wrote…
These changes look good. Do you have any explicit tests that this is working?
Will add those as part of #100436 which will exercise this code path.
pkg/server/drain.go line 359 at r4 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
How so? I figured that if we mark the registry as draining before we wait for SQL clients to disconnect, we create the risk that one of these sessions will issue a BACKUP or some other job-based statement before it disconnects, and encounter a job error as a result (that the registry is now unavailable). Do you need a comment here to explain that?
Added a comment.
pkg/sql/stats/automatic_stats.go line 814 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
If I understand correctly how this works, one of these cases will be chosen at random if all are ready. Should we bias selection of the latter two cases to avoid the log spam?
As discussed in #100482 it is actually undesirable to fail to queue the mutation during a server shutdown. So probably we benefit from the extra warning in that case.
920ac9b to
e04a286
Compare
This change ensures that the job registry shuts down and cancels all currently running job before the KV layer gets shut down. This avoids noise in the logs and also creates a natural place in the control flow of a drain where the SQL instance can de-register itself with a guarantee that the record is not needed any more. Release note: None
This ensures that the auto-stats task shut down before the SQL lease manager and other infrastructure has started quiescing. It avoids log spam caused by the auto-stats tasks and provide a guarantee of no further SQL activity caused by auto-stats after that stage of a graceful drain. Release note: None
This commit ensures that the background tasks that manage the persistence of SQL stats are shut down during graceful drain, to reduce log spam at the end of the shutdown sequence and also to provide a sequence point where we are confident there is no more SQL activity. Release note: None
Release note: None
Prior to this patch, it wasn't possible for a test to assert that a graceful drain occurred prior to shutting down a server. This patch adds it. It also clarifies the logging output around the last phases of a graceful drain to ease troubleshooting. Release note:
e04a286 to
0a0985e
Compare
|
bors r=JeffSwenson,rytaft |
|
Build succeeded: |
The test becamse flaky after cockroachdb#100476 merged Fixes cockroachdb#100903 Release note: None
100893: ui: search criteria ux improvements r=maryliag a=maryliag Some of the names of sort on search criteria were not a match for the column name on the tables, which could cause confusion. This commit updates the values of "P99" to "P99 Latency" and "Service Latency" to "Statement time" and "Transaction time". Epic: None Release note (ui change): Update sort label on Search Criteria to match the name on the table columns. 101058: roachtest: bump tpccbench timeout r=srosenberg a=renatolabs Looking at the test history, we see that tpccbench may sometimes take longer than 5h, especially in multi-region setups. For that reason, we bump the timeout for this test to 7h, which should be sufficient and avoid failures due to timeouts. This commit also removes an unused `MinVersion` field in the `tpccBenchSpec` struct. Resolves #100975. Release note: None 101077: changefeedccl: Fix TestChangefeedHandlesDrainingNodes test r=miretskiy a=miretskiy The test becamse flaky after #100476 merged Fixes #100903 Release note: None 101097: server: fix a race condition during server initialization r=irfansharif a=knz Fixes #91414. Fixes #101010. Fixes #100902. The call to `registerEnginesForDiskStatsMap` needs to wait until the store IDs are known. Release note: None Epic: None Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com> Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Needed for #99941 and #99958.
Epic: CRDB-23559
See individual commits for details.