server,jobs: Better handle node drain #103709
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
knz
left a comment
This is better; I like it.
Could we also get a review from @stevendanna who has a keener eye on the topic of channel concurrency. Thanks
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @miretskiy, and @rhu713)
pkg/jobs/registry.go line 2021 at r1 (raw file):
close(r.drainRequested)
defer close(r.drainRequested)
Looks like you're closing the channel twice here.
pkg/jobs/registry.go line 2031 at r1 (raw file):
t.Reset(maxWait)
for numWait > 0 {
You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).
knz
left a comment
sorry i didn't mean to block this.
miretskiy
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @knz, @rhu713, and @stevendanna)
pkg/jobs/registry.go line 2021 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
Looks like you're closing the channel twice here.
Doh; that's what you get for late Sunday PRs. Should have defer-closed r.drainJobs instead.
pkg/jobs/registry.go line 2031 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).
Ack.
@knz -- tftr -- okay to revert wait period to 10 seconds as part of this PR?
Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a
10s it is; tested by reverting the change that had set it to 0 in drain_test (it would become flaky with the job wait logic as it was prior to this PR).
Force-pushed from 8ee48a5 to 394c302
jayshrivastava
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)
pkg/jobs/registry.go line 2052 at r3 (raw file):
func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++
It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished when a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job that called OnDrain expected to have 10 seconds to clean up, but it did not get them.
It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.
I think it's fine. We are shutting down -- races during that time happen all the time.
I think race detection could still trigger; but regardless, I'm not sure it's worth the effort.
@stevendanna mind taking a look?
I think jobs that care about this could immediately check if the returned channel is closed before starting other work. This would be pretty similar to returning a bool or error, just an extra line of code or two.
stevendanna
left a comment
Overall looks reasonable to me. Thanks!
pkg/jobs/registry.go (outdated)
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(maxWait)
No need to change this, but since the function already takes a context, the caller could set a deadline on the context rather than having a separate wait_time argument.
Ohh... I like this a lot. Making this change.
bors r+
Build succeeded:
Rework job registry drain signal to terminate the drain as soon as the last job that was watching for drain signal completes its drain
Epic: CRDB-26978
Release note: None