server,jobs: Better handle node drain#103709

Merged
craig[bot] merged 1 commit into cockroachdb:master from miretskiy:draino
May 23, 2023
Conversation

@miretskiy
Contributor

@miretskiy miretskiy commented May 22, 2023

Rework the job registry drain signal to terminate the drain as soon as the last job that was watching for the drain signal completes its drain.

Epic: CRDB-26978
Release note: None

@miretskiy miretskiy requested a review from knz May 22, 2023 00:40
@miretskiy miretskiy requested review from a team as code owners May 22, 2023 00:40
@miretskiy miretskiy requested review from jayshrivastava and rhu713 and removed request for a team May 22, 2023 00:40
@blathers-crl

blathers-crl bot commented May 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

knz
knz previously requested changes May 22, 2023
Contributor

@knz knz left a comment


This is better; I like it.
Could we also get a review from @stevendanna who has a keener eye on the topic of channel concurrency. Thanks

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @miretskiy, and @rhu713)


pkg/jobs/registry.go line 2021 at r1 (raw file):

	close(r.drainRequested)
	defer close(r.drainRequested)

Looks like you're closing the channel twice here.


pkg/jobs/registry.go line 2031 at r1 (raw file):

	t.Reset(maxWait)

	for numWait > 0 {

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).

Contributor

@knz knz left a comment


sorry i didn't mean to block this.

@knz knz dismissed their stale review May 22, 2023 09:28

not blocking

@knz knz requested a review from stevendanna May 22, 2023 09:29
@miretskiy miretskiy requested a review from knz May 22, 2023 11:20
Contributor Author

@miretskiy miretskiy left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @knz, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2021 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Looks like you're closing the channel twice here.

Doh; that's what you get for late Sunday PRs. Should have defer-closed r.drainJobs instead.


pkg/jobs/registry.go line 2031 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).

Ack.

@miretskiy
Contributor Author

@knz -- tftr -- okay to revert wait period to 10 seconds as part of this PR?

@knz
Contributor

knz commented May 22, 2023

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

@miretskiy
Contributor Author

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

10s it is; tested by reverting the change that had set it to 0 in drain_test (it would become flaky if the job wait logic remained as it was prior to this PR).

@miretskiy miretskiy force-pushed the draino branch 2 times, most recently from 8ee48a5 to 394c302 on May 22, 2023 12:07
Contributor

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.

@miretskiy
Contributor Author

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)

pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

I think it's fine. We are shutting down -- races during that time happen all the time.
OnDrain is never a guarantee -- so, if a job is relying on getting the full 10 seconds, the job should stop doing that.
It's a best-effort affair.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.

I think race detection could still trigger; but regardless, I'm not sure it's worth the effort.

@miretskiy
Contributor Author

@stevendanna mind taking a look?

@stevendanna
Collaborator

stevendanna commented May 23, 2023

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry.

I think jobs that care about this could immediately check whether the returned channel is closed before starting other work. This would be pretty similar to returning a bool or error, just an extra line of code or two.

Collaborator

@stevendanna stevendanna left a comment


Overall looks reasonable to me. Thanks!

Comment on lines +2026 to +2028
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(maxWait)
Collaborator


No need to change this, but since the function already takes a context, the caller could set a deadline on the context rather than having a separate wait_time argument.

Contributor Author


Ohh... I like this a lot. Making this change.

Rework job registry drain signal to terminate the drain
as soon as the last job that was watching for drain signal
completes its drain

Epic: CRDB-26978
Release note: None
@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented May 23, 2023

Build succeeded:

@craig craig bot merged commit e3b7e0c into cockroachdb:master May 23, 2023