server,jobs: Better handle node drain#103709

Merged
craig[bot] merged 1 commit into cockroachdb:master from miretskiy:draino
May 23, 2023
Conversation

@miretskiy
Contributor

@miretskiy miretskiy commented May 22, 2023

Rework the job registry drain signal to terminate the drain as soon as the last job that was watching for the drain signal completes its drain.

Epic: CRDB-26978
Release note: None

@miretskiy miretskiy requested a review from knz May 22, 2023 00:40
@miretskiy miretskiy requested review from a team as code owners May 22, 2023 00:40
@miretskiy miretskiy requested review from jayshrivastava and rhu713 and removed request for a team May 22, 2023 00:40
@blathers-crl

blathers-crl bot commented May 22, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

knz
knz previously requested changes May 22, 2023
Contributor

@knz knz left a comment


This is better; I like it.
Could we also get a review from @stevendanna who has a keener eye on the topic of channel concurrency. Thanks

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @miretskiy, and @rhu713)


pkg/jobs/registry.go line 2021 at r1 (raw file):

	close(r.drainRequested)
	defer close(r.drainRequested)

Looks like you're closing the channel twice here.


pkg/jobs/registry.go line 2031 at r1 (raw file):

	t.Reset(maxWait)

	for numWait > 0 {

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).

Contributor

@knz knz left a comment


sorry i didn't mean to block this.

@knz knz dismissed their stale review May 22, 2023 09:28

not blocking

@knz knz requested a review from stevendanna May 22, 2023 09:29
@miretskiy miretskiy requested a review from knz May 22, 2023 11:20
Contributor Author

@miretskiy miretskiy left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jayshrivastava, @knz, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2021 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

Looks like you're closing the channel twice here.

Doh; that's what you get for late Sunday PRs. Should have defer-closed r.drainJobs instead.


pkg/jobs/registry.go line 2031 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

You need to wait on the stopper's ShouldQuiesce here. It's a different condition than ctx.Done (for the time being - we plan to improve that).

Ack.

@miretskiy
Contributor Author

@knz -- tftr -- okay to revert wait period to 10 seconds as part of this PR?

@knz
Contributor

knz commented May 22, 2023

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

@miretskiy
Contributor Author

okay to revert wait period to 10 seconds as part of this PR?

Yes - as long as we also manually confirm that a drain remains fast(er) in most cases, e.g. when shutting down a cockroach start-single-node process gracefully.

10s it is; tested by reverting the change that had set it to 0 in drain_test (it would become flaky if the job wait logic remained as it was prior to this PR).

@miretskiy miretskiy force-pushed the draino branch 2 times, most recently from 8ee48a5 to 394c302 on May 22, 2023 12:07
Contributor

@jayshrivastava jayshrivastava left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)


pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.

@miretskiy
Contributor Author

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz, @miretskiy, @rhu713, and @stevendanna)

pkg/jobs/registry.go line 2052 at r3 (raw file):

func (r *Registry) OnDrain() (<-chan struct{}, func()) {
	r.mu.Lock()
	r.mu.numDrainWait++

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry. Say the registry is executing DrainRequested and the timer is almost finished and a new job calls OnDrain. The timer will fire immediately and trigger the drain. The job called OnDrain expecting to have 10 seconds to clean up, but it did not get 10 seconds.

I think it's fine. We are shutting down -- races during that time happen all the time.
OnDrain is never a guarantee -- so, if a job is relying on getting the full 10 seconds, the job should stop doing that.
It's a best-effort affair.

It may be worth having a fast path where OnDrain returns an error if the registry is already draining (r.mu.draining = true). You could also change the last case of the select in DrainRequested to do numWait-- without locking the mutex.

I think race detection could still trigger; but regardless, I'm not sure it's worth the effort.

@miretskiy
Contributor Author

@stevendanna mind taking a look?

@stevendanna
Collaborator

stevendanna commented May 23, 2023

It's a little odd that a job can call OnDrain and increment r.mu.numDrainWait in the middle of DrainRequested being executed by the registry.

I think jobs that care about this could immediately check whether the returned channel is closed before starting other work. This would be pretty similar to returning a bool or error, just an extra line of code or two.

Collaborator

@stevendanna stevendanna left a comment


Overall looks reasonable to me. Thanks!

Comment on lines +2026 to +2028
t := timeutil.NewTimer()
defer t.Stop()
t.Reset(maxWait)
Collaborator


No need to change this, but since the function already takes a context, the caller could set a deadline on the context rather than having a separate wait_time argument.

Contributor Author


Ohh... I like this a lot. Making this change.

Rework job registry drain signal to terminate the drain
as soon as the last job that was watching for drain signal
completes its drain

Epic: CRDB-26978
Release note: None
@miretskiy
Contributor Author

bors r+

@craig
Contributor

craig bot commented May 23, 2023

Build succeeded:

@craig craig bot merged commit e3b7e0c into cockroachdb:master May 23, 2023