Skip to content

roachtest: allow callers to listen for node events in monitor#108594

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
renatolabs:rc/roachtest-allow-monitor-node-deaths-only
Aug 14, 2023
Merged

roachtest: allow callers to listen for node events in monitor#108594
craig[bot] merged 1 commit intocockroachdb:masterfrom
renatolabs:rc/roachtest-allow-monitor-node-deaths-only

Conversation

@renatolabs
Copy link
Copy Markdown

In #107548, the (*monitorImpl).wait() function was modified such that, if the caller provided no functions (via the Go method), then wait() would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to wait() is made. This behaviour is preferred as it's familiar: it's how errgroup and our own ctxgroup work.

That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to Go(). To support that scenario, a new function is added to the Monitor interface: WaitForNodeDeath. This function is like WaitE, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call Wait{,E}() as before.

The internal implementation of the monitorImpl is also simplified by making use of two errgroups to achieve this behaviour, replacing the previous, harder to understand, logic.

Fixes: #108530.

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner August 11, 2023 14:28
@renatolabs renatolabs requested review from DarrylWong and rachitgsrivastava and removed request for a team August 11, 2023 14:28
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from a46ad50 to 33121c4 Compare August 11, 2023 14:32
@srosenberg srosenberg self-requested a review August 11, 2023 16:04
Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, and @renatolabs)


pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):

			// enables that use case. But it also offers protection against accidental
			// panics (NPEs and such) which should not bubble up to the runtime.
			err = errors.Wrap(errors.WithStack(rErr), "monitor task failed")

Should we make the error message more explicit, i.e., monitor user task failed?


pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):

			eventsCh, err := roachprod.Monitor(m.ctx, m.l, m.nodes, install.MonitorOpts{})
			if err != nil {
				return errors.Wrap(err, "monitor command failure")

monitor node command failure?


pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):

	userErr := m.userGroup.Wait()
	m.cancel() // stop monitoring goroutine
	return errors.CombineErrors(userErr, m.WaitForNodeDeath())

Not immediately obvious why it's correct to call WaitForNodeDeath here. Maybe it's worth a comment that the context has been cancelled at this point, and Wait doesn't block.


pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):

		var latestVersion roachpb.Version
		var opts retry.Options
		err := opts.Do(ctx, func(ctx context.Context) error {

This would change the semantics of timeout, no?

@renatolabs renatolabs requested a review from miretskiy August 11, 2023 18:06
@renatolabs renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch 2 times, most recently from 77158e5 to 00c30f8 Compare August 11, 2023 18:52
Copy link
Copy Markdown
Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)


pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Should we make the error message more explicit, i.e., monitor user task failed?

Good point, done.


pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

monitor node command failure?

Updated.


pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Not immediately obvious why it's correct to call WaitForNodeDeath here. Maybe it's worth a comment that the context has been cancelled at this point, and Wait doesn't block.

Added a comment, let me know if it's clear.


pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

This would change the semantics of timeout, no?

Hah, good point, I saw this block and not respect context cancelation during a test, so I updated it to an API that does take context into account. In the process, I forgot about the timeout. Fixed.

@miretskiy miretskiy requested a review from srosenberg August 11, 2023 18:54
Copy link
Copy Markdown
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):

	userGroup    *errgroup.Group // user-provided functions
	monitorGroup *errgroup.Group // monitor goroutine
	monitorOnce  sync.Once       // guarantees monitor goroutine is only started once

is there a reason not to start node health monitor upon monitor creation (newMonitor?)
It might make things a bit simpler (certainly would remove the need to have monitorOnce.

Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.

@renatolabs
Copy link
Copy Markdown
Author

Surprisingly, sstable-corruption/table failed with these changes because of specific assumptions on the error returned by the monitor. I updated that test to be more flexible; I'll run more extensive tests on this branch before merging. It's hard to change the monitor without breaking any test. 😅

Copy link
Copy Markdown
Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)


pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):

is there a reason not to start node health monitor upon monitor creation (newMonitor?)

That was my initial attempt, see #108554. TLDR: a lot of tests break if we did that, and it's a bigger change than I want to tackle right now.

Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.

I'll give it a try.

@miretskiy
Copy link
Copy Markdown
Contributor

@renatolabs -- also, my comments non-blocking; thanks for doing this.

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to change the monitor without breaking any test.

Indeed, monitor is pretty gnarly wrt correctness. Thanks for doing this!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)

@renatolabs renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from 00c30f8 to 9b8ffa6 Compare August 11, 2023 19:54
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such
that, if the caller provided no functions (via the `Go` method), then
`wait()` would block while listening for node events. However, that
was a change in the semantics of the function: previously, a monitor
without goroutines would return immediately after a call to `wait()`
is made. This behaviour is preferred as it's familiar: it's how
`errgroup` and our own `ctxgroup` work.

That said, we still need some way to support the use-case where
callers are only interested in listening for unexpected node deaths
without having to pass a function to `Go()`. To support that scenario,
a new function is added to the `Monitor` interface:
`WaitForNodeDeath`. This function is like `WaitE`, except that only
errors related to unexpected node deaths are returned. When callers
wish to terminate the monitoring goroutine, they call call
`Wait{,E}()` as before.

The internal implementation of the `monitorImpl` is also simplified by
making use of two `errgroup`s to achieve this behaviour, replacing the
previous, harder to understand, logic.

Fixes: cockroachdb#108530.

Release note: None
@renatolabs renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from 9b8ffa6 to 85f7005 Compare August 11, 2023 21:43
@renatolabs
Copy link
Copy Markdown
Author

Run with SELECT_PROBABILITY=0.5 has only "expected" failures:

https://teamcity.cockroachdb.com/viewLog.html?buildId=11289749&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildResultsDiv

I think this is good to go, TFTR!

bors r=srosenberg

@renatolabs renatolabs added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Aug 14, 2023
@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 14, 2023

Build succeeded:

@craig craig bot merged commit 83f1421 into cockroachdb:master Aug 14, 2023
@renatolabs renatolabs deleted the rc/roachtest-allow-monitor-node-deaths-only branch August 14, 2023 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: Monitor.Wait should work correctly without active tasks

4 participants