roachtest: allow callers to listen for node events in monitor by renatolabs · Pull Request #108594 · cockroachdb/cockroach

renatolabs · 2023-08-11T14:28:39Z

In #107548, the (*monitorImpl).wait() function was modified such that, if the caller provided no functions (via the Go method), then wait() would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to wait() is made. This behaviour is preferred as it's familiar: it's how errgroup and our own ctxgroup work.

That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to Go(). To support that scenario, a new function is added to the Monitor interface: WaitForNodeDeath. This function is like WaitE, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call Wait{,E}() as before.

The internal implementation of the monitorImpl is also simplified by making use of two errgroups to achieve this behaviour, replacing the previous, harder to understand, logic.

Fixes: #108530.

Release note: None

cockroach-teamcity · 2023-08-11T14:28:49Z

This change is

srosenberg

Reviewed 1 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, and @renatolabs)

pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):

			// enables that use case. But it also offers protection against accidental
			// panics (NPEs and such) which should not bubble up to the runtime.
			err = errors.Wrap(errors.WithStack(rErr), "monitor task failed")

Should we make the error message more explicit, i.e., monitor user task failed?

pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):

			eventsCh, err := roachprod.Monitor(m.ctx, m.l, m.nodes, install.MonitorOpts{})
			if err != nil {
				return errors.Wrap(err, "monitor command failure")

monitor node command failure?

pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):

	userErr := m.userGroup.Wait()
	m.cancel() // stop monitoring goroutine
	return errors.CombineErrors(userErr, m.WaitForNodeDeath())

Not immediately obvious why it's correct to call WaitForNodeDeath here. Maybe it's worth a comment that the context has been cancelled at this point, and Wait doesn't block.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):

		var latestVersion roachpb.Version
		var opts retry.Options
		err := opts.Do(ctx, func(ctx context.Context) error {

This would change the semantics of timeout, no?

renatolabs

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)

pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Should we make the error message more explicit, i.e., monitor user task failed?

Good point, done.

pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

monitor node command failure?

Updated.

pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Not immediately obvious why it's correct to call WaitForNodeDeath here. Maybe it's worth a comment that the context has been cancelled at this point, and Wait doesn't block.

Added a comment, let me know if it's clear.

pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

This would change the semantics of timeout, no?

Hah, good point, I saw this block and not respect context cancelation during a test, so I updated it to an API that does take context into account. In the process, I forgot about the timeout. Fixed.

miretskiy

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @renatolabs, and @srosenberg)

pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):

	userGroup    *errgroup.Group // user-provided functions
	monitorGroup *errgroup.Group // monitor goroutine
	monitorOnce  sync.Once       // guarantees monitor goroutine is only started once

is there a reason not to start node health monitor upon monitor creation (newMonitor?)
It might make things a bit simpler (certainly would remove the need to have monitorOnce.

Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.

renatolabs · 2023-08-11T18:54:56Z

Surprisingly, sstable-corruption/table failed with these changes because of specific assumptions on the error returned by the monitor. I updated that test to be more flexible; I'll run more extensive tests on this branch before merging. It's hard to change the monitor without breaking any test. 😅

renatolabs

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)

pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):

is there a reason not to start node health monitor upon monitor creation (newMonitor?)

That was my initial attempt, see #108554. TLDR: a lot of tests break if we did that, and it's a bigger change than I want to tackle right now.

Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.

I'll give it a try.

miretskiy · 2023-08-11T18:59:10Z

@renatolabs -- also, my comments non-blocking; thanks for doing this.

srosenberg

It's hard to change the monitor without breaking any test.

Indeed, monitor is pretty gnarly wrt correctness. Thanks for doing this!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)

In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None

renatolabs · 2023-08-14T13:24:28Z

Run with SELECT_PROBABILITY=0.5 has only "expected" failures:

https://teamcity.cockroachdb.com/viewLog.html?buildId=11289749&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel&tab=buildResultsDiv

I think this is good to go, TFTR!

bors r=srosenberg

craig · 2023-08-14T14:36:10Z

Build succeeded:

Bazel Essential CI (Cockroach)

renatolabs requested a review from a team as a code owner August 11, 2023 14:28

renatolabs requested review from DarrylWong and rachitgsrivastava and removed request for a team August 11, 2023 14:28

renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from a46ad50 to 33121c4 Compare August 11, 2023 14:32

srosenberg self-requested a review August 11, 2023 16:04

srosenberg approved these changes Aug 11, 2023

View reviewed changes

renatolabs requested a review from miretskiy August 11, 2023 18:06

renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch 2 times, most recently from 77158e5 to 00c30f8 Compare August 11, 2023 18:52

renatolabs commented Aug 11, 2023

View reviewed changes

miretskiy requested a review from srosenberg August 11, 2023 18:54

miretskiy reviewed Aug 11, 2023

View reviewed changes

renatolabs commented Aug 11, 2023

View reviewed changes

srosenberg approved these changes Aug 11, 2023

View reviewed changes

renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from 00c30f8 to 9b8ffa6 Compare August 11, 2023 19:54

renatolabs force-pushed the rc/roachtest-allow-monitor-node-deaths-only branch from 9b8ffa6 to 85f7005 Compare August 11, 2023 21:43

renatolabs added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Aug 14, 2023

craig bot merged commit 83f1421 into cockroachdb:master Aug 14, 2023

blathers-crl bot mentioned this pull request Aug 14, 2023

release-23.1: roachtest: allow callers to listen for node events in monitor #108716

Merged

renatolabs deleted the rc/roachtest-allow-monitor-node-deaths-only branch August 14, 2023 14:36

renatolabs mentioned this pull request Jan 23, 2024

roachtest: split Monitor functionality in two components #118214

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: allow callers to listen for node events in monitor#108594

roachtest: allow callers to listen for node events in monitor#108594
craig[bot] merged 1 commit intocockroachdb:masterfrom
renatolabs:rc/roachtest-allow-monitor-node-deaths-only

renatolabs commented Aug 11, 2023

Uh oh!

cockroach-teamcity commented Aug 11, 2023

Uh oh!

srosenberg left a comment

Uh oh!

renatolabs left a comment

Uh oh!

miretskiy left a comment

Uh oh!

renatolabs commented Aug 11, 2023

Uh oh!

renatolabs left a comment

Uh oh!

miretskiy commented Aug 11, 2023

Uh oh!

srosenberg left a comment

Uh oh!

renatolabs commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

renatolabs commented Aug 11, 2023

Uh oh!

cockroach-teamcity commented Aug 11, 2023

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

miretskiy left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Aug 11, 2023

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

miretskiy commented Aug 11, 2023

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Aug 14, 2023

Uh oh!

craig bot commented Aug 14, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants