roachtest: allow callers to listen for node events in monitor#108594
Conversation
a46ad50 to
33121c4
Compare
srosenberg
left a comment
There was a problem hiding this comment.
Reviewed 1 of 4 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, and @renatolabs)
pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):
// enables that use case. But it also offers protection against accidental // panics (NPEs and such) which should not bubble up to the runtime. err = errors.Wrap(errors.WithStack(rErr), "monitor task failed")
Should we make the error message more explicit, i.e., monitor user task failed?
pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):
eventsCh, err := roachprod.Monitor(m.ctx, m.l, m.nodes, install.MonitorOpts{}) if err != nil { return errors.Wrap(err, "monitor command failure")
monitor node command failure?
pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):
userErr := m.userGroup.Wait() m.cancel() // stop monitoring goroutine return errors.CombineErrors(userErr, m.WaitForNodeDeath())
Not immediately obvious why it's correct to call WaitForNodeDeath here. Maybe it's worth a comment that the context has been cancelled at this point, and Wait doesn't block.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):
var latestVersion roachpb.Version var opts retry.Options err := opts.Do(ctx, func(ctx context.Context) error {
This would change the semantics of timeout, no?
77158e5 to
00c30f8
Compare
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)
pkg/cmd/roachtest/monitor.go line 115 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Should we make the error message more explicit, i.e.,
monitor user task failed?
Good point, done.
pkg/cmd/roachtest/monitor.go line 167 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
monitor node command failure?
Updated.
pkg/cmd/roachtest/monitor.go line 200 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Not immediately obvious why it's correct to call
WaitForNodeDeathhere. Maybe it's worth a comment that the context has been cancelled at this point, andWaitdoesn't block.
Added a comment, let me know if it's clear.
pkg/cmd/roachtest/roachtestutil/clusterupgrade/clusterupgrade.go line 268 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This would change the semantics of
timeout, no?
Hah, good point, I saw this block and not respect context cancelation during a test, so I updated it to an API that does take context into account. In the process, I forgot about the timeout. Fixed.
miretskiy
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @rachitgsrivastava, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):
userGroup *errgroup.Group // user-provided functions monitorGroup *errgroup.Group // monitor goroutine monitorOnce sync.Once // guarantees monitor goroutine is only started once
is there a reason not to start node health monitor upon monitor creation (newMonitor?)
It might make things a bit simpler (certainly would remove the need to have monitorOnce.
Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.
|
Surprisingly, |
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, @rachitgsrivastava, and @srosenberg)
pkg/cmd/roachtest/monitor.go line 51 at r1 (raw file):
is there a reason not to start node health monitor upon monitor creation (newMonitor?)
That was my initial attempt, see #108554. TLDR: a lot of tests break if we did that, and it's a bigger change than I want to tackle right now.
Also, strongly consider using ctxgroup instead of errgroup -- ctxgropu avoids a lot of unpleasant errgroup surprises.
I'll give it a try.
|
@renatolabs -- also, my comments non-blocking; thanks for doing this. |
srosenberg
left a comment
There was a problem hiding this comment.
It's hard to change the monitor without breaking any test.
Indeed, monitor is pretty gnarly wrt correctness. Thanks for doing this!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @miretskiy, and @rachitgsrivastava)
00c30f8 to
9b8ffa6
Compare
In cockroachdb#107548, the `(*monitorImpl).wait()` function was modified such that, if the caller provided no functions (via the `Go` method), then `wait()` would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call to `wait()` is made. This behaviour is preferred as it's familiar: it's how `errgroup` and our own `ctxgroup` work. That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to `Go()`. To support that scenario, a new function is added to the `Monitor` interface: `WaitForNodeDeath`. This function is like `WaitE`, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call call `Wait{,E}()` as before. The internal implementation of the `monitorImpl` is also simplified by making use of two `errgroup`s to achieve this behaviour, replacing the previous, harder to understand, logic. Fixes: cockroachdb#108530. Release note: None
9b8ffa6 to
85f7005
Compare
|
Run with I think this is good to go, TFTR! bors r=srosenberg |
|
Build succeeded: |
In #107548, the
(*monitorImpl).wait()function was modified such that, if the caller provided no functions (via theGomethod), thenwait()would block while listening for node events. However, that was a change in the semantics of the function: previously, a monitor without goroutines would return immediately after a call towait()is made. This behaviour is preferred as it's familiar: it's howerrgroupand our ownctxgroupwork.That said, we still need some way to support the use-case where callers are only interested in listening for unexpected node deaths without having to pass a function to
Go(). To support that scenario, a new function is added to theMonitorinterface:WaitForNodeDeath. This function is likeWaitE, except that only errors related to unexpected node deaths are returned. When callers wish to terminate the monitoring goroutine, they call callWait{,E}()as before.The internal implementation of the
monitorImplis also simplified by making use of twoerrgroups to achieve this behaviour, replacing the previous, harder to understand, logic.Fixes: #108530.
Release note: None