roachtest: support running steps in the background in mixedversion#96990
Conversation
|
This is part of a stacked PR comprised of 3 parts. Please review them in order.
To be reviewed in this PR: two last commits. |
c57360a to
3d4f8a8
Compare
3d4f8a8 to
c930a70
Compare
|
Update:
To be reviewed: all changes in the PR. |
c930a70 to
e70dac8
Compare
srosenberg
left a comment
There was a problem hiding this comment.
Nice work! I'll take another pass tomorrow.
Reviewed 4 of 4 files at r1, 1 of 5 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):
func (c *Command) MaybeFlag(condition bool, name string, val interface{}) *Command { if condition { return c.Flag(name, fmt.Sprint(val))
Double-formatting; Flag already applies Sprint.
pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):
} // String returns the string representation of the command, which can
It's a canonical representation, right? (Since it's ordered by flag names.)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):
// background during the test. Background functions are kicked off // once the cluster has been initialized (i.e., after all startup // steps have finished).
When do they stop normally?
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):
addSeed := func(cmd *roachtestutil.Command) { if !cmd.HasFlag("seed") { cmd.Flag("seed", seed)
Why not use different seeds if neither initCmd nor runCmd has specified it?
pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):
var delayStr string if s.delay.Milliseconds() > 0 { delayStr = fmt.Sprintf("after %s delay", s.delay)
The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):
ctx context.Context testContext *Context backgroundCount int64
Maybe rename to bgCount and describe that it denotes the number of created background steps; it's not immediately clear.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):
testFailure struct { displayed bool
Methinks summarized might be better.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):
seed int64 binaryVersions []roachpb.Version clusterVersionsBefore []roachpb.Version
Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):
// each step in sequence. func (tr *testRunner) run() error { defer tr.closeConnections()
What's the default timeout for db.Close() and hence the upper-bound for defers?
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
tr.cancel() // Wait for some time so that any background tasks are properly // canceled and their cancelation messages are displayed in the
Nit: cancellation
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):
// canceled and their cancelation messages are displayed in the // logs accordingly. time.Sleep(5 * time.Second)
This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):
} return fmt.Errorf("background step `%s` returned error: %w", event.Name, event.Err)
What if a background step fails concurrently with some other step? The error(s) in the stepsErr won't be reported.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):
} // stepError augments generates a `testFailure` error by augmenting
Nit: remove augments
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):
} func (br *backgroundRunner) Events() <-chan backgroundEvent {
CompletedEvents seems more precise.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):
} // Background allows test writers to create functions that run in the
Nit: writers --> author
srosenberg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
When do they stop normally?
It's worth clarifying that userFunc itself is executed locally, i.e., same node where runner is executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):
idGen func() int, prng *rand.Rand, nodes option.NodeListOption,
nodes is unused
srosenberg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):
func newBackgroundRunner(ctx context.Context) *backgroundRunner { g, _ := errgroup.WithContext(ctx)
I had to pause here to confirm the right context is passed below. Have you considered using ctxgroup.WithContext(ctx) instead? The wrapper makes juggling of parent and child contexts less error-prone.
`commandbuilder.go` adds the `roachtestutil.Command` type, a thin wrapper that helps tests build complex commands to be run in a cluster. The way roachtests deal with this right now is by a combination of string concatenation and ad-hoc `fmt.Sprint` calls. This has at least two drawbacks: * it makes it difficult to piece together the command that is actually being run in a cluster, especially if conditionals and string formatters are involved; * if a function takes a command (string) as parameter, there's no easy way to either verify if a flag is present or to safely add a command line flag. The `roachtestutil.Command` type aims to solve both of the shortcomings. Over time, roachtests can move from raw strings to this type as needed. In addition, as `roachtest` starts to provide higher level APIs for test writers, having a more programmable interface to commands is desirable. Epic: CRDB-19321 Release note: None
This adds a `BackgroundFunc` API to the `mixedversion` package in roachtest, allowing test writers to run tasks in the background during an upgrade test. The most common use-case for this functionality is running a workload while the cluster upgrades (other similar use-cases exist in a variety of tests); for this reason, a `Workload` convenience function is added that allows tests to add a workload to a mixed-version test with one function call. Currently, each test needs to devise their own mechanism to: spawn the background task; monitor its execution; and terminate the test on error. The current API aims to reduce copying and pasting of such logic, making for a more declarative test. In the future, the test planner itself could decide to run some steps in the background and it should be able to leverage the mechanisms introduced in this commit. Epic: CRDB-19321 Release note: None
e70dac8 to
b16ffe1
Compare
renatolabs
left a comment
There was a problem hiding this comment.
RFAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Double-formatting;
Flagalready appliesSprint.
Fixed.
pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It's a canonical representation, right? (Since it's ordered by flag names.)
Indeed. Comment updated.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
It's worth clarifying that
userFuncitself is executed locally, i.e., same node whererunneris executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?
Added words to this function's documentation and also to the userFunc type.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Why not use different seeds if neither
initCmdnorrunCmdhas specified it?
I'm not sure I understand the suggestion.
The idea here is that if the user provided a specific seed, they probably have good reason to do it (e.g., reading fixtures from a bucket). If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds (since they are now random by default), so we can set anything we want.
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
nodesis unused
Huh, I guess that's something we'll miss from golint. Thanks for catching.
pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.
I was thinking people could be confused when reading "with 0s delay", but I also see your point. Added the delay message unconditionally.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Maybe rename to
bgCountand describe that it denotes the number of created background steps; it's not immediately clear.
Done.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Methinks
summarizedmight be better.
Renamed.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.
Added comment to explain.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
What's the default timeout for
db.Close()and hence the upper-bound for defers?
I did a quick search and didn't find anything conclusive; I don't think that call has a builtin timeout mechanism (?). That said, the old mixed-version framework (newMixedVersionTest and friends) have the same db.Close() loop at the end of the test and I don't remember that being a source of headaches in the past, so I'm thinking we can leave this as-is until it becomes a problem.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
So... I'm mostly used to spell cancelled and cancellation myself, but I noticed Go uses context.Canceled so I started using canceled and cancelation here. If we are to trust sapling.ai (first result on a Google search):
Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).
https://sapling.ai/usage/cancelation-vs-cancellation
🤷 I'm fine either way but opted for consistency.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)
Yep, it's here just for convenience and does not impact correctness. The main goal, as the comment says, is to allow the context cancelation to be observed by any other background functions so that they log their termination appropriately.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
What if a background step fails concurrently with some other step? The error(s) in the
stepsErrwon't be reported.
It won't be reported in the main test.log, but the failure won't be lost -- it should still be captured in the step's logger:
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit: remove
augments
Done.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I had to pause here to confirm the right context is passed below. Have you considered using
ctxgroup.WithContext(ctx)instead? The wrapper makes juggling of parent and child contexts less error-prone.
I was on the edge of restructuring how these contexts are handled so this comment was the push I needed to actually do it. Test runner now uses ctxgroup; downside is that userFunc now takes a context.Context parameter (another one), but I think things are much better now: the test runner's context is the parent context of every mixed-version related context.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
CompletedEventsseems more precise.
Renamed.
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Nit:
writers-->author
Done.
srosenberg
left a comment
There was a problem hiding this comment.
Thanks for addressing all comments! I clarified the thing about using different PRNG seeds for init and load, but it's a super minor point. Otherwise, this looks great!
Reviewed 2 of 7 files at r3.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):
If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds
Right. My suggestion is why not use different seeds in this case? seed is the same in both calls to addSeed
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
Previously, renatolabs (Renato Costa) wrote…
So... I'm mostly used to spell
cancelledandcancellationmyself, but I noticed Go usescontext.Canceledso I started usingcanceledandcancelationhere. If we are to trustsapling.ai(first result on a Google search):Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).
https://sapling.ai/usage/cancelation-vs-cancellation
🤷 I'm fine either way but opted for consistency.
Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.
herkolategan
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.
I'm chuffed with the outcome of this comment 😄
renatolabs
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)
pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds
Right. My suggestion is why not use different seeds in this case?
seedis the same in both calls toaddSeed
Oh, right! The reason I'm using the same seed is that some workloads rely on the same seed being passed to both init and run (admittedly, rand is the only one that comes to mind where that dependency exists, but there could be others). Using the same seed is a safe choice.
|
bors r=srosenberg TFTR! |
|
Build succeeded: |
This adds a
BackgroundFuncAPI to themixedversionpackage inroachtest, allowing test writers to run tasks in the background during
an upgrade test. The most common use-case for this functionality is
running a workload while the cluster upgrades (other similar use-cases
exist in a variety of tests); for this reason, a
Workloadconvenience function is added that allows tests to add a workload to a
mixed-version test with one function call.
Currently, each test needs to devise their own mechanism to: spawn the
background task; monitor its execution; and terminate the test on
error.
The current API aims to reduce copying and pasting of such logic,
making for a more declarative test. In the future, the test planner
itself could decide to run some steps in the background and it should
be able to leverage the mechanisms introduced in this commit.
Epic: CRDB-19321
Release note: None