roachtest: support running steps in the background in `mixedversion` by renatolabs · Pull Request #96990 · cockroachdb/cockroach

renatolabs · 2023-02-10T23:16:16Z

This adds a BackgroundFunc API to the mixedversion package in
roachtest, allowing test writers to run tasks in the background during
an upgrade test. The most common use-case for this functionality is
running a workload while the cluster upgrades (other similar use-cases
exist in a variety of tests); for this reason, a Workload
convenience function is added that allows tests to add a workload to a
mixed-version test with one function call.

Currently, each test needs to devise their own mechanism to: spawn the
background task; monitor its execution; and terminate the test on
error.

The current API aims to reduce copying and pasting of such logic,
making for a more declarative test. In the future, the test planner
itself could decide to run some steps in the background and it should
be able to leverage the mechanisms introduced in this commit.

Epic: CRDB-19321

Release note: None

cockroach-teamcity · 2023-02-10T23:16:30Z

This change is

renatolabs · 2023-02-10T23:19:08Z

This is part of a stacked PR comprised of 3 parts. Please review them in order.

To be reviewed in this PR: two last commits.

renatolabs · 2023-02-24T19:43:43Z

Update:

To be reviewed: all changes in the PR.

srosenberg

Nice work! I'll take another pass tomorrow.

Reviewed 4 of 4 files at r1, 1 of 5 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):

func (c *Command) MaybeFlag(condition bool, name string, val interface{}) *Command {
	if condition {
		return c.Flag(name, fmt.Sprint(val))

Double-formatting; Flag already applies Sprint.

pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):

}

// String returns the string representation of the command, which can

It's a canonical representation, right? (Since it's ordered by flag names.)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

// background during the test. Background functions are kicked off
// once the cluster has been initialized (i.e., after all startup
// steps have finished).

When do they stop normally?

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

	addSeed := func(cmd *roachtestutil.Command) {
		if !cmd.HasFlag("seed") {
			cmd.Flag("seed", seed)

Why not use different seeds if neither initCmd nor runCmd has specified it?

pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):

		var delayStr string
		if s.delay.Milliseconds() > 0 {
			delayStr = fmt.Sprintf("after %s delay", s.delay)

The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):

		ctx             context.Context
		testContext     *Context
		backgroundCount int64

Maybe rename to bgCount and describe that it denotes the number of created background steps; it's not immediately clear.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):

	testFailure struct {
		displayed             bool

Methinks summarized might be better.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):

		seed                  int64
		binaryVersions        []roachpb.Version
		clusterVersionsBefore []roachpb.Version

Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):

// each step in sequence.
func (tr *testRunner) run() error {
	defer tr.closeConnections()

What's the default timeout for db.Close() and hence the upper-bound for defers?

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

		tr.cancel()
		// Wait for some time so that any background tasks are properly
		// canceled and their cancelation messages are displayed in the

Nit: cancellation

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):

		// canceled and their cancelation messages are displayed in the
		// logs accordingly.
		time.Sleep(5 * time.Second)

This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):

			}

			return fmt.Errorf("background step `%s` returned error: %w", event.Name, event.Err)

What if a background step fails concurrently with some other step? The error(s) in the stepsErr won't be reported.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):

}

// stepError augments generates a `testFailure` error by augmenting

Nit: remove augments

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):

}

func (br *backgroundRunner) Events() <-chan backgroundEvent {

CompletedEvents seems more precise.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):

}

// Background allows test writers to create functions that run in the

Nit: writers --> author

srosenberg

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

When do they stop normally?

It's worth clarifying that userFunc itself is executed locally, i.e., same node where runner is executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):

	idGen func() int,
	prng *rand.Rand,
	nodes option.NodeListOption,

nodes is unused

srosenberg

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):

func newBackgroundRunner(ctx context.Context) *backgroundRunner {
	g, _ := errgroup.WithContext(ctx)

I had to pause here to confirm the right context is passed below. Have you considered using ctxgroup.WithContext(ctx) instead? The wrapper makes juggling of parent and child contexts less error-prone.

`commandbuilder.go` adds the `roachtestutil.Command` type, a thin wrapper that helps tests build complex commands to be run in a cluster. The way roachtests deal with this right now is by a combination of string concatenation and ad-hoc `fmt.Sprint` calls. This has at least two drawbacks: * it makes it difficult to piece together the command that is actually being run in a cluster, especially if conditionals and string formatters are involved; * if a function takes a command (string) as parameter, there's no easy way to either verify if a flag is present or to safely add a command line flag. The `roachtestutil.Command` type aims to solve both of the shortcomings. Over time, roachtests can move from raw strings to this type as needed. In addition, as `roachtest` starts to provide higher level APIs for test writers, having a more programmable interface to commands is desirable. Epic: CRDB-19321 Release note: None

This adds a `BackgroundFunc` API to the `mixedversion` package in roachtest, allowing test writers to run tasks in the background during an upgrade test. The most common use-case for this functionality is running a workload while the cluster upgrades (other similar use-cases exist in a variety of tests); for this reason, a `Workload` convenience function is added that allows tests to add a workload to a mixed-version test with one function call. Currently, each test needs to devise their own mechanism to: spawn the background task; monitor its execution; and terminate the test on error. The current API aims to reduce copying and pasting of such logic, making for a more declarative test. In the future, the test planner itself could decide to run some steps in the background and it should be able to leverage the mechanisms introduced in this commit. Epic: CRDB-19321 Release note: None

renatolabs

RFAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)

pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Double-formatting; Flag already applies Sprint.

Fixed.

pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It's a canonical representation, right? (Since it's ordered by flag names.)

Indeed. Comment updated.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It's worth clarifying that userFunc itself is executed locally, i.e., same node where runner is executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?

Added words to this function's documentation and also to the userFunc type.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Why not use different seeds if neither initCmd nor runCmd has specified it?

I'm not sure I understand the suggestion.

The idea here is that if the user provided a specific seed, they probably have good reason to do it (e.g., reading fixtures from a bucket). If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds (since they are now random by default), so we can set anything we want.

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

nodes is unused

Huh, I guess that's something we'll miss from golint. Thanks for catching.

pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.

I was thinking people could be confused when reading "with 0s delay", but I also see your point. Added the delay message unconditionally.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Maybe rename to bgCount and describe that it denotes the number of created background steps; it's not immediately clear.

Done.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Methinks summarized might be better.

Renamed.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.

Added comment to explain.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What's the default timeout for db.Close() and hence the upper-bound for defers?

I did a quick search and didn't find anything conclusive; I don't think that call has a builtin timeout mechanism (?). That said, the old mixed-version framework (newMixedVersionTest and friends) have the same db.Close() loop at the end of the test and I don't remember that being a source of headaches in the past, so I'm thinking we can leave this as-is until it becomes a problem.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
So... I'm mostly used to spell cancelled and cancellation myself, but I noticed Go uses context.Canceled so I started using canceled and cancelation here. If we are to trust sapling.ai (first result on a Google search):

Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).

https://sapling.ai/usage/cancelation-vs-cancellation

🤷 I'm fine either way but opted for consistency.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)

Yep, it's here just for convenience and does not impact correctness. The main goal, as the comment says, is to allow the context cancelation to be observed by any other background functions so that they log their termination appropriately.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What if a background step fails concurrently with some other step? The error(s) in the stepsErr won't be reported.

It won't be reported in the main test.log, but the failure won't be lost -- it should still be captured in the step's logger:

https://github.com/cockroachdb/cockroach/pull/96990/files#diff-6976b20431cd2998815618de4e9abcc6cf7d05a6c009a642821d1eef440de7cfR295

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: remove augments

Done.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I had to pause here to confirm the right context is passed below. Have you considered using ctxgroup.WithContext(ctx) instead? The wrapper makes juggling of parent and child contexts less error-prone.

I was on the edge of restructuring how these contexts are handled so this comment was the push I needed to actually do it. Test runner now uses ctxgroup; downside is that userFunc now takes a context.Context parameter (another one), but I think things are much better now: the test runner's context is the parent context of every mixed-version related context.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

CompletedEvents seems more precise.

Renamed.

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: writers --> author

Done.

srosenberg

Thanks for addressing all comments! I clarified the thing about using different PRNG seeds for init and load, but it's a super minor point. Otherwise, this looks great!

Reviewed 2 of 7 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds

Right. My suggestion is why not use different seeds in this case? seed is the same in both calls to addSeed

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

So... I'm mostly used to spell cancelled and cancellation myself, but I noticed Go uses context.Canceled so I started using canceled and cancelation here. If we are to trust sapling.ai (first result on a Google search):

Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).

https://sapling.ai/usage/cancelation-vs-cancellation

🤷 I'm fine either way but opted for consistency.

Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.

herkolategan

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)

pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.

I'm chuffed with the outcome of this comment 😄

renatolabs

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)

pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds

Right. My suggestion is why not use different seeds in this case? seed is the same in both calls to addSeed

Oh, right! The reason I'm using the same seed is that some workloads rely on the same seed being passed to both init and run (admittedly, rand is the only one that comes to mind where that dependency exists, but there could be others). Using the same seed is a safe choice.

renatolabs · 2023-03-09T14:28:57Z

bors r=srosenberg

TFTR!

craig · 2023-03-09T15:06:51Z

Build succeeded:

Bazel Essential CI (Cockroach)

renatolabs requested a review from a team as a code owner February 10, 2023 23:16

renatolabs requested review from smg260 and srosenberg and removed request for a team February 10, 2023 23:16

renatolabs mentioned this pull request Feb 10, 2023

roachtest: improvements to mixedversion package #96989

Merged

renatolabs mentioned this pull request Feb 10, 2023

roachtest: update mixed-version backup to use new framework #96991

Merged

renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch 3 times, most recently from c57360a to 3d4f8a8 Compare February 10, 2023 23:55

renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from 3d4f8a8 to c930a70 Compare February 24, 2023 19:41

renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from c930a70 to e70dac8 Compare February 24, 2023 19:45

srosenberg requested a review from herkolategan March 7, 2023 06:17

srosenberg reviewed Mar 7, 2023

View reviewed changes

Renato Costa added 2 commits March 8, 2023 11:39

renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from e70dac8 to b16ffe1 Compare March 8, 2023 21:24

renatolabs commented Mar 8, 2023

View reviewed changes

srosenberg approved these changes Mar 9, 2023

View reviewed changes

herkolategan reviewed Mar 9, 2023

View reviewed changes

renatolabs commented Mar 9, 2023

View reviewed changes

craig bot merged commit dcd687c into cockroachdb:master Mar 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: support running steps in the background in `mixedversion`#96990

roachtest: support running steps in the background in `mixedversion`#96990
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/2-mixed-version-roachtest-background

renatolabs commented Feb 10, 2023

Uh oh!

cockroach-teamcity commented Feb 10, 2023

Uh oh!

renatolabs commented Feb 10, 2023

Uh oh!

renatolabs commented Feb 24, 2023

Uh oh!

srosenberg left a comment

Uh oh!

srosenberg left a comment

Uh oh!

srosenberg left a comment

Uh oh!

renatolabs left a comment

Uh oh!

srosenberg left a comment

Uh oh!

herkolategan left a comment

Uh oh!

renatolabs left a comment

Uh oh!

renatolabs commented Mar 9, 2023

Uh oh!

craig bot commented Mar 9, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

renatolabs commented Feb 10, 2023

Uh oh!

cockroach-teamcity commented Feb 10, 2023

Uh oh!

renatolabs commented Feb 10, 2023

Uh oh!

renatolabs commented Feb 24, 2023

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

herkolategan left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Mar 9, 2023

Uh oh!

craig bot commented Mar 9, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants