Skip to content

roachtest: support running steps in the background in mixedversion#96990

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/2-mixed-version-roachtest-background
Mar 9, 2023
Merged

roachtest: support running steps in the background in mixedversion#96990
craig[bot] merged 2 commits intocockroachdb:masterfrom
renatolabs:rc/2-mixed-version-roachtest-background

Conversation

@renatolabs
Copy link
Copy Markdown

This adds a BackgroundFunc API to the mixedversion package in
roachtest, allowing test writers to run tasks in the background during
an upgrade test. The most common use-case for this functionality is
running a workload while the cluster upgrades (other similar use-cases
exist in a variety of tests); for this reason, a Workload
convenience function is added that allows tests to add a workload to a
mixed-version test with one function call.

Currently, each test needs to devise their own mechanism to: spawn the
background task; monitor its execution; and terminate the test on
error.

The current API aims to reduce copying and pasting of such logic,
making for a more declarative test. In the future, the test planner
itself could decide to run some steps in the background and it should
be able to leverage the mechanisms introduced in this commit.

Epic: CRDB-19321

Release note: None

@renatolabs renatolabs requested a review from a team as a code owner February 10, 2023 23:16
@renatolabs renatolabs requested review from smg260 and srosenberg and removed request for a team February 10, 2023 23:16
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@renatolabs
Copy link
Copy Markdown
Author

This is part of a stacked PR comprised of 3 parts. Please review them in order.

  1. roachtest: improvements to mixedversion package #96989
  2. roachtest: support running steps in the background in mixedversion #96990 (this one!)
  3. roachtest: update mixed-version backup to use new framework #96991

To be reviewed in this PR: two last commits.

@renatolabs renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch 3 times, most recently from c57360a to 3d4f8a8 Compare February 10, 2023 23:55
@renatolabs renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from 3d4f8a8 to c930a70 Compare February 24, 2023 19:41
@renatolabs
Copy link
Copy Markdown
Author

@renatolabs renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from c930a70 to e70dac8 Compare February 24, 2023 19:45
@srosenberg srosenberg requested a review from herkolategan March 7, 2023 06:17
Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I'll take another pass tomorrow.

Reviewed 4 of 4 files at r1, 1 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)


pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):

func (c *Command) MaybeFlag(condition bool, name string, val interface{}) *Command {
	if condition {
		return c.Flag(name, fmt.Sprint(val))

Double-formatting; Flag already applies Sprint.


pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):

}

// String returns the string representation of the command, which can

It's a canonical representation, right? (Since it's ordered by flag names.)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

// background during the test. Background functions are kicked off
// once the cluster has been initialized (i.e., after all startup
// steps have finished).

When do they stop normally?


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

	addSeed := func(cmd *roachtestutil.Command) {
		if !cmd.HasFlag("seed") {
			cmd.Flag("seed", seed)

Why not use different seeds if neither initCmd nor runCmd has specified it?


pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):

		var delayStr string
		if s.delay.Milliseconds() > 0 {
			delayStr = fmt.Sprintf("after %s delay", s.delay)

The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):

		ctx             context.Context
		testContext     *Context
		backgroundCount int64

Maybe rename to bgCount and describe that it denotes the number of created background steps; it's not immediately clear.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):

	testFailure struct {
		displayed             bool

Methinks summarized might be better.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):

		seed                  int64
		binaryVersions        []roachpb.Version
		clusterVersionsBefore []roachpb.Version

Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):

// each step in sequence.
func (tr *testRunner) run() error {
	defer tr.closeConnections()

What's the default timeout for db.Close() and hence the upper-bound for defers?


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

		tr.cancel()
		// Wait for some time so that any background tasks are properly
		// canceled and their cancelation messages are displayed in the

Nit: cancellation


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):

		// canceled and their cancelation messages are displayed in the
		// logs accordingly.
		time.Sleep(5 * time.Second)

This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):

			}

			return fmt.Errorf("background step `%s` returned error: %w", event.Name, event.Err)

What if a background step fails concurrently with some other step? The error(s) in the stepsErr won't be reported.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):

}

// stepError augments generates a `testFailure` error by augmenting

Nit: remove augments


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):

}

func (br *backgroundRunner) Events() <-chan backgroundEvent {

CompletedEvents seems more precise.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):

}

// Background allows test writers to create functions that run in the

Nit: writers --> author

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

When do they stop normally?

It's worth clarifying that userFunc itself is executed locally, i.e., same node where runner is executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):

	idGen func() int,
	prng *rand.Rand,
	nodes option.NodeListOption,

nodes is unused

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):

func newBackgroundRunner(ctx context.Context) *backgroundRunner {
	g, _ := errgroup.WithContext(ctx)

I had to pause here to confirm the right context is passed below. Have you considered using ctxgroup.WithContext(ctx) instead? The wrapper makes juggling of parent and child contexts less error-prone.

Renato Costa added 2 commits March 8, 2023 11:39
`commandbuilder.go` adds the `roachtestutil.Command` type, a thin
wrapper that helps tests build complex commands to be run in a
cluster. The way roachtests deal with this right now is by a
combination of string concatenation and ad-hoc `fmt.Sprint`
calls. This has at least two drawbacks:

* it makes it difficult to piece together the command that is actually
being run in a cluster, especially if conditionals and string
formatters are involved;

* if a function takes a command (string) as parameter, there's no easy
way to either verify if a flag is present or to safely add a command
line flag.

The `roachtestutil.Command` type aims to solve both of the
shortcomings. Over time, roachtests can move from raw strings to this
type as needed. In addition, as `roachtest` starts to provide higher
level APIs for test writers, having a more programmable interface to
commands is desirable.

Epic: CRDB-19321

Release note: None
This adds a `BackgroundFunc` API to the `mixedversion` package in
roachtest, allowing test writers to run tasks in the background during
an upgrade test. The most common use-case for this functionality is
running a workload while the cluster upgrades (other similar use-cases
exist in a variety of tests); for this reason, a `Workload`
convenience function is added that allows tests to add a workload to a
mixed-version test with one function call.

Currently, each test needs to devise their own mechanism to: spawn the
background task; monitor its execution; and terminate the test on
error.

The current API aims to reduce copying and pasting of such logic,
making for a more declarative test. In the future, the test planner
itself could decide to run some steps in the background and it should
be able to leverage the mechanisms introduced in this commit.

Epic: CRDB-19321

Release note: None
@renatolabs renatolabs force-pushed the rc/2-mixed-version-roachtest-background branch from e70dac8 to b16ffe1 Compare March 8, 2023 21:24
Copy link
Copy Markdown
Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 65 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Double-formatting; Flag already applies Sprint.

Fixed.


pkg/cmd/roachtest/roachtestutil/commandbuilder.go line 96 at r1 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It's a canonical representation, right? (Since it's ordered by flag names.)

Indeed. Comment updated.


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 299 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

It's worth clarifying that userFunc itself is executed locally, i.e., same node where runner is executing. However, it may end up executing commands on remote nodes. Note, this is a confounding factor for the API user; i.e., does the framework ensure my closure runs remotely, or am I responsible for it?

Added words to this function's documentation and also to the userFunc type.


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Why not use different seeds if neither initCmd nor runCmd has specified it?

I'm not sure I understand the suggestion.

The idea here is that if the user provided a specific seed, they probably have good reason to do it (e.g., reading fixtures from a bucket). If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds (since they are now random by default), so we can set anything we want.


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 680 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

nodes is unused

Huh, I guess that's something we'll miss from golint. Thanks for catching.


pkg/cmd/roachtest/roachtestutil/mixedversion/planner.go line 252 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

The original version is debug-friendlier than the optimized one. E.g., if we see "0 delay" frequently, that may indicate a biased PRNG.

I was thinking people could be confused when reading "with 0s delay", but I also see your point. Added the delay message unconditionally.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 42 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Maybe rename to bgCount and describe that it denotes the number of created background steps; it's not immediately clear.

Done.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 61 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Methinks summarized might be better.

Renamed.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 65 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Before and after what? One is read from the cache, the next one after refresh. It's worth a comment.

Added comment to explain.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 116 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What's the default timeout for db.Close() and hence the upper-bound for defers?

I did a quick search and didn't find anything conclusive; I don't think that call has a builtin timeout mechanism (?). That said, the old mixed-version framework (newMixedVersionTest and friends) have the same db.Close() loop at the end of the test and I don't remember that being a source of headaches in the past, so I'm thinking we can leave this as-is until it becomes a problem.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):
So... I'm mostly used to spell cancelled and cancellation myself, but I noticed Go uses context.Canceled so I started using canceled and cancelation here. If we are to trust sapling.ai (first result on a Google search):

Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).

https://sapling.ai/usage/cancelation-vs-cancellation

🤷 I'm fine either way but opted for consistency.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 123 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

This is only best-effort, right? We know that no magic sleep value will suffice in the worst-case :)

Yep, it's here just for convenience and does not impact correctness. The main goal, as the comment says, is to allow the context cancelation to be observed by any other background functions so that they log their termination appropriately.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 147 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

What if a background step fails concurrently with some other step? The error(s) in the stepsErr won't be reported.

It won't be reported in the main test.log, but the failure won't be lost -- it should still be captured in the step's logger:

https://github.com/cockroachdb/cockroach/pull/96990/files#diff-6976b20431cd2998815618de4e9abcc6cf7d05a6c009a642821d1eef440de7cfR295


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 249 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: remove augments

Done.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 412 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I had to pause here to confirm the right context is passed below. Have you considered using ctxgroup.WithContext(ctx) instead? The wrapper makes juggling of parent and child contexts less error-prone.

I was on the edge of restructuring how these contexts are handled so this comment was the push I needed to actually do it. Test runner now uses ctxgroup; downside is that userFunc now takes a context.Context parameter (another one), but I think things are much better now: the test runner's context is the parent context of every mixed-version related context.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 434 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

CompletedEvents seems more precise.

Renamed.


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 484 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: writers --> author

Done.

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all comments! I clarified the thing about using different PRNG seeds for init and load, but it's a super minor point. Otherwise, this looks great!

Reviewed 2 of 7 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds

Right. My suggestion is why not use different seeds in this case? seed is the same in both calls to addSeed


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

Previously, renatolabs (Renato Costa) wrote…

So... I'm mostly used to spell cancelled and cancellation myself, but I noticed Go uses context.Canceled so I started using canceled and cancelation here. If we are to trust sapling.ai (first result on a Google search):

Cancelation is predominantly used in 🇺🇸 American (US) English (en-US) while cancellation is predominantly used in 🇬🇧 British English (used in UK/AU/NZ) (en-GB).

https://sapling.ai/usage/cancelation-vs-cancellation

🤷 I'm fine either way but opted for consistency.

Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.

Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)


pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 121 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Hehe, I didn't know if was a chiefly British thing :) Personally, I prefer British English, it's so more nuanced; let's stick with that.

I'm chuffed with the outcome of this comment 😄

Copy link
Copy Markdown
Author

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @smg260 and @srosenberg)


pkg/cmd/roachtest/roachtestutil/mixedversion/mixedversion.go line 328 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

If there's no seed in either of these commands, it means the caller is fine with arbitrary seeds

Right. My suggestion is why not use different seeds in this case? seed is the same in both calls to addSeed

Oh, right! The reason I'm using the same seed is that some workloads rely on the same seed being passed to both init and run (admittedly, rand is the only one that comes to mind where that dependency exists, but there could be others). Using the same seed is a safe choice.

@renatolabs
Copy link
Copy Markdown
Author

bors r=srosenberg

TFTR!

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 9, 2023

Build succeeded:

@craig craig bot merged commit dcd687c into cockroachdb:master Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants