Roachtest SSH Retries by smg260 · Pull Request #90451 · cockroachdb/cockroach

smg260 · 2022-10-21T16:16:27Z

roachtest: This change introduces a default retry when encountering an error of 255 (SSH). It is transparent to callers. If desired, it would be relatively simple to expose this via the cluster interface and allow callers to specify a retry handler/options on a per command basis, perhaps something like:

// retry if we get an exit code of 12
retryOpts { 
	opt: retry.Options { .. } , 
	shouldRetryFn: func(res RunResultDetails) bool { return res.RemoteExitStatus == 12 }
}
c.RunE(ctx, nodes, retryOpts, "my_cmd")

but this PR does not introduce that.

Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from f(i int) ([]byte, error) to f(i int) RunResultDetails, error. RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error - something which existing code depends on.

Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. This PR has default retry handling on RunResultDetails.RemoteExitStatus == 255.

Notable changes

Retry on 255 exit code for any function executed via ParallelE
Consolidated Run/RunWithDetails (significant duplication of code)
Classify exit error at the session.go level so that it is available for all commands and to all callers
Remove "echo $?" option to print exit code at end of cmd. It is not clear why this was done. (Perhaps for visibility in logs?) Accompanying function to parse out the status has also been removed. The exit code is still accessible via the RunResultDetails
Update roachtests to check for SSH marker error instead of checking for !install.NonZeroExitCode

Resolves: #73542
Release note: None
Epic: CRDB-21386

cockroach-teamcity · 2022-10-21T16:16:41Z

This change is

tbg

Looks good! I didn't review in detail. Left some inline comments - apologies if some of these are things you're already doing that I simply missed.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)

pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):

			result, err := c.RunWithDetailsSingleNode(ctx, t.L(), node, fmt.Sprintf(djangoRunTestCmd, testName))

			if errors.Is(err, rperrors.ErrSSH255) {

Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.

True, something like this wouldn't get captured:

err := c.RunE(/* stuff */)
if err != nil {
  t.Fatalf("doing something stupid with this error: %s", err.Error())
}

but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.

pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):

// If the `shouldRetryFn` is not specified (nil), then no retries will be
// performed.
func runWithMaybeRetry(

Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.

We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.

smg260

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.

True, something like this wouldn't get captured:
err := c.RunE(/* stuff */)
if err != nil {
  t.Fatalf("doing something stupid with this error: %s", err.Error())
}
but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.

I've added a paraphrased version of the previous explanatory comment. Basically, c.RunWithDetailsSingleNode may encounter an expected command error, in which case the test can proceed. We want to mark abort on an unexpected roachprod or SSH error

pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.

We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.

Now wrapping in a new error, and marking it with a retry reference error.

Agreed on optionality - internally it is optional, externally it could be optional as per example in my first comment. If it makes sense, I will add as additional functionality in a follow up.

renatolabs

Nice work! 👏 Left a few comments, most of them optional.

Reviewed 12 of 16 files at r1, 1 of 5 files at r5, 2 of 4 files at r6, 4 of 4 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)

pkg/roachprod/install/cluster_synced.go line 631 at r8 (raw file):

// Error vs result
// An error is an unexpected state within roachprod
// A result is the output of running a cmd (could be interpreted as an error)

Nit: I generally prefer (and seems to be the pattern in other function comments in the codebase as well) to see documentation in the form of full sentences. The documentation here reads more as a note and it took me a while to connect what's here with the function itself.

pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):

//
// stdout: Where stdout messages are written
// stderr: Where stderr messages areå written

🔍

pkg/roachprod/install/cluster_synced.go line 715 at r8 (raw file):

	results := make([]RunResultDetails, len(nodes))
	if _, err := c.ParallelE(l, display, len(nodes), 0, func(i int) (RunResultDetails, error) {
		//err is a non command failure

Nit lvl 100: space between // and comment.

pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

}

func selectError(results []RunResultDetails, stream bool, stdout io.Writer) error {

Do we need this separate function? It's unexpected, to me, that selectError is the function that prints errors to stdout.

pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):

		return result, err
	}); err != nil {
		l.Printf("%v", err)

Is this intended? If so, we should add more context.

pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):

}

func getRunResult(f func() error, res *RunResultDetails) {

Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in (err) as parameter and let the caller call the function?

Also, do we need to reuse the RunResultDetails struct here? It's more natural to me if these were immutable and returned a new instance when called instead.

pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):

}

// selectError returns the error from the RunResultDetails with the highest RemoteExitStatus

Maybe a comment as to why this is a good way to prioritize?

renatolabs · 2022-11-09T18:05:36Z

I kicked the tires a bit, and the retries seem to work well in a basic roachtest I wrote. I have a few questions that I think we should have an answer for before merging:

I don't see intermediary errors in the test logs; I only see the final error persisted after 3 retries.
I suspect a fair percentage of SSH-related problems (255) we see happen for long running processes (such as workload). For example, in this failure, the workload ran for 16 minutes before the process exited with exit code 255. What's the impact of this change in failures like that?
Do we understand the conditions in which a test time out can lead to 255 exit status? Retrying a command that timed out sounds bad.

pkg/cmd/roachtest/tests/BUILD.bazel

smg260

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Do we need this separate function? It's unexpected, to me, that selectError is the function that prints errors to stdout.

I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.

pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is this intended? If so, we should add more context.

It was there to satisfy a linter since we are ignoring both outputs of the c.ParallelE. Hopefully what's there now is ok.

pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in (err) as parameter and let the caller call the function?

Also, do we need to reuse the RunResultDetails struct here? It's more natural to me if these were immutable and returned a new instance when called instead.

Done. I'll defer to your experience regarding idiomatic golang code. Still a noob in that regard :) The intent of the func param initially was to show that this function was merely wrapping/enriching the original call.

pkg/roachprod/install/cluster_synced.go line 2220 at r8 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I am only seeing two returns out of ParallelE,

	if len(failed) > 0 {
		return failed, errors.New("one or more parallel execution failure")
	}
	return nil, nil

Might have been looking at an old revision. I removed my comment - the confusion stemmed from me these function comments .

// ParallelE runs the given function in parallel across the given
// nodes, returning an error if function returns an error.
...
// If err is non-nil, the slice of ParallelResults will contain the
// results from any of the failed invocations.

It implies that if any node returns any error, then ParallelE will return an error. It actually will only do that for a roachprod error (as opposed to an error over SSH)

smg260

Thanks - I also tested with a simple roachtest.

You should see all errors in the failure.log. I've since also implemented CombineErrors to preserve previous failures
Failures for long running processes will retry, but ultimately will still be bound by the overall timeout configured for the test.
I could not reproduce that case, but like above, after a timeout, SIGQUIT and then SIGKILL will be sent to the process.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

smg260

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

pkg/cmd/roachtest/tests/BUILD.bazel line 50 at r19 (raw file):

Previously, renatolabs (Renato Costa) wrote…

👀

Done.

pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

🔍

Done.

pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Maybe a comment as to why this is a good way to prioritize?

Done.

srosenberg

The latest iteration looks good, only minor comments. Improvements look great, but there were also quite a few changes, hence the added caution.
Assuming (a) roachprod-stress didn't reveal any regressions and @renatolabs, @herkolategan feel confident about the changes, let's merge!

Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 13 of 13 files at r16, 9 of 9 files at r17, 5 of 5 files at r18, 2 of 2 files at r19, 1 of 3 files at r20.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)

pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, smg260 (Miral Gadani) wrote…

I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.

Maybe renaming would denote more clearly what this func is doing? E.g.,

// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus.
// Optionally, if stream==false, each node's combined output is logged to stdout.

func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error

pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):

// selectError returns the error from the RunResultDetails with the highest RemoteExitStatus.
// This preserves the heuristic in the previously implemented and used `rperrors.SelectPriorityError`.

I'd remove this comment line since it's not going to be helpful after this PR is merged.

herkolategan

Great addition! Looking forward to the increased stability this provides. Left one or two comments where a comment didn't make sense to me.

Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 10 of 13 files at r16, 7 of 9 files at r17, 2 of 5 files at r18, 2 of 3 files at r20, 4 of 4 files at r21, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @smg260)

pkg/cmd/roachtest/tests/gopg.go line 119 at r21 (raw file):

		)

		// Fatal for a roachprod err or SSH error. Roachprod error is when result.Err==nil

I'm struggling to make sense of the wording on the second sentence here, although I might be missing something. Saw it in a few other places, but only mentioning here.

pkg/roachprod/install/cluster_synced.go line 750 at r21 (raw file):

	results := make([]*RunResultDetails, len(nodes))

	// A result is the output of running a exitCode (could be interpreted as an error)

Nit: an exitCode

smg260

Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)

pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Maybe renaming would denote more clearly what this func is doing? E.g.,

// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus.
// Optionally, if stream==false, each node's combined output is logged to stdout.

func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error

Done.

pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I'd remove this comment line since it's not going to be helpful after this PR is merged.

Done.

…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386

smg260 · 2022-11-15T17:10:24Z

bors r+

craig · 2022-11-15T18:00:20Z

Build succeeded:

Bazel Essential CI (Cockroach)

smg260 · 2022-12-07T15:47:19Z

blathers backport 22.1 22.2

blathers-crl · 2022-12-07T15:47:26Z

Encountered an error creating backports. Some common things that can go wrong:

The backport branch might have already existed.
There was a merge conflict.
The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.

error creating merge commit from ae0ad5c to blathers/backport-release-22.1-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1 failed. See errors above.

error creating merge commit from ae0ad5c to blathers/backport-release-22.2-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.

_{🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.}

smg260 changed the title ~~roachtest: Reassign roachtest failures which may have been caused by~~ WIP - Retry Oct 21, 2022

smg260 force-pushed the roachtest_ssh_retry_simple branch 16 times, most recently from 55a8800 to 5cec027 Compare November 1, 2022 01:16

smg260 marked this pull request as ready for review November 1, 2022 02:43

smg260 requested a review from a team as a code owner November 1, 2022 02:43

smg260 requested review from herkolategan and renatolabs and removed request for a team November 1, 2022 02:43

smg260 changed the title ~~WIP - Retry~~ Roachtest SSH Retries Nov 1, 2022

smg260 requested review from srosenberg and tbg November 2, 2022 12:48

tbg reviewed Nov 7, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch 2 times, most recently from cc306a9 to 3fbed39 Compare November 7, 2022 21:01

smg260 commented Nov 7, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch from 3fbed39 to f2c44ff Compare November 9, 2022 15:20

renatolabs reviewed Nov 9, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch from f2c44ff to ce1dd39 Compare November 9, 2022 15:28

renatolabs reviewed Nov 9, 2022

View reviewed changes

pkg/cmd/roachtest/tests/BUILD.bazel Outdated Show resolved Hide resolved

smg260 force-pushed the roachtest_ssh_retry_simple branch from cf55c3c to 78a8c2d Compare November 9, 2022 20:12

smg260 commented Nov 9, 2022

View reviewed changes

smg260 commented Nov 10, 2022

View reviewed changes

srosenberg approved these changes Nov 11, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch from 78a8c2d to 3991bd5 Compare November 14, 2022 14:56

herkolategan approved these changes Nov 14, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch from 3991bd5 to 775beac Compare November 14, 2022 16:50

smg260 commented Nov 14, 2022

View reviewed changes

smg260 force-pushed the roachtest_ssh_retry_simple branch from 775beac to ae0ad5c Compare November 14, 2022 17:43

craig bot merged commit 5a24e11 into cockroachdb:master Nov 15, 2022

aliher1911 mentioned this pull request Dec 1, 2022

roachtest: tpccbench/nodes=3/cpu=4 failed #92804

Closed

This was referenced Dec 6, 2022

roachtest: schemachange/bulkingest failed #91802

Closed

roachtest: kv0/enc=true/nodes=1 failed #92795

Closed

roachtest: tpccbench/nodes=3/cpu=4 failed #92800

Closed

roachtest: tpcc/w=100/nodes=3/chaos=true failed #92797

Closed

smg260 mentioned this pull request Dec 9, 2022

roachtest: use structured errors #93328

Merged

This was referenced Jan 17, 2023

roachtest: fix zipping of artifacts to include other zips #95374

Closed

release-22.1: roachtest structured errors and ssh/scp retries #95448

Merged

srosenberg mentioned this pull request Jun 5, 2023

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

Closed

Conversation

smg260 commented Oct 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Oct 21, 2022

Uh oh!

tbg left a comment

Choose a reason for hiding this comment

Uh oh!

smg260 left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Nov 9, 2022

Uh oh!

Uh oh!

smg260 left a comment

Choose a reason for hiding this comment

Uh oh!

smg260 left a comment

Choose a reason for hiding this comment

Uh oh!

smg260 left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg left a comment

Choose a reason for hiding this comment

Uh oh!

herkolategan left a comment

Choose a reason for hiding this comment

Uh oh!

smg260 left a comment

Choose a reason for hiding this comment

Uh oh!

smg260 commented Nov 15, 2022

Uh oh!

craig bot commented Nov 15, 2022

Uh oh!

smg260 commented Dec 7, 2022

Uh oh!

blathers-crl bot commented Dec 7, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

smg260 commented Oct 21, 2022 •

edited

Loading