Skip to content

Roachtest SSH Retries#90451

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_ssh_retry_simple
Nov 15, 2022
Merged

Roachtest SSH Retries#90451
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_ssh_retry_simple

Conversation

@smg260
Copy link
Copy Markdown
Contributor

@smg260 smg260 commented Oct 21, 2022

roachtest: This change introduces a default retry when encountering an error of 255 (SSH). It is transparent to callers. If desired, it would be relatively simple to expose this via the cluster interface and allow callers to specify a retry handler/options on a per command basis, perhaps something like:

// retry if we get an exit code of 12
retryOpts { 
	opt: retry.Options { .. } , 
	shouldRetryFn: func(res RunResultDetails) bool { return res.RemoteExitStatus == 12 }
}
c.RunE(ctx, nodes, retryOpts, "my_cmd")

but this PR does not introduce that.

Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from f(i int) ([]byte, error) to f(i int) RunResultDetails, error. RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error - something which existing code depends on.

Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. This PR has default retry handling on RunResultDetails.RemoteExitStatus == 255.

Notable changes

  • Retry on 255 exit code for any function executed via ParallelE
  • Consolidated Run/RunWithDetails (significant duplication of code)
  • Classify exit error at the session.go level so that it is available for all commands and to all callers
  • Remove "echo $?" option to print exit code at end of cmd. It is not clear why this was done. (Perhaps for visibility in logs?) Accompanying function to parse out the status has also been removed. The exit code is still accessible via the RunResultDetails
  • Update roachtests to check for SSH marker error instead of checking for !install.NonZeroExitCode

Resolves: #73542
Release note: None
Epic: CRDB-21386

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@smg260 smg260 changed the title roachtest: Reassign roachtest failures which may have been caused by WIP - Retry Oct 21, 2022
@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch 16 times, most recently from 55a8800 to 5cec027 Compare November 1, 2022 01:16
@smg260 smg260 marked this pull request as ready for review November 1, 2022 02:43
@smg260 smg260 requested a review from a team as a code owner November 1, 2022 02:43
@smg260 smg260 requested review from herkolategan and renatolabs and removed request for a team November 1, 2022 02:43
@smg260 smg260 changed the title WIP - Retry Roachtest SSH Retries Nov 1, 2022
@smg260 smg260 requested review from srosenberg and tbg November 2, 2022 12:48
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I didn't review in detail. Left some inline comments - apologies if some of these are things you're already doing that I simply missed.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)


pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):

			result, err := c.RunWithDetailsSingleNode(ctx, t.L(), node, fmt.Sprintf(djangoRunTestCmd, testName))

			if errors.Is(err, rperrors.ErrSSH255) {

Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.

True, something like this wouldn't get captured:

err := c.RunE(/* stuff */)
if err != nil {
  t.Fatalf("doing something stupid with this error: %s", err.Error())
}

but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.


pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):

// If the `shouldRetryFn` is not specified (nil), then no retries will be
// performed.
func runWithMaybeRetry(

Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.

We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch 2 times, most recently from cc306a9 to 3fbed39 Compare November 7, 2022 21:01
Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.

True, something like this wouldn't get captured:

err := c.RunE(/* stuff */)
if err != nil {
  t.Fatalf("doing something stupid with this error: %s", err.Error())
}

but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.

I've added a paraphrased version of the previous explanatory comment. Basically, c.RunWithDetailsSingleNode may encounter an expected command error, in which case the test can proceed. We want to mark abort on an unexpected roachprod or SSH error


pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.

We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.

Now wrapping in a new error, and marking it with a retry reference error.

Agreed on optionality - internally it is optional, externally it could be optional as per example in my first comment. If it makes sense, I will add as additional functionality in a follow up.

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from 3fbed39 to f2c44ff Compare November 9, 2022 15:20
Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! 👏 Left a few comments, most of them optional.

Reviewed 12 of 16 files at r1, 1 of 5 files at r5, 2 of 4 files at r6, 4 of 4 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)


pkg/roachprod/install/cluster_synced.go line 631 at r8 (raw file):

// Error vs result
// An error is an unexpected state within roachprod
// A result is the output of running a cmd (could be interpreted as an error)

Nit: I generally prefer (and seems to be the pattern in other function comments in the codebase as well) to see documentation in the form of full sentences. The documentation here reads more as a note and it took me a while to connect what's here with the function itself.


pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):

//
// stdout: Where stdout messages are written
// stderr: Where stderr messages areå written

🔍


pkg/roachprod/install/cluster_synced.go line 715 at r8 (raw file):

	results := make([]RunResultDetails, len(nodes))
	if _, err := c.ParallelE(l, display, len(nodes), 0, func(i int) (RunResultDetails, error) {
		//err is a non command failure

Nit lvl 100: space between // and comment.


pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

}

func selectError(results []RunResultDetails, stream bool, stdout io.Writer) error {

Do we need this separate function? It's unexpected, to me, that selectError is the function that prints errors to stdout.


pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):

		return result, err
	}); err != nil {
		l.Printf("%v", err)

Is this intended? If so, we should add more context.


pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):

}

func getRunResult(f func() error, res *RunResultDetails) {

Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in (err) as parameter and let the caller call the function?

Also, do we need to reuse the RunResultDetails struct here? It's more natural to me if these were immutable and returned a new instance when called instead.


pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):

}

// selectError returns the error from the RunResultDetails with the highest RemoteExitStatus

Maybe a comment as to why this is a good way to prioritize?

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from f2c44ff to ce1dd39 Compare November 9, 2022 15:28
@renatolabs
Copy link
Copy Markdown

I kicked the tires a bit, and the retries seem to work well in a basic roachtest I wrote. I have a few questions that I think we should have an answer for before merging:

  • I don't see intermediary errors in the test logs; I only see the final error persisted after 3 retries.
  • I suspect a fair percentage of SSH-related problems (255) we see happen for long running processes (such as workload). For example, in this failure, the workload ran for 16 minutes before the process exited with exit code 255. What's the impact of this change in failures like that?
  • Do we understand the conditions in which a test time out can lead to 255 exit status? Retrying a command that timed out sounds bad.

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from cf55c3c to 78a8c2d Compare November 9, 2022 20:12
Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Do we need this separate function? It's unexpected, to me, that selectError is the function that prints errors to stdout.

I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.


pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Is this intended? If so, we should add more context.

It was there to satisfy a linter since we are ignoring both outputs of the c.ParallelE. Hopefully what's there now is ok.


pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in (err) as parameter and let the caller call the function?

Also, do we need to reuse the RunResultDetails struct here? It's more natural to me if these were immutable and returned a new instance when called instead.

Done. I'll defer to your experience regarding idiomatic golang code. Still a noob in that regard :) The intent of the func param initially was to show that this function was merely wrapping/enriching the original call.


pkg/roachprod/install/cluster_synced.go line 2220 at r8 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I am only seeing two returns out of ParallelE,

	if len(failed) > 0 {
		return failed, errors.New("one or more parallel execution failure")
	}
	return nil, nil

Might have been looking at an old revision. I removed my comment - the confusion stemmed from me these function comments .

// ParallelE runs the given function in parallel across the given
// nodes, returning an error if function returns an error.
...
// If err is non-nil, the slice of ParallelResults will contain the
// results from any of the failed invocations.

It implies that if any node returns any error, then ParallelE will return an error. It actually will only do that for a roachprod error (as opposed to an error over SSH)

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - I also tested with a simple roachtest.

  • You should see all errors in the failure.log. I've since also implemented CombineErrors to preserve previous failures
  • Failures for long running processes will retry, but ultimately will still be bound by the overall timeout configured for the test.
  • I could not reproduce that case, but like above, after a timeout, SIGQUIT and then SIGKILL will be sent to the process.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/tests/BUILD.bazel line 50 at r19 (raw file):

Previously, renatolabs (Renato Costa) wrote…

👀

Done.


pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):

Previously, renatolabs (Renato Costa) wrote…

🔍

Done.


pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Maybe a comment as to why this is a good way to prioritize?

Done.

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

The latest iteration looks good, only minor comments. Improvements look great, but there were also quite a few changes, hence the added caution.
Assuming (a) roachprod-stress didn't reveal any regressions and @renatolabs, @herkolategan feel confident about the changes, let's merge!

Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 13 of 13 files at r16, 9 of 9 files at r17, 5 of 5 files at r18, 2 of 2 files at r19, 1 of 3 files at r20.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)


pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, smg260 (Miral Gadani) wrote…

I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.

Maybe renaming would denote more clearly what this func is doing? E.g.,

// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus.
// Optionally, if stream==false, each node's combined output is logged to stdout.

func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error 

pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):

// selectError returns the error from the RunResultDetails with the highest RemoteExitStatus.
// This preserves the heuristic in the previously implemented and used `rperrors.SelectPriorityError`.

I'd remove this comment line since it's not going to be helpful after this PR is merged.

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from 78a8c2d to 3991bd5 Compare November 14, 2022 14:56
Copy link
Copy Markdown
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition! Looking forward to the increased stability this provides. Left one or two comments where a comment didn't make sense to me.
:lgtm:

Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 10 of 13 files at r16, 7 of 9 files at r17, 2 of 5 files at r18, 2 of 3 files at r20, 4 of 4 files at r21, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @smg260)


pkg/cmd/roachtest/tests/gopg.go line 119 at r21 (raw file):

		)

		// Fatal for a roachprod err or SSH error. Roachprod error is when result.Err==nil

I'm struggling to make sense of the wording on the second sentence here, although I might be missing something. Saw it in a few other places, but only mentioning here.


pkg/roachprod/install/cluster_synced.go line 750 at r21 (raw file):

	results := make([]*RunResultDetails, len(nodes))

	// A result is the output of running a exitCode (could be interpreted as an error)

Nit: an exitCode

@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from 3991bd5 to 775beac Compare November 14, 2022 16:50
Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Maybe renaming would denote more clearly what this func is doing? E.g.,

// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus.
// Optionally, if stream==false, each node's combined output is logged to stdout.

func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error 

Done.


pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

I'd remove this comment line since it's not going to be helpful after this PR is merged.

Done.

…try when

encountering an error of 255 (SSH). The retry could be exposed to
callers fairly simply, but this PR does not introduce that.

Today there appears the concept of a "roachprod" and a "command error",
the former of which denotes an issue in roachprod handling code, the
latter representing an error executing a command over SSH. This PR
preserves this behaviour and introduces an updated function signature
from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error).

RunResultDetails has been expanded to capture information about the
execution of a command, such as stderr/our, exit code, error, attempt
number. Any roachprod error will result in an error being returned, and set
in RunResultDetails.Err. Any command error would be only set in
RunResultDetails.Err with the returned error being nil. This allows callers to
differentiate between a roachprod and a command error, which existing code
depends on.

Retry handling code can use a function's returned RunResultDetails to
determine whether a retry should take place. The default retry
handling occurs on `RunResultDetails.RemoteExitStatus == 255`

Release note: None
Epic: CRDB-21386
@smg260 smg260 force-pushed the roachtest_ssh_retry_simple branch from 775beac to ae0ad5c Compare November 14, 2022 17:43
@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Nov 15, 2022

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Nov 15, 2022

Build succeeded:

@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Dec 7, 2022

blathers backport 22.1 22.2

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Dec 7, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ae0ad5c to blathers/backport-release-22.1-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1 failed. See errors above.


error creating merge commit from ae0ad5c to blathers/backport-release-22.2-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: auto retry operations when failing due to ssh-related issues

6 participants