Conversation
55a8800 to
5cec027
Compare
tbg
left a comment
There was a problem hiding this comment.
Looks good! I didn't review in detail. Left some inline comments - apologies if some of these are things you're already doing that I simply missed.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):
result, err := c.RunWithDetailsSingleNode(ctx, t.L(), node, fmt.Sprintf(djangoRunTestCmd, testName)) if errors.Is(err, rperrors.ErrSSH255) {
Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.
True, something like this wouldn't get captured:
err := c.RunE(/* stuff */)
if err != nil {
t.Fatalf("doing something stupid with this error: %s", err.Error())
}but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.
pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):
// If the `shouldRetryFn` is not specified (nil), then no retries will be // performed. func runWithMaybeRetry(
Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.
We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.
cc306a9 to
3fbed39
Compare
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/django.go line 190 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why do we need this now? I would hope that your average roachtest wouldn't ever need to worry about SSH errors again, since if one occurs it would be caught by the test harness and the failure would be redirected away from the owning team.
True, something like this wouldn't get captured:
err := c.RunE(/* stuff */) if err != nil { t.Fatalf("doing something stupid with this error: %s", err.Error()) }but honestly I wouldn't worry about that - you've built things so that idiomatic failing as a result of an error will always be captured.
I've added a paraphrased version of the previous explanatory comment. Basically, c.RunWithDetailsSingleNode may encounter an expected command error, in which case the test can proceed. We want to mark abort on an unexpected roachprod or SSH error
pkg/roachprod/install/cluster_synced.go line 125 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Retries are only admissible if what we're invoking is idempotent, which is mostly not the case in practice (I think). I think it's a sane default policy to hope for the best (i.e. it's idempotent or the first execution didn't actually take place) but let's make sure the final error, if one results, is decorated with the eventual fact of a retry having been carried out. We don't want devs wondering why they get a particular error not realizing that there was an internal retry they didn't have on the radar. I see that you're logging, but also makes sense to wrap the reported error if it resulted from a retry.
We should also consider making the retry behavior optional for callers that want to opt out of it for a specific invocation.
Now wrapping in a new error, and marking it with a retry reference error.
Agreed on optionality - internally it is optional, externally it could be optional as per example in my first comment. If it makes sense, I will add as additional functionality in a follow up.
3fbed39 to
f2c44ff
Compare
renatolabs
left a comment
There was a problem hiding this comment.
Nice work! 👏 Left a few comments, most of them optional.
Reviewed 12 of 16 files at r1, 1 of 5 files at r5, 2 of 4 files at r6, 4 of 4 files at r8.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)
pkg/roachprod/install/cluster_synced.go line 631 at r8 (raw file):
// Error vs result // An error is an unexpected state within roachprod // A result is the output of running a cmd (could be interpreted as an error)
Nit: I generally prefer (and seems to be the pattern in other function comments in the codebase as well) to see documentation in the form of full sentences. The documentation here reads more as a note and it took me a while to connect what's here with the function itself.
pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):
// // stdout: Where stdout messages are written // stderr: Where stderr messages areå written
🔍
pkg/roachprod/install/cluster_synced.go line 715 at r8 (raw file):
results := make([]RunResultDetails, len(nodes)) if _, err := c.ParallelE(l, display, len(nodes), 0, func(i int) (RunResultDetails, error) { //err is a non command failure
Nit lvl 100: space between // and comment.
pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):
} func selectError(results []RunResultDetails, stream bool, stdout io.Writer) error {
Do we need this separate function? It's unexpected, to me, that selectError is the function that prints errors to stdout.
pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):
return result, err }); err != nil { l.Printf("%v", err)
Is this intended? If so, we should add more context.
pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):
} func getRunResult(f func() error, res *RunResultDetails) {
Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in (err) as parameter and let the caller call the function?
Also, do we need to reuse the RunResultDetails struct here? It's more natural to me if these were immutable and returned a new instance when called instead.
pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):
} // selectError returns the error from the RunResultDetails with the highest RemoteExitStatus
Maybe a comment as to why this is a good way to prioritize?
f2c44ff to
ce1dd39
Compare
|
I kicked the tires a bit, and the retries seem to work well in a basic roachtest I wrote. I have a few questions that I think we should have an answer for before merging:
|
cf55c3c to
78a8c2d
Compare
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Do we need this separate function? It's unexpected, to me, that
selectErroris the function that prints errors to stdout.
I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.
pkg/roachprod/install/cluster_synced.go line 752 at r8 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Is this intended? If so, we should add more context.
It was there to satisfy a linter since we are ignoring both outputs of the c.ParallelE. Hopefully what's there now is ok.
pkg/roachprod/install/cluster_synced.go line 828 at r8 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Nits: I find these two functions not very idiomatic. First, taking a function as parameter seems overly abstract, in reality there's just one (or a couple) function that should be called that makes this function meaningful. Why not take in
(err)as parameter and let the caller call the function?Also, do we need to reuse the
RunResultDetailsstruct here? It's more natural to me if these were immutable and returned a new instance when called instead.
Done. I'll defer to your experience regarding idiomatic golang code. Still a noob in that regard :) The intent of the func param initially was to show that this function was merely wrapping/enriching the original call.
pkg/roachprod/install/cluster_synced.go line 2220 at r8 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I am only seeing two returns out of
ParallelE,if len(failed) > 0 { return failed, errors.New("one or more parallel execution failure") } return nil, nil
Might have been looking at an old revision. I removed my comment - the confusion stemmed from me these function comments .
// ParallelE runs the given function in parallel across the given
// nodes, returning an error if function returns an error.
...
// If err is non-nil, the slice of ParallelResults will contain the
// results from any of the failed invocations.
It implies that if any node returns any error, then ParallelE will return an error. It actually will only do that for a roachprod error (as opposed to an error over SSH)
smg260
left a comment
There was a problem hiding this comment.
Thanks - I also tested with a simple roachtest.
- You should see all errors in the failure.log. I've since also implemented
CombineErrorsto preserve previous failures - Failures for long running processes will retry, but ultimately will still be bound by the overall timeout configured for the test.
- I could not reproduce that case, but like above, after a timeout, SIGQUIT and then SIGKILL will be sent to the process.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/tests/BUILD.bazel line 50 at r19 (raw file):
Previously, renatolabs (Renato Costa) wrote…
👀
Done.
pkg/roachprod/install/cluster_synced.go line 699 at r8 (raw file):
Previously, renatolabs (Renato Costa) wrote…
🔍
Done.
pkg/roachprod/install/cluster_synced.go line 741 at r15 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Maybe a comment as to why this is a good way to prioritize?
Done.
srosenberg
left a comment
There was a problem hiding this comment.
The latest iteration looks good, only minor comments. Improvements look great, but there were also quite a few changes, hence the added caution.
Assuming (a) roachprod-stress didn't reveal any regressions and @renatolabs, @herkolategan feel confident about the changes, let's merge!
Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 13 of 13 files at r16, 9 of 9 files at r17, 5 of 5 files at r18, 2 of 2 files at r19, 1 of 3 files at r20.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @smg260)
pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):
Previously, smg260 (Miral Gadani) wrote…
I think its a distinct enough unit of work to be separated. Having errors printed out is unexpected, but both the printing and selecting of an error require a full iteration of the slice.
Maybe renaming would denote more clearly what this func is doing? E.g.,
// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus.
// Optionally, if stream==false, each node's combined output is logged to stdout.
func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error
pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):
// selectError returns the error from the RunResultDetails with the highest RemoteExitStatus. // This preserves the heuristic in the previously implemented and used `rperrors.SelectPriorityError`.
I'd remove this comment line since it's not going to be helpful after this PR is merged.
78a8c2d to
3991bd5
Compare
herkolategan
left a comment
There was a problem hiding this comment.
Great addition! Looking forward to the increased stability this provides. Left one or two comments where a comment didn't make sense to me.
Reviewed 3 of 16 files at r1, 1 of 5 files at r5, 1 of 4 files at r6, 1 of 13 files at r12, 10 of 13 files at r16, 7 of 9 files at r17, 2 of 5 files at r18, 2 of 3 files at r20, 4 of 4 files at r21, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs and @smg260)
pkg/cmd/roachtest/tests/gopg.go line 119 at r21 (raw file):
) // Fatal for a roachprod err or SSH error. Roachprod error is when result.Err==nil
I'm struggling to make sense of the wording on the second sentence here, although I might be missing something. Saw it in a few other places, but only mentioning here.
pkg/roachprod/install/cluster_synced.go line 750 at r21 (raw file):
results := make([]*RunResultDetails, len(nodes)) // A result is the output of running a exitCode (could be interpreted as an error)
Nit: an exitCode
3991bd5 to
775beac
Compare
smg260
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/roachprod/install/cluster_synced.go line 726 at r8 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
Maybe renaming would denote more clearly what this func is doing? E.g.,
// processRunOutput returns the error from the RunResultDetails with a _maximum_ RemoteExitStatus. // Optionally, if stream==false, each node's combined output is logged to stdout. func processRunOutput(results []*RunResultDetails, stream bool, stdout io.Writer) error
Done.
pkg/roachprod/install/cluster_synced.go line 766 at r20 (raw file):
Previously, srosenberg (Stan Rosenberg) wrote…
I'd remove this comment line since it's not going to be helpful after this PR is merged.
Done.
…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386
775beac to
ae0ad5c
Compare
|
bors r+ |
|
Build succeeded: |
|
blathers backport 22.1 22.2 |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from ae0ad5c to blathers/backport-release-22.1-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.1 failed. See errors above. error creating merge commit from ae0ad5c to blathers/backport-release-22.2-90451: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
roachtest: This change introduces a default retry when encountering an error of
255(SSH). It is transparent to callers. If desired, it would be relatively simple to expose this via theclusterinterface and allow callers to specify a retry handler/options on a per command basis, perhaps something like:but this PR does not introduce that.
Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from
f(i int) ([]byte, error)tof(i int) RunResultDetails, error.RunResultDetailshas been expanded to capture information about the execution of a command, such asstderr/our, exit code, error, attempt number. Any roachprod error will result in anerrorbeing returned, and set inRunResultDetails.Err. Any command error would be only set inRunResultDetails.Errwith the returned error being nil. This allows callers to differentiate between a roachprod and a command error - something which existing code depends on.Retry handling code can use a function's returned
RunResultDetailsto determine whether a retry should take place. This PR has default retry handling onRunResultDetails.RemoteExitStatus == 255.Notable changes
255exit code for any function executed viaParallelERun/RunWithDetails(significant duplication of code)"echo $?"option to print exit code at end of cmd. It is not clear why this was done. (Perhaps for visibility in logs?) Accompanying function to parse out the status has also been removed. The exit code is still accessible via theRunResultDetails!install.NonZeroExitCodeResolves: #73542
Release note: None
Epic: CRDB-21386