-
Notifications
You must be signed in to change notification settings - Fork 4.1k
roachtest: special-case SSH_PROBLEM as infra flakes #82398
Description
Is your feature request related to a problem? Please describe.
There's a steady trickle of nightly failures across teams that are caused by flaky SSH connections (which generally are thought to be caused by cloud networking hiccups and the like). For example, #82123 (comment)
This is a drag on the eng teams, who would rather only see more meaningful failures.
Describe the solution you'd like
We already have a way to post failures to an infra-failure test issue instead when a cluster creation operation fails:
We should use a similar mechanism to redirect (to test-eng) failures in which we see an indication of failed SSH connections, to shield the eng teams from such noise.
Ideally we do so in a way that isn't too hard-coded to SSH failures, so we can reuse it for other failures that are similarly "not the owning team's fault". For example, an apt-get flake should get similar special casing, though this may not necessarily be done as part of this issue. A good mechanism of action is to rely on cockroachdb/errors to return semantics with the error.
For example, we could have a marker error ErrSSHProblem, and could have code around here
cockroach/pkg/cmd/roachtest/cluster.go
Lines 1925 to 1941 in 17159f5
| // RunE runs a command on the specified node, returning an error. The output | |
| // will be redirected to a file which is logged via the cluster-wide logger in | |
| // case of an error. Logs will sort chronologically. Failing invocations will | |
| // have an additional marker file with a `.failed` extension instead of `.log`. | |
| func (c *clusterImpl) RunE(ctx context.Context, node option.NodeListOption, args ...string) error { | |
| if len(args) == 0 { | |
| return errors.New("No command passed") | |
| } | |
| l, logFile, err := c.loggerForCmd(node, args...) | |
| if err != nil { | |
| return err | |
| } | |
| if err := errors.Wrap(ctx.Err(), "cluster.RunE"); err != nil { | |
| return err | |
| } | |
| err = execCmd(ctx, l, c.MakeNodes(node), args...) |
that does this:
if isExitCode255Err(err) {
err = errors.Mark(err, something.ErrSSHProblem)
}and then the issue posting code could do
if errors.Is(err, something.ErrSSHProblem) {
// Override what issue to post
}in maybePostGithubIssue, whose main caller is here:
cockroach/pkg/cmd/roachtest/test_runner.go
Lines 785 to 798 in 5730899
| if t.Failed() { | |
| t.mu.Lock() | |
| output := fmt.Sprintf("test artifacts and logs in: %s\n", t.ArtifactsDir()) + string(t.mu.output) | |
| t.mu.Unlock() | |
| if teamCity { | |
| shout(ctx, l, stdout, "##teamcity[testFailed name='%s' details='%s' flowId='%s']", | |
| t.Name(), teamCityEscape(output), runID) | |
| } | |
| shout(ctx, l, stdout, "--- FAIL: %s (%s)\n%s", runID, durationStr, output) | |
| r.maybePostGithubIssue(ctx, l, t, stdout, output) | |
| } else { |
One minor change that needs to happen here is that the issue poster only sees a string, but it should see the full error. In particular, Fatalf should internally translate to t.mu.err = errors.Errorf(format, args...) instead of what these two methods do now:
cockroach/pkg/cmd/roachtest/test_impl.go
Lines 273 to 280 in 185e479
| func (t *testImpl) markFailedInner(format string, args ...interface{}) { | |
| // Skip two frames: our own and the caller. | |
| if format != "" { | |
| t.printfAndFail(2 /* skip */, format, args...) | |
| } else { | |
| t.printAndFail(2 /* skip */, args...) | |
| } | |
| } |
and we should stringify only when actually posting the issue. This would make for a good standalone, first, PR.
Describe alternatives you've considered
Retries might be able to alleviate some of the pain, but not all operations are idempotent and we'd have to do full passes and folks would have to "do it right" for all new code. I consider that an orthogonal approach and we might want to do both.
Additional context
the roachprod logic to test SSH flakes is here:
cockroach/pkg/roachprod/install/cluster_synced.go
Lines 609 to 616 in 5861187
| err = sess.Run(ctx, nodeCmd) | |
| result.Stderr = stderrBuffer.String() | |
| result.Stdout, result.RemoteExitStatus = processStdout(stdoutBuffer.String()) | |
| if err != nil { | |
| detailMsg := fmt.Sprintf("Node %d. Command with error:\n```\n%s\n```\n", node, cmd) | |
| err = errors.WithDetail(err, detailMsg) | |
| err = rperrors.ClassifyCmdError(err) |
ClassifyCmdError basically checks if err is an *exec.ExitError with code 255, which is then presumed to come from ssh. This assumption isn't right in entirety, if the remote command exits with 255 then we get 255 as well:
tobias@gceworker-tobias:~/go/src/github.com/cockroachdb/cockroach$ roachprod ssh $c:4 -- exit 12
Error: COMMAND_PROBLEM: exit status 12
(1) COMMAND_PROBLEM
Wraps: (2) Node 4. Command with error:
| ```
| exit 12
| ```
Wraps: (3) exit status 12
Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
tobias@gceworker-tobias:~/go/src/github.com/cockroachdb/cockroach$ roachprod ssh $c:4 -- exit 255
Error: SSH_PROBLEM: exit status 255
(1) SSH_PROBLEM
Wraps: (2) Node 4. Command with error:
| ```
| exit 255
| ```
Wraps: (3) exit status 255
Error types: (1) errors.SSH (2) *hintdetail.withDetail (3) *exec.ExitError
but apart from that it's mostly right (from man ssh):
EXIT STATUS
ssh exits with the exit status of the remote command or with 255 if an
error occurred.
Btw, we also do this thing where we try to get the remote exit status from a string that is printed at the end:
cockroach/pkg/roachprod/install/cluster_synced.go
Lines 556 to 569 in 5861187
| func processStdout(stdout string) (string, string) { | |
| retStdout := stdout | |
| exitStatusPattern := "LAST EXIT STATUS: " | |
| exitStatusIndex := strings.LastIndex(retStdout, exitStatusPattern) | |
| remoteExitStatus := "-1" | |
| // If exitStatusIndex is -1 then "echo LAST EXIT STATUS: $?" didn't run | |
| // mostly due to an ssh error but avoid speculation and temporarily | |
| // use "-1" for unknown error before checking if it's SSH related later. | |
| if exitStatusIndex != -1 { | |
| retStdout = stdout[:exitStatusIndex] | |
| remoteExitStatus = strings.TrimSpace(stdout[exitStatusIndex+len(exitStatusPattern):]) | |
| } | |
| return retStdout, remoteExitStatus | |
| } |
I'm not sure why this is needed, as the roachprod ssh $c:4 -- exit N experiment shows that we get the exit code directly from the returned error and this should be reliable.
Jira issue: CRDB-16353