roachtest: special-case SSH_PROBLEM as infra flakes

**Is your feature request related to a problem? Please describe.**

There's a steady trickle of nightly failures across teams that are caused by flaky SSH connections (which generally are thought to be caused by cloud networking hiccups and the like). For example, https://github.com/cockroachdb/cockroach/issues/82123#issuecomment-1145787900

This is a drag on the eng teams, who would rather only see more meaningful failures.

**Describe the solution you'd like**

We already have a way to post failures to an infra-failure test issue instead when a cluster creation operation fails:

https://github.com/cockroachdb/cockroach/pull/73706/files#diff-1e3ec1fc5bade2df4b8681c44959be2a390f703839f30a979da98f963769c9fcR589-R592

We should use a similar mechanism to redirect (to test-eng) failures in which we see an indication of failed SSH connections, to shield the eng teams from such noise.

Ideally we do so in a way that isn't too hard-coded to SSH failures, so we can reuse it for other failures that are similarly "not the owning team's fault". For example, an apt-get flake should get similar special casing, though this may not necessarily be done as part of this issue. A good mechanism of action is to rely on `cockroachdb/errors` to return semantics with the error.

For example, we could have a marker error `ErrSSHProblem`, and could have code around here

https://github.com/cockroachdb/cockroach/blob/17159f538b96180eb60f24682186ec5c80fec8dc/pkg/cmd/roachtest/cluster.go#L1925-L1941

that does this:

```go
if isExitCode255Err(err) {
  err = errors.Mark(err, something.ErrSSHProblem)
}
```

and then the issue posting code could do

```go
if errors.Is(err, something.ErrSSHProblem) {
  // Override what issue to post
}
```

in `maybePostGithubIssue`, whose main caller is here: https://github.com/cockroachdb/cockroach/blob/5730899b12d3c28d5bcf5b60ec6f40aae158d973/pkg/cmd/roachtest/test_runner.go#L785-L798

One minor change that needs to happen here is that the issue poster only sees a `string`, but it should see the full error. In particular, `Fatalf` should internally translate to `t.mu.err = errors.Errorf(format, args...)` instead of what these two methods do now: https://github.com/cockroachdb/cockroach/blob/185e479042a08c2af1e4c62547cc6cb067d21019/pkg/cmd/roachtest/test_impl.go#L273-L280

and we should stringify only when actually posting the issue. This would make for a good standalone, first, PR.

**Describe alternatives you've considered**

Retries might be able to alleviate some of the pain, but not all operations are idempotent and we'd have to do full passes and folks would have to "do it right" for all new code. I consider that an orthogonal approach and we might want to do both.

**Additional context**

the `roachprod` logic to test SSH flakes is here:

https://github.com/cockroachdb/cockroach/blob/5861187ce91caffdffa6babb43b818476fc0c8ce/pkg/roachprod/install/cluster_synced.go#L609-L616

`ClassifyCmdError` basically checks if `err` is an `*exec.ExitError` with code 255, which is then presumed to come from `ssh`. This assumption isn't right in entirety, if the remote command exits with 255 then we get 255 as well:

```
tobias@gceworker-tobias:~/go/src/github.com/cockroachdb/cockroach$ roachprod ssh $c:4 -- exit 12
Error: COMMAND_PROBLEM: exit status 12
(1) COMMAND_PROBLEM
Wraps: (2) Node 4. Command with error:
  | ```
  | exit 12
  | ```
Wraps: (3) exit status 12
Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
tobias@gceworker-tobias:~/go/src/github.com/cockroachdb/cockroach$ roachprod ssh $c:4 -- exit 255
Error: SSH_PROBLEM: exit status 255
(1) SSH_PROBLEM
Wraps: (2) Node 4. Command with error:
  | ```
  | exit 255
  | ```
Wraps: (3) exit status 255
Error types: (1) errors.SSH (2) *hintdetail.withDetail (3) *exec.ExitError
```

but apart from that it's mostly right (from `man ssh`):

```
EXIT STATUS
     ssh exits with the exit status of the remote command or with 255 if an
     error occurred.
```

Btw, we also do this thing where we try to get the remote exit status from a string that is printed at the end:

https://github.com/cockroachdb/cockroach/blob/5861187ce91caffdffa6babb43b818476fc0c8ce/pkg/roachprod/install/cluster_synced.go#L556-L569

I'm not sure why this is needed, as the `roachprod ssh $c:4 -- exit N` experiment shows that we get the exit code directly from the returned error and this should be reliable.

Jira issue: CRDB-16353

	// RunE runs a command on the specified node, returning an error. The output
	// will be redirected to a file which is logged via the cluster-wide logger in
	// case of an error. Logs will sort chronologically. Failing invocations will
	// have an additional marker file with a `.failed` extension instead of `.log`.
	func (c *clusterImpl) RunE(ctx context.Context, node option.NodeListOption, args ...string) error {
	if len(args) == 0 {
	return errors.New("No command passed")
	}
	l, logFile, err := c.loggerForCmd(node, args...)
	if err != nil {
	return err
	}

	if err := errors.Wrap(ctx.Err(), "cluster.RunE"); err != nil {
	return err
	}
	err = execCmd(ctx, l, c.MakeNodes(node), args...)

	if t.Failed() {
	t.mu.Lock()
	output := fmt.Sprintf("test artifacts and logs in: %s\n", t.ArtifactsDir()) + string(t.mu.output)
	t.mu.Unlock()

	if teamCity {
	shout(ctx, l, stdout, "##teamcity[testFailed name='%s' details='%s' flowId='%s']",
	t.Name(), teamCityEscape(output), runID)
	}

	shout(ctx, l, stdout, "--- FAIL: %s (%s)\n%s", runID, durationStr, output)

	r.maybePostGithubIssue(ctx, l, t, stdout, output)
	} else {

	func (t *testImpl) markFailedInner(format string, args ...interface{}) {
	// Skip two frames: our own and the caller.
	if format != "" {
	t.printfAndFail(2 /* skip */, format, args...)
	} else {
	t.printAndFail(2 /* skip */, args...)
	}
	}

	err = sess.Run(ctx, nodeCmd)
	result.Stderr = stderrBuffer.String()
	result.Stdout, result.RemoteExitStatus = processStdout(stdoutBuffer.String())

	if err != nil {
	detailMsg := fmt.Sprintf("Node %d. Command with error:\n```\n%s\n```\n", node, cmd)
	err = errors.WithDetail(err, detailMsg)
	err = rperrors.ClassifyCmdError(err)

	func processStdout(stdout string) (string, string) {
	retStdout := stdout
	exitStatusPattern := "LAST EXIT STATUS: "
	exitStatusIndex := strings.LastIndex(retStdout, exitStatusPattern)
	remoteExitStatus := "-1"
	// If exitStatusIndex is -1 then "echo LAST EXIT STATUS: $?" didn't run
	// mostly due to an ssh error but avoid speculation and temporarily
	// use "-1" for unknown error before checking if it's SSH related later.
	if exitStatusIndex != -1 {
	retStdout = stdout[:exitStatusIndex]
	remoteExitStatus = strings.TrimSpace(stdout[exitStatusIndex+len(exitStatusPattern):])
	}
	return retStdout, remoteExitStatus
	}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: special-case SSH_PROBLEM as infra flakes #82398

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

roachtest: special-case SSH_PROBLEM as infra flakes #82398

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions