roachprod: audit and fix incorrect usages of c.Parallel wrt error handling

After implementing SSH retries and refactoring `c.Parallel`, the new implementation expects a user-defined function to return no `error` unless there was an _internal_ (roachprod) error; instead, if the executed function resulted in an error, it's expected to be in `RunResultDetails.Err`. 

The above (tacit) contract is violated by several existing commands, e.g., `Wipe`, `Stop`. When either of these functions errors out, the actual details of the error are _not_ written to the logs. E.g., an attempt to reuse a cluster failed to execute either `Wipe` or `Stop`, but the actual details are unknown; all we get is the following message,

```
13:47:29 test_runner.go:605: [w4] Unable to reuse cluster: srosenberg-1685591519-18-n4cpu32 due to: COMMAND_PROBLEM: exit status 1. Will attempt to create a fresh one
```

The ask is as follows,

* audit all the existing commands which use `c.Parallel` and fix contract violations
* ensure that even internal errors are handled correctly by logging their (nested) context
* update the documentation to make it clear that `error` should be `nil` unless there was an _internal_ error
  - Note, the majority of the cases appear to be `return res, nil`; perhaps they could use a different API, one which has a single return of type `*RunResultDetails`?

[1] https://github.com/cockroachdb/cockroach/pull/90451

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions