Skip to content

roachprod: audit and fix incorrect usages of c.Parallel wrt error handling #104312

@srosenberg

Description

@srosenberg

After implementing SSH retries and refactoring c.Parallel, the new implementation expects a user-defined function to return no error unless there was an internal (roachprod) error; instead, if the executed function resulted in an error, it's expected to be in RunResultDetails.Err.

The above (tacit) contract is violated by several existing commands, e.g., Wipe, Stop. When either of these functions errors out, the actual details of the error are not written to the logs. E.g., an attempt to reuse a cluster failed to execute either Wipe or Stop, but the actual details are unknown; all we get is the following message,

13:47:29 test_runner.go:605: [w4] Unable to reuse cluster: srosenberg-1685591519-18-n4cpu32 due to: COMMAND_PROBLEM: exit status 1. Will attempt to create a fresh one

The ask is as follows,

  • audit all the existing commands which use c.Parallel and fix contract violations
  • ensure that even internal errors are handled correctly by logging their (nested) context
  • update the documentation to make it clear that error should be nil unless there was an internal error
    • Note, the majority of the cases appear to be return res, nil; perhaps they could use a different API, one which has a single return of type *RunResultDetails?

[1] #90451

Metadata

Metadata

Assignees

Labels

A-testingTesting tools and infrastructureC-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.C-enhancementSolution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)T-testengTestEng Team

Type

No type

Projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions