roachtest: test runner shouldn't stop when c.Extend fails #112509

@srosenberg

Description

The lifetime of a cluster (e.g., one that's being reused by several tests) is likely to be extended so that it survives the 3-hour timeout of an individual roachtest run (see runTest in test_runner.go). The problem is as follows:

03:47:36 test_runner.go:814: [w9] test returned error: kv0/enc=false/nodes=3/cpu=32/size=64kb: failed to extend cluster: srosenberg-1697407709-21-n4cpu32: roachprod extend failed: failed to run: gcloud compute instances list --project cockroach-ephemeral --format json
stdout: []
stderr: ERROR: (gcloud.compute.instances.list) Some requests did not succeed:
 - Internal error. Please try again or contact Google Support. (Code: '607CD442FB529.6DE6388.C103F697'): exit status 1
03:47:39 test_runner.go:552: [w9] Worker exiting; no cluster to destroy.

That is, under the current implementation, if runTest is unable to extend the lifetime of the given cluster, the error propagates into the worker and is deemed non-recoverable: the worker exits and interrupts the other workers, causing the whole run to abort. Instead, we should attempt to allocate a new cluster and, if that also fails, skip the test and continue.

Jira issue: CRDB-32456

Metadata

Assignees

No one assigned

    Labels

    C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
    E-quick-win: Likely to be a quick win for someone experienced.
    T-testeng: TestEng Team
    v23.1.14
