roachtest: fix panic in mixedversion when cockroach fails to start#103799
Conversation
When a mixed-version test (using the `mixedversion` package) starts, the `cockroach` binary is uploaded to every node, and then the cluster is started. If, for some reason, the cockroach process crashes in this startup phase, the `mixedversion` package would panic while generating an error message. The reason for that is that there was an assumption that the connection cache was initialized at that point, which does not hold when the test failure happened on test setup. This fixes this issue by making sure we check for the status of the connection cache when generating error messages. We also make sure concurrent accesses to the connection cache are safe; while this is not strictly necessary (no concurrent reads and writes to it right now), it will likely help in the future as this code changes. Epic: none Release note: None
srosenberg
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @renatolabs)
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 310 at r1 (raw file):
clusterVersionsBefore := tr.clusterVersions var clusterVersionsAfter atomic.Value if tr.connCacheInitialized() {
Should we log the fact that the failure occurred during the first step, i.e., before initialization finished? Or is it already captured somewhere else?
renatolabs
left a comment
There was a problem hiding this comment.
TFTR!
bors r=srosenberg
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
pkg/cmd/roachtest/roachtestutil/mixedversion/runner.go line 310 at r1 (raw file):
The error itself already includes that. Here's the error I see when I forced roachtest to use bad fixtures:
[mixed-version-test/1_starting-cluster-from-fixtures-for-version-22.2.9] 16:28:48 runner.go:328: mixed-version test failure while running step 1 (starting cluster from fixtures for version "22.2.9"): ~ ./cockroach.sh
|
Build failed: |
|
Flaked: #103531 (comment). bors retry |
|
Build succeeded: |
When a mixed-version test (using the
mixedversionpackage) starts, thecockroachbinary is uploaded to every node, and then the cluster is started. If, for some reason, the cockroach process crashes in this startup phase, themixedversionpackage would panic while generating an error message. The reason for that is that there was an assumption that the connection cache was initialized at that point, which does not hold when the test failure happened on test setup.This fixes this issue by making sure we check for the status of the connection cache when generating error messages. We also make sure concurrent accesses to the connection cache are safe; while this is not strictly necessary (no concurrent reads and writes to it right now), it will likely help in the future as this code changes.
Epic: none
Release note: None