roachtest: acceptance/cli/node-status is flaky #107791
Description
It's failed 17 of the last ~100 runs: https://teamcity.cockroachdb.com/test/-4112874617452834710?currentProjectId=Cockroach_Ci_Tests&expandTestHistoryChartSection=true
Failure is:
```
acceptance/cli/node-status
14:02:57 test_runner.go:956: [w0] --- FAIL: acceptance/cli/node-status (3896.57s)
	(cli.go:91).func3: expected [is_available is_live false false false false false false], but found [] from:
	(test_runner.go:1122).func1: 2 dead node(s) detected
test artifacts and logs in: /artifacts/acceptance/cli/node-status/run_1
```
@srosenberg points out that an even greater problem for our CI pipeline may be that the cleanup logic after this failure takes up to an hour because of missing timeouts:
> looks like `acceptance/cli/node-status` took 1 hour longer than expected… test failed (dunno why yet), but on teardown we try to `FetchTimeseriesData`, which in turn calls `gosql.Open("postgres", dataSourceName)`. it’s unable to connect because nodes are down, so it retries until the timeout for `collectArtifacts` (1hr) expires. The leaked goroutines [1] confirm the story.
It looks like a flake. A few things to resolve:
- why is there no timeout on the `gosql.Open` connection path?
- `postTestAssertions` uses `/health?ready=1`, which times out immediately
- why should `FetchTimeseriesData` be allowed nearly a whole hour?
- why is the local roachtest job allowed to run for more than 1hr?
- lastly, why did the test fail?
Jira issue: CRDB-30197
Epic: CRDB-28893