roachtest: acceptance/cli/node-status is flaky #107791

@rafiss

It's failed 17 of the last ~100 runs: https://teamcity.cockroachdb.com/test/-4112874617452834710?currentProjectId=Cockroach_Ci_Tests&expandTestHistoryChartSection=true

Failure is:

 acceptance/cli/node-status
14:02:57 test_runner.go:956: [w0] --- FAIL: acceptance/cli/node-status (3896.57s)
    (cli.go:91).func3: expected [is_available is_live false false false false false false], but found [] from:
    (test_runner.go:1122).func1: 2 dead node(s) detected
    test artifacts and logs in: /artifacts/acceptance/cli/node-status/run_1
    --- FAIL: acceptance/cli/node-status (3896.57s)
    (cli.go:91).func3: expected [is_available is_live false false false false false false], but found [] from:
    (test_runner.go:1122).func1: 2 dead node(s) detected
    test artifacts and logs in: /artifacts/acceptance/cli/node-status/run_1

@srosenberg points out that perhaps an even greater problem for our CI pipeline is that the cleanup logic after this failure takes up to an hour without proper timeouts:

Looks like acceptance/cli/node-status took 1 hour longer than expected. The test failed (dunno why yet), but on teardown we try to FetchTimeseriesData, which in turn calls gosql.Open("postgres", dataSourceName). It's unable to connect because the nodes are down, so it retries until the timeout for collectArtifacts (1hr) expires. The leaked goroutines [1] confirm the story.
It looks like a flake. A few things to resolve:

  • why no timeout for gosql.Open?
  • postTestAssertions uses /health?ready=1 which times out immediately
  • why should FetchTimeseriesData be allowed nearly a whole 1hr?
  • why is the local roachtest job allowed to run > 1hr?
  • lastly, why did the test fail?
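
For the first and third questions, one possible shape for a fix is to give the retry loop its own, much shorter deadline instead of letting it inherit the 1hr collectArtifacts budget. The following is only a sketch under stated assumptions: retryWithDeadline, errNodeDown and demo are hypothetical names that do not exist in roachtest, and the failing closure merely stands in for a connection attempt against a dead node. In real code the closure might call something like (*sql.DB).PingContext(ctx), since sql.Open on its own may only validate its arguments without connecting.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithDeadline is a hypothetical helper (not part of roachtest).
// It calls attempt repeatedly until it succeeds or until timeout has
// elapsed, and returns the number of attempts made.
func retryWithDeadline(
	parent context.Context,
	timeout, backoff time.Duration,
	attempt func(context.Context) error,
) (int, error) {
	// Derive a context that expires after timeout, regardless of how
	// generous the parent's deadline is.
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	for n := 1; ; n++ {
		lastErr := attempt(ctx)
		if lastErr == nil {
			return n, nil
		}
		select {
		case <-ctx.Done():
			// Deadline hit: stop retrying and report why.
			return n, fmt.Errorf("gave up after %d attempt(s): %w (last error: %v)",
				n, ctx.Err(), lastErr)
		case <-time.After(backoff):
			// Wait a little before the next attempt.
		}
	}
}

// errNodeDown stands in for the error returned when a node is dead.
var errNodeDown = errors.New("connection refused")

// demo simulates connecting to a node that never comes back.
func demo() (int, error) {
	return retryWithDeadline(
		context.Background(),
		50*time.Millisecond, // illustrative budget, far below 1hr
		10*time.Millisecond,
		func(ctx context.Context) error { return errNodeDown },
	)
}

func main() {
	n, err := demo()
	fmt.Printf("attempts=%d err=%v\n", n, err)
}
```

With a bound like this, the teardown step would fail fast when nodes are down rather than retrying until the outer timeout expires. The durations are placeholders, and the right budget for FetchTimeseriesData is a separate decision.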

[1] https://teamcity.cockroachdb.com/repository/download/Cockroach_Ci_Tests_LocalRoachtest/11093729:id/_runner-logs/test_runner-1690507817.log

Jira issue: CRDB-30197
Epic: CRDB-28893

Labels:

  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • C-test-failure: Broken test (automatically or manually discovered).
  • T-kv: KV Team
  • db-cy-23
  • skipped-test
