bazci: enable --flaky_test_attempts=3 for Unit tests on release branches by srosenberg · Pull Request #86891 · cockroachdb/cockroach

srosenberg · 2022-08-25T17:51:23Z

When a test fails, --flaky_test_attempts=3 will cause up to two retry
attempts by executing the corresponding test binary (shard).
If a test passes on either attempt, it is reported 'FLAKY' in the
test result summary. In addition, each attempt yields an (XML) log
under 'test_attempts' subdirectory. The attempts are now copied
into the artifactsDir.

Note, while the test result summary may say the test is 'FLAKY', it's
reported as 'PASS' as per main test.xml. Thus, a 'FLAKY' test will
not fail the CI; i.e., a test succeeding after up to two
retries is assumed to have passed whereas three consecutive
failures result in a failed test. Consequently, CI now has a higher
probability of passing than before this change. However, the two additional
attempts per test binary (shard) could slow down the build. Hence, initially
the plan is to enable only on release branches and not staging (i.e., bors).

This change supports [1]. In a follow-up PR, a test reporter could be
augmented to parse attempt logs thereby obtaining a set of FLAKY tests
which should be skipped on subsequent builds.

[1] #81293

Release justification: CI improvement
Release note: None
Epic: None

cockroach-teamcity · 2022-08-25T17:51:33Z

This change is

srosenberg · 2022-08-25T17:55:39Z

The potential slow down may not be as bad as it seems because (a) we run a lot of sharded tests, (b) many tests are IO-bound. Thus, some of the retry attempts would be concurrent. E.g., a test emulated to flake with p=0.5 results in a very modest slow down,

time bazel test //pkg/ccl/changefeedccl:changefeedccl_test --flaky_test_attempts=3 --cache_test_results=no
//pkg/ccl/changefeedccl:changefeedccl_test                               PASSED in 48.5s
  Stats over 16 runs: max = 48.5s, min = 10.1s, avg = 22.7s, dev = 9.6s

real	0m49.158s
user	0m0.010s
sys	0m0.018s


time bazel test //pkg/ccl/changefeedccl:changefeedccl_test --flaky_test_attempts=3 --cache_test_results=no
FAIL: //pkg/ccl/changefeedccl:changefeedccl_test (shard 1 of 16) (see /home/srosenberg/.cache/bazel/_bazel_srosenberg/3f1e38fca2391d213d04e1064a5fee62/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/testlogs/pkg/ccl/changefeedccl/changefeedccl_test/shard_1_of_16/test_attempts/attempt_1.log)

//pkg/ccl/changefeedccl:changefeedccl_test                                FLAKY, failed in 1 out of 17 in 50.9s
  Stats over 17 runs: max = 50.9s, min = 11.4s, avg = 23.3s, dev = 10.5s

real	0m51.495s
user	0m0.000s
sys	0m0.028s

build/teamcity/cockroach/ci/tests/unit_tests_impl.sh

renatolabs · 2022-08-25T19:09:10Z

Thus, this change doesn't change the status of CI, semantically

Not sure I'm missing what you were trying to say here, but I would believe this change does change the status of CI, in the sense that now they'll start "passing" a lot more frequently, giving the impression that things suddenly got a lot more stable (I believe that majority of our flakes have a flake rate < 33%, so chances are pretty high that at least one attempt will succeed).

srosenberg · 2022-08-25T19:13:27Z

Not sure I'm missing what you were trying to say here, but I would believe this change does change the status of CI, in the sense that now they'll start "passing" a lot more frequently, giving the impression that things suddenly got a lot more stable (I believe that majority of our flakes have a flake rate < 33%, so chances are pretty high that at least one attempt will succeed).

Good point! and poor phrasing on my part; amending.

rickystewart

After c8d66a9c334dd87acd62190f314a4f3ec236ce82 the changes to bazci are unnecessary. Also I'll be deleting the bazci testdata in #87116. So you can rebase and just keep the change to build/teamcity/cockroach/ci/tests/unit_tests_impl.sh.

srosenberg · 2022-12-16T01:12:15Z

@rickystewart @renatolabs PTAL. Ricky's PR to leverage BEP did the heavy lifting. So, the only change we're making at this time is adding --flaky_test_attempts=3 to the unit tests on release branches (this excludes bors).

Testing this locally I can see the retry attempt while the test passed.

func TestFlakes(t *testing.T) {
	prng, _ := randutil.NewPseudoRand()

	sample := prng.Float64()
	assert.True(t, sample >= .25, "Failed because the sample %f is outside", sample)
}

$(bazel info bazel-bin --config=ci)/pkg/cmd/bazci/bazci_/bazci -- test --cache_test_results=no --flaky_test_attempts=3 pkg/cmd/bazci:all --test_env=GOTRACEBACK=all --test_filter=TestFlakes

/test_attempts/attempt_1.xml
<testsuites>
	<testsuite errors="0" failures="1" skipped="0" tests="1" time="0.004" name="github.com/cockroachdb/cockroach/pkg/cmd/bazci">
		<testcase classname="bazci" name="TestFlakes" time="0.000">
			<failure message="Failed" type="">=== RUN   TestFlakes&#xA;    flaky_test.go:24: &#xA;        &#x9;Error Trace:&#x9;/home/srosenberg/.cache/bazel/_bazel_srosenberg/3f1e38fca2391d213d04e1064a5fee62/sandbox/linux-sandbox/1038/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/cmd/bazci/bazci_test_/bazci_test.runfiles/com_github_cockroachdb_cockroach/pkg/cmd/bazci/flaky_test.go:24&#xA;        &#x9;Error:      &#x9;Should be true&#xA;        &#x9;Test:       &#x9;TestFlakes&#xA;        &#x9;Messages:   &#x9;Failed because the sample 0.202662 is outside&#xA;--- FAIL: TestFlakes (0.00s)&#xA;</failure>
		</testcase>
	</testsuite>
</testsuites>

/test.xml
<testsuites>
	<testsuite errors="0" failures="0" skipped="0" tests="1" time="0.004" name="github.com/cockroachdb/cockroach/pkg/cmd/bazci">
		<testcase classname="bazci" name="TestFlakes" time="0.000"></testcase>
	</testsuite>
</testsuites>

renatolabs

LGTM.

After this is merged, is there any automated way to find out how often tests are retried?

renatolabs · 2022-12-16T21:18:58Z

I'd also mention in the commit message "on master and release branches".

healthy-pod · 2022-12-16T22:35:08Z

After this is merged, is there any automated way to find out how often tests are retried?

We can use data from beaver hub to create something in Snowflake. This should do it:

SELECT uuid,target_name,count(*) FROM test_targets Group By uuid,target_name HAVING count(*) > 1;

rickystewart

I have not validated that bazci works the way you describe, but I assume that your description is correct. Assuming that in each test.xml a failed test attempt followed by a success is not labeled as a failure, then this should be fine. AFAIC we can check this in and then keep an eye on CI results to make sure nothing looks wrong.

One question: with this change, flaky tests may become "invisible", i.e. we'll stop seeing failures for them, no GitHub issues will be posted, etc. Is this desirable?

build/teamcity/cockroach/ci/tests/unit_tests_impl.sh

When a test fails, --flaky_test_attempts=3 will cause up to two retry attempts by executing the corresponding test binary (shard). If a test passes on either attempt, it is reported 'FLAKY' in the test result summary. In addition, each attempt yields an (XML) log under 'test_attempts' subdirectory. The attempts are now copied into the artifactsDir. Note, while the test result summary may say the test is 'FLAKY', it's reported as 'PASS' as per main test.xml. Thus, a 'FLAKY' test will not fail the CI; i.e., a test succeeding after up to two retries is assumed to have _passed_ whereas three consecutive failures result in a failed test. Consequently, CI now has a higher probability of passing than before this change. However, the two additional attempts per test binary (shard) could slow down the build. Hence, initially the plan is to enable only on release branches and not staging (i.e., bors). This change supports [1]. In a follow-up PR, a test reporter could be augmented to parse attempt logs thereby obtaining a set of FLAKY tests which should be skipped on subsequent builds. [1] cockroachdb#81293 Release justification: CI improvement Release note: None

srosenberg · 2023-01-04T21:10:00Z

TFTR!

bors r=rickystewart,healthy-pod,renatolabs

craig · 2023-01-04T21:58:30Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2023-01-04T23:39:31Z

Build succeeded:

Bazel Essential CI (Cockroach)

srosenberg requested a review from a team as a code owner August 25, 2022 17:51

srosenberg requested review from renatolabs and rickystewart August 25, 2022 17:51

renatolabs reviewed Aug 25, 2022

View reviewed changes

build/teamcity/cockroach/ci/tests/unit_tests_impl.sh Outdated Show resolved Hide resolved

renatolabs reviewed Aug 25, 2022

View reviewed changes

build/teamcity/cockroach/ci/tests/unit_tests_impl.sh Outdated Show resolved Hide resolved

srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from ebc557b to 9d2b3a5 Compare August 25, 2022 18:59

srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from 9d2b3a5 to 93a206d Compare August 25, 2022 19:20

rickystewart requested changes Aug 30, 2022

View reviewed changes

srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from 93a206d to e234e32 Compare December 16, 2022 01:05

srosenberg requested a review from healthy-pod December 16, 2022 20:43

renatolabs approved these changes Dec 16, 2022

View reviewed changes

healthy-pod approved these changes Dec 16, 2022

View reviewed changes

rickystewart approved these changes Jan 4, 2023

View reviewed changes

build/teamcity/cockroach/ci/tests/unit_tests_impl.sh Outdated Show resolved Hide resolved

srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from e234e32 to c85a03f Compare January 4, 2023 19:56

craig bot merged commit cb3d6db into cockroachdb:master Jan 4, 2023

healthy-pod mentioned this pull request Feb 28, 2023

ci: retry unit tests on staging #97776

Merged

Conversation

srosenberg commented Aug 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Aug 25, 2022

Uh oh!

srosenberg commented Aug 25, 2022

Uh oh!

Uh oh!

Uh oh!

renatolabs commented Aug 25, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

srosenberg commented Aug 25, 2022

Uh oh!

rickystewart left a comment

Choose a reason for hiding this comment

Uh oh!

srosenberg commented Dec 16, 2022

Uh oh!

renatolabs left a comment

Choose a reason for hiding this comment

Uh oh!

renatolabs commented Dec 16, 2022

Uh oh!

healthy-pod commented Dec 16, 2022

Uh oh!

rickystewart left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

srosenberg commented Jan 4, 2023

Uh oh!

craig bot commented Jan 4, 2023

Uh oh!

craig bot commented Jan 4, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

srosenberg commented Aug 25, 2022 •

edited

Loading

renatolabs commented Aug 25, 2022 •

edited

Loading