Skip to content

bazci: enable --flaky_test_attempts=3 for Unit tests on release branches#86891

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
srosenberg:sr/ci_enable_flaky_test_attempts
Jan 4, 2023
Merged

bazci: enable --flaky_test_attempts=3 for Unit tests on release branches#86891
craig[bot] merged 1 commit intocockroachdb:masterfrom
srosenberg:sr/ci_enable_flaky_test_attempts

Conversation

@srosenberg
Copy link
Copy Markdown
Member

@srosenberg srosenberg commented Aug 25, 2022

When a test fails, --flaky_test_attempts=3 will cause up to two retry
attempts by executing the corresponding test binary (shard).
If a test passes on either attempt, it is reported 'FLAKY' in the
test result summary. In addition, each attempt yields an (XML) log
under 'test_attempts' subdirectory. The attempts are now copied
into the artifactsDir.

Note, while the test result summary may say the test is 'FLAKY', it's
reported as 'PASS' as per main test.xml. Thus, a 'FLAKY' test will
not fail the CI; i.e., a test succeeding after up to two
retries is assumed to have passed whereas three consecutive
failures result in a failed test. Consequently, CI now has a higher
probability of passing than before this change. However, the two additional
attempts per test binary (shard) could slow down the build. Hence, initially
the plan is to enable only on release branches and not staging (i.e., bors).

This change supports [1]. In a follow-up PR, a test reporter could be
augmented to parse attempt logs thereby obtaining a set of FLAKY tests
which should be skipped on subsequent builds.

[1] #81293

Release justification: CI improvement
Release note: None
Epic: None

@srosenberg srosenberg requested a review from a team as a code owner August 25, 2022 17:51
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@srosenberg
Copy link
Copy Markdown
Member Author

The potential slow down may not be as bad as it seems because (a) we run a lot of sharded tests, (b) many tests are IO-bound. Thus, some of the retry attempts would be concurrent. E.g., a test emulated to flake with p=0.5 results in a very modest slow down,

time bazel test //pkg/ccl/changefeedccl:changefeedccl_test --flaky_test_attempts=3 --cache_test_results=no
//pkg/ccl/changefeedccl:changefeedccl_test                               PASSED in 48.5s
  Stats over 16 runs: max = 48.5s, min = 10.1s, avg = 22.7s, dev = 9.6s

real	0m49.158s
user	0m0.010s
sys	0m0.018s


time bazel test //pkg/ccl/changefeedccl:changefeedccl_test --flaky_test_attempts=3 --cache_test_results=no
FAIL: //pkg/ccl/changefeedccl:changefeedccl_test (shard 1 of 16) (see /home/srosenberg/.cache/bazel/_bazel_srosenberg/3f1e38fca2391d213d04e1064a5fee62/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/testlogs/pkg/ccl/changefeedccl/changefeedccl_test/shard_1_of_16/test_attempts/attempt_1.log)

//pkg/ccl/changefeedccl:changefeedccl_test                                FLAKY, failed in 1 out of 17 in 50.9s
  Stats over 17 runs: max = 50.9s, min = 11.4s, avg = 23.3s, dev = 10.5s

real	0m51.495s
user	0m0.000s
sys	0m0.028s

@srosenberg srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from ebc557b to 9d2b3a5 Compare August 25, 2022 18:59
@renatolabs
Copy link
Copy Markdown

renatolabs commented Aug 25, 2022

Thus, this change doesn't change the status of CI, semantically

Not sure I'm missing what you were trying to say here, but I would believe this change does change the status of CI, in the sense that now they'll start "passing" a lot more frequently, giving the impression that things suddenly got a lot more stable (I believe that majority of our flakes have a flake rate < 33%, so chances are pretty high that at least one attempt will succeed).

@srosenberg
Copy link
Copy Markdown
Member Author

Not sure I'm missing what you were trying to say here, but I would believe this change does change the status of CI, in the sense that now they'll start "passing" a lot more frequently, giving the impression that things suddenly got a lot more stable (I believe that majority of our flakes have a flake rate < 33%, so chances are pretty high that at least one attempt will succeed).

Good point! and poor phrasing on my part; amending.

@srosenberg srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from 9d2b3a5 to 93a206d Compare August 25, 2022 19:20
Copy link
Copy Markdown
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After c8d66a9c334dd87acd62190f314a4f3ec236ce82 the changes to bazci are unnecessary. Also I'll be deleting the bazci testdata in #87116. So you can rebase and just keep the change to build/teamcity/cockroach/ci/tests/unit_tests_impl.sh.

@srosenberg srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from 93a206d to e234e32 Compare December 16, 2022 01:05
@srosenberg
Copy link
Copy Markdown
Member Author

@rickystewart @renatolabs PTAL. Ricky's PR to leverage BEP did the heavy lifting. So, the only change we're making at this time is adding --flaky_test_attempts=3 to the unit tests on release branches (this excludes bors).

Testing this locally I can see the retry attempt while the test passed.

func TestFlakes(t *testing.T) {
	prng, _ := randutil.NewPseudoRand()

	sample := prng.Float64()
	assert.True(t, sample >= .25, "Failed because the sample %f is outside", sample)
}
$(bazel info bazel-bin --config=ci)/pkg/cmd/bazci/bazci_/bazci -- test --cache_test_results=no --flaky_test_attempts=3 pkg/cmd/bazci:all --test_env=GOTRACEBACK=all --test_filter=TestFlakes
/test_attempts/attempt_1.xml
<testsuites>
	<testsuite errors="0" failures="1" skipped="0" tests="1" time="0.004" name="github.com/cockroachdb/cockroach/pkg/cmd/bazci">
		<testcase classname="bazci" name="TestFlakes" time="0.000">
			<failure message="Failed" type="">=== RUN   TestFlakes&#xA;    flaky_test.go:24: &#xA;        &#x9;Error Trace:&#x9;/home/srosenberg/.cache/bazel/_bazel_srosenberg/3f1e38fca2391d213d04e1064a5fee62/sandbox/linux-sandbox/1038/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/cmd/bazci/bazci_test_/bazci_test.runfiles/com_github_cockroachdb_cockroach/pkg/cmd/bazci/flaky_test.go:24&#xA;        &#x9;Error:      &#x9;Should be true&#xA;        &#x9;Test:       &#x9;TestFlakes&#xA;        &#x9;Messages:   &#x9;Failed because the sample 0.202662 is outside&#xA;--- FAIL: TestFlakes (0.00s)&#xA;</failure>
		</testcase>
	</testsuite>
</testsuites>
/test.xml
<testsuites>
	<testsuite errors="0" failures="0" skipped="0" tests="1" time="0.004" name="github.com/cockroachdb/cockroach/pkg/cmd/bazci">
		<testcase classname="bazci" name="TestFlakes" time="0.000"></testcase>
	</testsuite>
</testsuites>

Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

After this is merged, is there any automated way to find out how often tests are retried?

@renatolabs
Copy link
Copy Markdown

I'd also mention in the commit message "on master and release branches".

@healthy-pod
Copy link
Copy Markdown
Contributor

After this is merged, is there any automated way to find out how often tests are retried?

We can use data from beaver hub to create something in Snowflake. This should do it:

SELECT uuid,target_name,count(*) FROM test_targets Group By uuid,target_name HAVING count(*) > 1;

Copy link
Copy Markdown
Collaborator

@rickystewart rickystewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not validated that bazci works the way you describe, but I assume that your description is correct. Assuming that in each test.xml a failed test attempt followed by a success is not labeled as a failure, then this should be fine. AFAIC we can check this in and then keep an eye on CI results to make sure nothing looks wrong.

One question: with this change, flaky tests may become "invisible", i.e. we'll stop seeing failures for them, no GitHub issues will be posted, etc. Is this desirable?

When a test fails, --flaky_test_attempts=3 will cause up to two retry
attempts by executing the corresponding test binary (shard).
If a test passes on either attempt, it is reported 'FLAKY' in the
test result summary. In addition, each attempt yields an (XML) log
under 'test_attempts' subdirectory. The attempts are now copied
into the artifactsDir.

Note, while the test result summary may say the test is 'FLAKY', it's
reported as 'PASS' as per main test.xml. Thus, a 'FLAKY' test will
not fail the CI; i.e., a test succeeding after up to two
retries is assumed to have _passed_ whereas three consecutive
failures result in a failed test. Consequently, CI now has a higher
probability of passing than before this change. However, the two additional
attempts per test binary (shard) could slow down the build. Hence, initially
the plan is to enable only on release branches and not staging (i.e., bors).

This change supports [1]. In a follow-up PR, a test reporter could be
augmented to parse attempt logs thereby obtaining a set of FLAKY tests
which should be skipped on subsequent builds.

[1] cockroachdb#81293

Release justification: CI improvement
Release note: None
@srosenberg srosenberg force-pushed the sr/ci_enable_flaky_test_attempts branch from e234e32 to c85a03f Compare January 4, 2023 19:56
@srosenberg
Copy link
Copy Markdown
Member Author

TFTR!

bors r=rickystewart,healthy-pod,renatolabs

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 4, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 4, 2023

Build succeeded:

@craig craig bot merged commit cb3d6db into cockroachdb:master Jan 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants