Skip to content

chore: Fix flaky testCancelOuterFutureAfterStart test#2006

Merged
lqiu96 merged 6 commits intomainfrom
fix-testCancelOuterFutureAfterStart
Sep 26, 2023
Merged

chore: Fix flaky testCancelOuterFutureAfterStart test#2006
lqiu96 merged 6 commits intomainfrom
fix-testCancelOuterFutureAfterStart

Conversation

@lqiu96
Copy link
Member

@lqiu96 lqiu96 commented Sep 18, 2023

Thank you for opening a Pull Request! For general contributing guidelines, please refer to contributing guide

Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #1498


Reproduced:
Had some difficulty reproducing this with 20k invocations. Lowered the InitialRetryDelay value to allow less variance for the Randomized Retry Delay and was able to reproduce this fairly easily.

Setting the InitialRetryDelay to 100ms and got:

Random Delay: 3
Random Delay: 7

java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at com.google.api.gax.retrying.ScheduledRetryingExecutorTest.testCancelOuterFutureAfterStart(ScheduledRetryingExecutorTest.java:273)

Ran this with 20k invocations with 0 flakes with jitter off.

I believe the reason this test hasn't flaked as much is due to the extremely low (but non-zero) odds of from four RRD values.

@product-auto-label product-auto-label bot added the size: s Pull request size is small. label Sep 18, 2023
@lqiu96 lqiu96 marked this pull request as ready for review September 25, 2023 16:38
@lqiu96 lqiu96 requested a review from a team September 25, 2023 16:38
@lqiu96 lqiu96 added the owlbot:run Add this label to trigger the Owlbot post processor. label Sep 25, 2023
@gcf-owl-bot gcf-owl-bot bot removed the owlbot:run Add this label to trigger the Owlbot post processor. label Sep 25, 2023
Thread.sleep(30L);
// The test sleeps a duration long enough to ensure that the future has been submitted for
// execution
Thread.sleep(50L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the sleep time should be long enough to ensure the future has been submitted, but also short enough so the future is not completed(<275ms in the example)? If that's the case, can we make the retry delay longer and make this sleep time longer? So that we have higher chance of the future being submitted?

Copy link
Member Author

@lqiu96 lqiu96 Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible to make the retry delay longer, but that doesn't eliminate the possibility that RRD is calculated as multiple consecutive low values as it is calculated between 0...RETRY_DELAY. Increasing the retry delay only makes it less likely.

I believe removing Jitter (only for testing) would make the RRD values consistent and guarantee no flakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the removal of jitter should make it much more stable, but I think there is still a chance that the future is not submitted after 50 ms if the CPU is overwhelmed.
If I'm understanding correctly, the first retry is at 25ms and the last retry is at 275ms, so the sleep time should between 25 and 275. However. there is no guarantee that the first retry is at 25ms since it's a different thread, so there is a chance that after 50ms, future is not started yet. Let me know if it makes sense.

Copy link
Member Author

@lqiu96 lqiu96 Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I can a pick a number closer to 175ms (right before the 4th retry occurs) ... say 150ms? Obviously, still no guarantee that the future runs before, but would be extremely unlikely the CPU is stuck that long.

I can add in another check the ensure that the future runs at least once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can a pick a number closer to 175ms (right before the 4th retry occurs) ... say 150ms?

If it's too close to the end, then we may have a problem on the other side, that the sleep may take too long so the future is already completed. That's why I was suggesting maybe we can have the retry delay to 1000ms, so that the sleep time 150ms is long enough for 25 ms, and short enough for 1000ms.

I can add in another check the ensure that the future runs at least once.

Actually I get this from your comment, I'm not sure if it's worth it to add one. I wouldn't add it if it would make the test more flaky and does not add much value.

Copy link
Member Author

@lqiu96 lqiu96 Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok yeah agreed that 150ms might cut it too close to the other end. I can change the retry multiplier so that the max retry delay will hit 1000ms and buy us enough time.

The reason for the check to ensure it runs once is to check that the future from executor.submit() was run. The first attempt is always scheduled to "run immediately" with 0ms delay (should be be able to be scheduled within say a sleep time of 150ms) so I don't think there would be any flakes caused from that.

@sonarqubecloud
Copy link

[gapic-generator-java-root] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@sonarqubecloud
Copy link

[java_showcase_integration_tests] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@sonarqubecloud
Copy link

[java_showcase_unit_tests] Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@lqiu96 lqiu96 merged commit 3add10e into main Sep 26, 2023
@lqiu96 lqiu96 deleted the fix-testCancelOuterFutureAfterStart branch September 26, 2023 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size: s Pull request size is small.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gax-java(mac): Flaky ScheduledRetryingExecutorTest.testCancelOuterFutureAfterStart

2 participants