Improve Watcher test framework resiliency by AthenaEryma · Pull Request #40658 · elastic/elasticsearch

AthenaEryma · 2019-03-29T21:13:27Z

It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.

This should fix:
#35503
#35506
#40587
#40631
#40682 (this one is only muted on 6.7 so I'll have to test + unmute on the backport)

I've run the entire Watcher test suite with these changes ~1000 times and have seen only two failures, both unrelated (e.g. SocketTimeoutException). That is much more reliable than before these changes.

It is possible for the watches tracked by `ScheduleTriggerEngineMock` to get out of sync with the Watches in the `ScheduleTriggerEngine` production code, which can lead to watches failing to run. This commit adds some additional checks and locking to the test code (production code is unaffected) to eliminate these problems.
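
A minimal sketch of how the first two changes in the description interact. `MockScheduler` and `TimeWarpSketch` are hypothetical names invented for illustration and are not the real Watcher test classes; the only assumption carried over from the description is that the watch is offered to every scheduler and that a paused scheduler declines to run it.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a mock trigger engine (not ScheduleTriggerEngineMock itself).
class MockScheduler {
    private final Set<String> watchIds = ConcurrentHashMap.newKeySet();
    private final AtomicBoolean paused = new AtomicBoolean(false);
    final AtomicInteger executions = new AtomicInteger();

    void add(String watchId) {
        watchIds.add(watchId);
    }

    void pauseExecution() {
        paused.set(true);
    }

    // Runs the watch only if this scheduler is not paused and knows the watch.
    boolean trigger(String watchId) {
        if (paused.get() || !watchIds.contains(watchId)) {
            return false;
        }
        executions.incrementAndGet();
        return true;
    }
}

// Hypothetical stand-in for the TimeWarp helper.
class TimeWarpSketch {
    // Offers the watch to every scheduler instead of stopping at the first
    // one that claims to have it; returns how many actually ran it.
    static int triggerOnAll(List<MockScheduler> schedulers, String watchId) {
        int ran = 0;
        for (MockScheduler scheduler : schedulers) {
            if (scheduler.trigger(watchId)) {
                ran++;
            }
        }
        return ran;
    }
}
```

In this sketch, a scheduler that still has a stale registration but has been paused returns `false`, so offering the watch to all schedulers does not produce a duplicate execution.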

elasticmachine · 2019-03-29T21:13:28Z

Pinging @elastic/es-core-features

AthenaEryma · 2019-04-02T19:43:38Z

@elasticmachine run elasticsearch-ci/2

Just running the tests a few times to make sure this is stable on CI.

AthenaEryma · 2019-04-02T20:53:59Z

I'm going to run these tests on CI several more times just to be sure, but I think this is ready for review and it's been very stable locally.

martijnvg

I think these are good improvements to the mock engine.

I think we should backport this change slowly, so that if unexpected failures occur in CI, it doesn't fail on all branches this change is targeted for.

martijnvg · 2019-04-03T11:58:01Z

@gwbrown I think this PR should also fix #40631. Maybe unmute this test too in this PR?

AthenaEryma · 2019-04-03T14:56:12Z

Thanks for the review @martijnvg, and waiting on the backports is a good idea too. I'll add the test you suggest and run the tests a couple more times before merging, then let it sit in master for a few days and check build-stats to make sure it's all good before backporting.

...r/src/test/java/org/elasticsearch/xpack/watcher/test/AbstractWatcherIntegrationTestCase.java

jakelandis · 2019-04-03T18:09:54Z

...watcher/src/test/java/org/elasticsearch/xpack/watcher/trigger/ScheduleTriggerEngineMock.java

    private static final Logger logger = LogManager.getLogger(ScheduleTriggerEngineMock.class);

-    private final ConcurrentMap<String, Watch> watches = new ConcurrentHashMap<>();
+    private final AtomicReference<Map<String, Watch>> watches = new AtomicReference<>(new ConcurrentHashMap<>());


Curious why you're using an AtomicReference here?

Never mind, I see now that you are swapping out the instances below.

New question: why atomic swaps vs. a shared lock (for this test code)?

An earlier version of this used a lock, and it made the tests very slow: this version takes ~2-3m to run the whole suite, whereas I killed the version with locks at ~10m. I'm not entirely sure why.
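
For readers following the thread, here is a rough sketch of the atomic-swap idea, assuming a hypothetical `WatchRegistrySketch` class that stores watch IDs as strings rather than real `Watch` objects. It is not the code from this PR, only an illustration of swapping a fully built map in through an `AtomicReference`.

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration; the real mock tracks Watch objects, not strings.
class WatchRegistrySketch {
    private final AtomicReference<Map<String, String>> watches =
        new AtomicReference<>(new ConcurrentHashMap<>());

    // Build the replacement map completely, then publish it in one atomic write.
    void start(Collection<String> ids) {
        Map<String, String> fresh = new ConcurrentHashMap<>();
        for (String id : ids) {
            fresh.put(id, id);
        }
        watches.set(fresh);
    }

    // Stopping swaps in an empty map rather than clearing the old one in place.
    void stop() {
        watches.set(new ConcurrentHashMap<>());
    }

    void add(String id) {
        watches.get().put(id, id);
    }

    boolean remove(String id) {
        return watches.get().remove(id) != null;
    }

    boolean contains(String id) {
        return watches.get().containsKey(id);
    }

    int size() {
        return watches.get().size();
    }
}
```

Readers of the reference see either the old map or the new one, never a half-populated map, and no thread blocks. The trade-off is that an `add` racing with `start` can write into the map that is about to be discarded.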

jakelandis · 2019-04-03T18:25:24Z

...watcher/src/test/java/org/elasticsearch/xpack/watcher/trigger/ScheduleTriggerEngineMock.java

@@ -50,37 +53,42 @@ public ScheduleTriggerEvent parseTriggerEvent(TriggerService service, String wat

    @Override
    public void start(Collection<Watch> jobs) {


This method by itself is not thread safe. I would advise either synchronizing this method, using a shared lock between start/stop/add/remove (which negates the need for the atomic reference), or throwing an exception on an attempt to start/stop twice.

Good point, I'll take another look at this. There are a few cases where this could potentially have issues now that I look at it again.

This method is invoked from TriggerService#start(...) and that method is synchronized, so I don't think we need to synchronize this method.

I've added synchronized to the methods which modify watches to ensure this is thread safe. How does this look @jakelandis?
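
A rough sketch of what synchronizing the mutating methods could look like, assuming a hypothetical `SynchronizedRegistrySketch` class that guards a plain `HashMap` with the instance monitor. The actual change in this PR may differ (for example, it may keep the `AtomicReference`); this only illustrates the shared-lock variant discussed in this thread.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the shared-lock variant; not the PR's code.
class SynchronizedRegistrySketch {
    // Guarded by "this": every access goes through a synchronized method.
    private Map<String, String> watches = new HashMap<>();

    synchronized void start(Collection<String> ids) {
        Map<String, String> fresh = new HashMap<>();
        for (String id : ids) {
            fresh.put(id, id);
        }
        watches = fresh;
    }

    synchronized void stop() {
        watches = new HashMap<>();
    }

    synchronized void add(String id) {
        watches.put(id, id);
    }

    synchronized boolean remove(String id) {
        return watches.remove(id) != null;
    }

    synchronized int size() {
        return watches.size();
    }

    // Small demo helper: several threads add distinct IDs concurrently.
    static int concurrentAdds(int threads, int perThread) {
        SynchronizedRegistrySketch registry = new SynchronizedRegistrySketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.add("watch-" + (base + i));
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.size();
    }
}
```

Because `start`, `stop`, `add`, and `remove` all take the same monitor, an `add` can no longer interleave with a `start` that is replacing the map, at the cost of serializing those calls.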

This was not the cause of the new failures. This reverts commit d5dde77.

AthenaEryma · 2019-04-08T16:19:06Z

I've run the most recent revision (with a very minor tweak over the last reviewed version) over the weekend and it hasn't failed once in ~4000 runs, so I'm pretty confident in this version and intend to merge this after a couple successful CI runs.

AthenaEryma · 2019-04-08T22:09:13Z

@elasticmachine run elasticsearch-ci/2

Run was successful, just retriggering to get more confidence

AthenaEryma · 2019-04-09T14:48:27Z

As suggested above, I'm going to merge this to master and wait a few days before backporting to make sure this is stable.

…forced-unsafe-publication * elastic/master: Improve Watcher test framework resiliency (elastic#40658) Fix order of request body search parameter names in documentation (elastic#40777) Node repurpose tool docs (elastic#40525) [Docs] Delete explanation for completion suggester default analyzer choice (elastic#36720) Revert "Revert "Change HLRC CCR response tests to use AbstractResponseTestCase base class. (elastic#40257)"" (elastic#40971) Short-circuit rebalancing when disabled (elastic#40966)

It is possible for the watches tracked by ScheduleTriggerEngineMock to get out of sync with the Watches in the ScheduleTriggerEngine production code, which can lead to watches failing to run. This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.

AthenaEryma · 2019-04-12T22:54:58Z

These changes have been in master since the beginning of the week with no failures in the unmuted tests, so I've merged the backports as well.

AthenaEryma added the >test and :Distributed/Watcher labels Mar 29, 2019

AthenaEryma added 8 commits April 1, 2019 11:12

Replace locking with AtomicReference + track pause

7a01c22

Merge branch 'master' into watcher/test-resiliency

c81c129

Unmute additional test

908430f

Switch test to use TimeWarp as it's more reliable

26a6b5e

[?] Don't break if a watch is run on >1 scheduler

49f66e9

Re-mute test that isn't helped by these changes

d7374de

Make sure the watch is picked up immediately

3b89d6e

Re-mute another test to which isn't helped

44fce8b

AthenaEryma marked this pull request as ready for review April 2, 2019 16:13

Merge branch 'master' into watcher/test-resiliency

a47e774

AthenaEryma added v6.7.2 v7.2.0 v8.0.0 labels Apr 2, 2019

AthenaEryma requested a review from hub-cap April 2, 2019 20:49

AthenaEryma added the v7.0.0 label Apr 2, 2019

AthenaEryma requested a review from jakelandis April 2, 2019 20:53

martijnvg approved these changes Apr 3, 2019

jakelandis reviewed Apr 3, 2019

...r/src/test/java/org/elasticsearch/xpack/watcher/test/AbstractWatcherIntegrationTestCase.java

jakelandis reviewed Apr 3, 2019

AthenaEryma added 2 commits April 4, 2019 13:16

Adjust synchronization per review

8091110

Merge branch 'master' into watcher/test-resiliency

5ea0551

[testing] Go back to no synchronization

d5dde77

jkakavas mentioned this pull request Apr 5, 2019

Source additional files correctly in elasticsearch-cli #40890

Merged

AthenaEryma added 3 commits April 5, 2019 09:34

Ensure new index is green in concrete index test

0fe11c4

Revert "[testing] Go back to no synchronization"

4756aee

This was not the cause of the new failures. This reverts commit d5dde77.

Merge branch 'master' into watcher/test-resiliency

b3f0025

AthenaEryma merged commit ff61bad into elastic:master Apr 9, 2019

AthenaEryma added the backport pending label Apr 9, 2019

AthenaEryma mentioned this pull request Apr 9, 2019

[7.x] Improve Watcher test framework resiliency (#40658) #41020

Merged

This was referenced Apr 9, 2019

[7.0] Improve Watcher test framework resiliency (#40658) #41021

Merged

[6.7] Improve Watcher test framework resiliency (#40658) #41023

Merged

AthenaEryma removed the backport pending label Apr 12, 2019

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
