Skip to content

Improve Watcher test framework resiliency#40658

Merged
AthenaEryma merged 18 commits intoelastic:masterfrom
AthenaEryma:watcher/test-resiliency
Apr 9, 2019
Merged

Improve Watcher test framework resiliency#40658
AthenaEryma merged 18 commits intoelastic:masterfrom
AthenaEryma:watcher/test-resiliency

Conversation

@AthenaEryma
Copy link
Copy Markdown
Contributor

@AthenaEryma AthenaEryma commented Mar 29, 2019

It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

  1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
  2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
  3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
  4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.

This should fix:
#35503
#35506
#40587
#40631
#40682 (this one is only muted on 6.7 so I'll have to test + unmute on the backport)


I've run the entire Watcher test suite with these changes ~1000 times and have only seen two failures, which were unrelated (e.g. SocketTimeoutException) , which is much more reliable than before these changes.

It is possible for the watches tracked by `ScheduleTriggerEngineMock` to
get out of sync with the Watches in the `ScheduleTriggerEngine`
production code, which can lead to watches failing to run.

This commit adds some additional checks and locking to the test code
(production code is unaffected) to eliminate these problems.
@AthenaEryma AthenaEryma added >test Issues or PRs that are addressing/adding tests :Distributed/Watcher labels Mar 29, 2019
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-core-features

@AthenaEryma AthenaEryma marked this pull request as ready for review April 2, 2019 16:13
@AthenaEryma
Copy link
Copy Markdown
Contributor Author

@elasticmachine run elasticsearch-ci/2

Just running the tests a few times to make sure this is stable on CI.

@AthenaEryma AthenaEryma requested a review from hub-cap April 2, 2019 20:49
@AthenaEryma AthenaEryma requested a review from jakelandis April 2, 2019 20:53
@AthenaEryma
Copy link
Copy Markdown
Contributor Author

I'm going to run these tests on CI several more times just to be sure, but I think this is ready for review and it's been very stable locally.

Copy link
Copy Markdown
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are good improvements to the mock engine.

I think we should backport this change slowly, so that if unexpected failures occur in CI, it doesn't fail on all branches this change is targeted for.

@martijnvg
Copy link
Copy Markdown
Member

@gwbrown I think this PR should also fix: #40631 Maybe unmute this test too in this pr?

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

Thanks for the review @martijnvg, and that's also a good idea to wait on the backports. I'll add the test you suggest and run the tests a couple more times before merging, then let it sit in master for a few days before backporting and check build-stats to make sure it's all good before backporting.

private static final Logger logger = LogManager.getLogger(ScheduleTriggerEngineMock.class);

private final ConcurrentMap<String, Watch> watches = new ConcurrentHashMap<>();
private final AtomicReference<Map<String, Watch>> watches = new AtomicReference<>(new ConcurrentHashMap<>());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why using an AtomicReference here ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvmd.. I see now that you are swapping out the instances below.

New question, why atomic swaps vs. shared lock (for this test code) ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An earlier version of this used a lock and it made the tests very slow (this version takes about ~2-3m to run the whole suite, I killed the version with locks at ~10m), I'm not entirely sure why.

@@ -50,37 +53,42 @@ public ScheduleTriggerEvent parseTriggerEvent(TriggerService service, String wat

@Override
public void start(Collection<Watch> jobs) {
Copy link
Copy Markdown
Contributor

@jakelandis jakelandis Apr 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method by itself is not thread safe. I would advise to synchronize this method, or use a shared lock between this start/stop/add/remove, which negates the need for the atomic reference, or throw an exception if attempted to start/stop twice.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll take another look at this. There's a few cases where this could potentially have issues now that I look at it again.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is invoked from TriggerService#start(...) and that method is synchronized, so I don't think we need to synchronize this method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added synchronized to the methods which modify watches to ensure this is thread safe. How does this look @jakelandis?

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

I've run the most recent revision (with a very minor tweak over the last reviewed version) over the weekend and it hasn't failed once in ~4000 runs, so I'm pretty confident in this version and intend to merge this after a couple successful CI runs.

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

@elasticmachine run elasticsearch-ci/2

Run was successful, just retriggering to get more confidence

@AthenaEryma
Copy link
Copy Markdown
Contributor Author

As suggested above, I'm going to merge this to master and wait a few days before backporting to make sure this is stable.

@AthenaEryma AthenaEryma merged commit ff61bad into elastic:master Apr 9, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Apr 9, 2019
…forced-unsafe-publication

* elastic/master:
  Improve Watcher test framework resiliency (elastic#40658)
  Fix order of request body search parameter names in documentation (elastic#40777)
  Node repurpose tool docs (elastic#40525)
  [Docs] Delete explanation for completion suggester default analyzer choice (elastic#36720)
  Revert "Revert "Change HLRC CCR response tests to use AbstractResponseTestCase base class. (elastic#40257)"" (elastic#40971)
  Short-circuit rebalancing when disabled (elastic#40966)
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Apr 9, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
AthenaEryma added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
AthenaEryma added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
AthenaEryma added a commit that referenced this pull request Apr 12, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
@AthenaEryma
Copy link
Copy Markdown
Contributor Author

These changes have been in master since the beginning of the week with no failures in the unmuted tests, so I've merged the backports as well.

gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
It is possible for the watches tracked by ScheduleTriggerEngineMock to
get out of sync with the Watches in the ScheduleTriggerEngine
production code, which can lead to watches failing to run.

This commit:

1. Changes TimeWarp to try to run the watch on all schedulers, rather than stopping after one which claims to have the watch registered. This reduces the impact of desynchronization between the mocking code and the backing production code.
2. Makes ScheduleTriggerEngineMock respect pauses of execution again. This is necessary to prevent duplicate watch invocations due to the above change.
3. Tweaks how watches are registered in ScheduleTriggerEngineMock to prevent race conditions due to concurrent modification.
4. Tweaks WatcherConcreteIndexTests to use TimeWarp instead of waiting for watches to be triggered, as TimeWarp is more reliable and accomplishes the same goal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants