
Resolve concurrency with watcher trigger service #39092

Merged

tvernum merged 2 commits into elastic:master from tvernum:watcher-bootstrap-tests on Feb 19, 2019

Conversation

@tvernum (Contributor) commented Feb 19, 2019

The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Resolves: #39087
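The fix described above can be sketched as follows. This is a minimal illustration of the technique, not Watcher's real code: the class, field, and method names (`PerWatchStatsSketch`, `perWatchStats`, `count()`) are simplified stand-ins for the actual `TriggerService` internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Sketch of the fix: back the per-watch stats map with a ConcurrentHashMap
// so concurrent put/remove calls cannot corrupt its internal size
// bookkeeping. Names here are illustrative, not Watcher's real API.
public class PerWatchStatsSketch {
    // Before the fix this was a plain (non-thread-safe) map, whose size()
    // can become inconsistent under concurrent modification.
    private final Map<String, Object> perWatchStats = new ConcurrentHashMap<>();

    public void add(String watchId) {
        perWatchStats.put(watchId, new Object());
    }

    public void remove(String watchId) {
        perWatchStats.remove(watchId);
    }

    public int count() {
        return perWatchStats.size();
    }

    public static void main(String[] args) throws InterruptedException {
        PerWatchStatsSketch stats = new PerWatchStatsSketch();
        int threads = 8, perThread = 1000;
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    stats.add(id + "-" + i); // distinct keys per thread
                }
            });
            workers[t].start();
        }
        start.countDown(); // release all writers at once
        for (Thread w : workers) {
            w.join();
        }
        // With a plain HashMap this count is occasionally wrong under
        // concurrent writes; with ConcurrentHashMap it is always correct.
        System.out.println(stats.count());
    }
}
```

With a plain `HashMap` the same workload can intermittently report a smaller count, which matches the roughly 1-in-50 test failures described below.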

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features

}

@AwaitsFix(bugUrl = "Supposedly fixed; https://github.com/elastic/x-pack-elasticsearch/issues/1915")
public void testLoadExistingWatchesUponStartup() throws Exception {
tvernum (Contributor, Author) commented:

This test would fail approximately 1 in 50 times due to this bug.
The final assertBusy would fail because the returned getWatchesCount() would be 1 less than expected (numWatches), due to the map's internal size bookkeeping becoming corrupt.

If you iterated through the map and explicitly counted the items, the total was correct, but size() returned an incorrect value.

This test didn't set an id() on the Watch, but ConcurrentHashMap
doesn't allow null keys.

Note: Per Watch.equals and Watch.hashCode, id is not allowed to be
null
final String id = randomAlphaOfLengthBetween(3, 12);
Watch watch = mock(Watch.class);
when(watch.trigger()).thenReturn(trigger);
when(watch.id()).thenReturn(id);
tvernum (Contributor, Author) commented:

ConcurrentHashMap requires a non-null key, so this test would otherwise fail.
Watch.hashCode and Watch.equals already assume id is non-null, so the requirement imposed by ConcurrentHashMap is fine; it was just the test that was wrong.
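The null-key difference between the two map implementations is easy to demonstrate in isolation; the helper below is a hypothetical illustration, not code from the PR:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Shows why the test had to stub a non-null id(): HashMap tolerates a
// null key, while ConcurrentHashMap rejects it with a NullPointerException.
public class NullKeyDemo {

    // Returns true if the given map accepts a null key.
    public static boolean acceptsNullKey(Map<String, String> map) {
        try {
            map.put(null, "stats");
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptsNullKey(new HashMap<>()));           // true
        System.out.println(acceptsNullKey(new ConcurrentHashMap<>())); // false
    }
}
```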

@jakelandis (Contributor) left a comment:

@tvernum nice find! LGTM

@tvernum tvernum merged commit e694473 into elastic:master Feb 19, 2019
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 20, 2019
…follow

* elastic/master: (37 commits)
  Enable test logging for TransformIntegrationTests#testSearchTransform.
  stronger wording for ilm+rollover in docs (elastic#39159)
  Mute SingleNodeTests (elastic#39156)
  AwaitsFix XPackUsageIT#testXPackCcrUsage.
  Resolve concurrency with watcher trigger service (elastic#39092)
  Fix median calculation in MedianAbsoluteDeviationAggregatorTests (elastic#38979)
  [DOCS] Edits the remote clusters documentation (elastic#38996)
  add version 6.6.2
  Revert "Mute failing test 20_mix_typless_typefull (elastic#38781)" (elastic#38912)
  Rebuild remote connections on profile changes (elastic#37678)
  Document 'max_size' parameter as shard size for rollover (elastic#38750)
  Add some missing toString() implementations (elastic#39124)
  Migrate Streamable to Writeable for cluster block package (elastic#37391)
  fix RethrottleTests retry (elastic#38978)
  Disable date parsing test in non english locale (elastic#39052)
  Remove BCryptTests (elastic#39098)
  [ML] Stop the ML memory tracker before closing node (elastic#39111)
  Allow retention lease operations under blocks (elastic#39089)
  ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs (elastic#38822)
  Fix retention leases sync on recovery test
  ...
jkakavas added a commit that referenced this pull request Feb 20, 2019
There is a strong indication that the test was originally failing
for the same reason as testLoadExistingWatchesUponStartup. This was
fixed in #39092 and the cause is explained in
https://github.com/elastic/elasticsearch/pull/39092/files#r257895150
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
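Point 2 above relies on the fact that assertBusy retries only on AssertionError, so converting an exception into fail(...) makes it retryable. The sketch below is a simplified stand-in for the real ESTestCase.assertBusy helper, written only to illustrate that mechanism:

```java
// Simplified stand-in for ESTestCase.assertBusy: it retries only when the
// body throws an AssertionError; any other exception escapes immediately.
// Wrapping an exception in fail(...) (which throws AssertionError) is what
// makes a transient failure retryable. Sketch only, not the real framework.
public class AssertBusySketch {

    interface CheckedRunnable {
        void run() throws Exception;
    }

    // Mirrors the test framework's fail(): throws an AssertionError.
    static void fail(String message) {
        throw new AssertionError(message);
    }

    static void assertBusy(CheckedRunnable body, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                body.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the final attempt
                }
                Thread.sleep(10); // back off, then retry
            }
            // Note: a non-AssertionError exception is NOT caught here,
            // so it would propagate out of assertBusy immediately.
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulates a transiently failing check (like getWatchHistoryEntry
        // throwing a ResponseException): the first two attempts convert the
        // exception into fail(...), the third succeeds.
        assertBusy(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                try {
                    throw new Exception("simulated ResponseException");
                } catch (Exception e) {
                    fail("transient failure: " + e.getMessage());
                }
            }
        }, 10);
        System.out.println(calls[0]); // 3
    }
}
```

Without the catch-and-fail wrapper, the simulated exception would abort the loop on the first attempt instead of being retried.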
tvernum added a commit that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from #39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 21, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
tvernum added a commit that referenced this pull request Feb 21, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.


Development

Successfully merging this pull request may close these issues.

BootStrapTests.testLoadExistingWatchesUponStartup fails

3 participants