Resolve concurrency with watcher trigger service #39092
tvernum merged 2 commits into elastic:master from
Conversation
The watcher trigger service could attempt to modify the perWatchStats map simultaneously from multiple threads. This would cause the internal state to become inconsistent; in particular, the count() method may return an incorrect value for the number of watches. This change replaces the implementation of the map with a ConcurrentHashMap so that its internal state remains consistent even when accessed from multiple threads. Resolves: elastic#39087
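The fix described above can be sketched as follows. This is a minimal illustration, not the actual TriggerService code: a stats map keyed by watch id is mutated from multiple threads, and swapping a plain HashMap for a ConcurrentHashMap keeps size() consistent with the entries actually present. The class and method names below are hypothetical stand-ins.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the perWatchStats pattern: with a plain HashMap, concurrent
// put/remove can corrupt the internal table so that size() disagrees with
// the number of reachable entries; ConcurrentHashMap stays consistent.
public class PerWatchStatsSketch {
    // before the fix this would have been: new HashMap<>()
    private final Map<String, Long> perWatchStats = new ConcurrentHashMap<>();

    public void add(String watchId) {
        perWatchStats.put(watchId, 0L);
    }

    public void remove(String watchId) {
        perWatchStats.remove(watchId);
    }

    public int count() {
        return perWatchStats.size();
    }

    public static void main(String[] args) throws InterruptedException {
        PerWatchStatsSketch stats = new PerWatchStatsSketch();
        int numWatches = 1000;
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            final int offset = t;
            // each thread registers a disjoint quarter of the watches
            threads[t] = new Thread(() -> {
                for (int i = offset; i < numWatches; i += 4) {
                    stats.add("watch-" + i);
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        // With ConcurrentHashMap, count() reliably matches the number of puts.
        System.out.println(stats.count()); // prints 1000
    }
}
```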
Pinging @elastic/es-core-features
@AwaitsFix(bugUrl = "Supposedly fixed; https://github.com/elastic/x-pack-elasticsearch/issues/1915")
public void testLoadExistingWatchesUponStartup() throws Exception {
This test would fail approximately 1 in 50 times due to this bug.
The final assertBusy would fail because the returned getWatchesCount() would be 1 less than expected (numWatches) due to the size of the internal map becoming corrupt.
If you iterated through the map and explicitly counted the number of items, it was correct, but the size() returned an incorrect value.
This test didn't set an id() on the Watch, but ConcurrentHashMap doesn't allow null keys. Note: per Watch.equals and Watch.hashCode, id is not allowed to be null anyway.
final String id = randomAlphaOfLengthBetween(3, 12);
Watch watch = mock(Watch.class);
when(watch.trigger()).thenReturn(trigger);
when(watch.id()).thenReturn(id);
ConcurrentHashMap requires a non-null key, so this test would otherwise fail.
Watch.hashCode and Watch.equals already assume id is non-null, so the requirement imposed by ConcurrentHashMap is fine; it's just the test that was wrong.
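The null-key difference discussed in this comment is easy to demonstrate. This is an illustrative sketch (class name is hypothetical): HashMap tolerates a null key, while ConcurrentHashMap throws NullPointerException, which is why a mocked Watch without a stubbed id() would start failing after the switch.

```java
import java.util.HashMap;
import java.util.concurrent.ConcurrentHashMap;

// Demonstrates why a test that never sets an id breaks once the map
// becomes a ConcurrentHashMap: null keys are rejected with an NPE.
public class NullKeyDemo {
    public static void main(String[] args) {
        // HashMap accepts a single null key
        new HashMap<String, Integer>().put(null, 1);

        boolean rejected = false;
        try {
            // ConcurrentHashMap forbids null keys (and null values)
            new ConcurrentHashMap<String, Integer>().put(null, 1);
        } catch (NullPointerException e) {
            rejected = true;
        }
        System.out.println(rejected); // prints true
    }
}
```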
jakelandis left a comment
@tvernum nice find! LGTM
The watcher trigger service could attempt to modify the perWatchStats map simultaneously from multiple threads. This would cause the internal state to become inconsistent; in particular, the count() method may return an incorrect value for the number of watches. This change replaces the implementation of the map with a ConcurrentHashMap so that its internal state remains consistent even when accessed from multiple threads. Backport of: elastic#39092
There is a strong indication that the test was originally failing for the same reason as testLoadExistingWatchesUponStartup. This was fixed in #39092 and the cause is explained in https://github.com/elastic/elasticsearch/pull/39092/files#r257895150
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few times with various causes (not all of which we have logs for). This change enables the test again.
1. The fix from elastic#39092 should resolve any issues in assertWatchCount.
2. In at least one case, getWatchHistoryEntry failed due to a ResponseException, which is not caught by assertBusy. This commit catches those and calls "fail" so that assertBusy will sleep and retry.
3. Additional logging has been included to help diagnose any other failure causes.
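The retry pattern in point 2 can be sketched like this. Note the hedge: assertBusy below is a simplified stand-in for the Elasticsearch test helper, which retries only on AssertionError; an exception such as ResponseException escapes immediately unless the probe converts it into an assertion failure.

```java
// Sketch of "catch the exception and call fail so assertBusy retries".
// assertBusy here is a hypothetical simplified version of the ES helper:
// it retries when the probe throws AssertionError, and gives up otherwise.
public class RetryOnFailureSketch {
    static void assertBusy(Runnable probe, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                probe.run();
                return; // probe passed
            } catch (AssertionError e) {
                if (attempt == maxAttempts) {
                    throw e; // out of attempts, propagate the failure
                }
                Thread.sleep(10); // back off before retrying
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        assertBusy(() -> {
            calls[0]++;
            try {
                if (calls[0] < 3) {
                    // stand-in for getWatchHistoryEntry throwing ResponseException
                    throw new RuntimeException("simulated ResponseException");
                }
            } catch (RuntimeException e) {
                // convert the exception into an assertion failure ("fail")
                // so that assertBusy sleeps and retries instead of aborting
                throw new AssertionError("request failed, retrying", e);
            }
        }, 10);
        System.out.println(calls[0]); // prints 3
    }
}
```

Without the inner catch, the RuntimeException would propagate out of assertBusy on the first attempt instead of being retried.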