
Resolve concurrency with watcher trigger service #39092

Merged

tvernum merged 2 commits into elastic:master from tvernum:watcher-bootstrap-tests on Feb 19, 2019

Conversation

@tvernum (Contributor) commented Feb 19, 2019

The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Resolves: #39087
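The fix described above can be sketched as follows. This is a minimal illustration of the technique, not Watcher's real code: the class, field, and method names (`PerWatchStatsSketch`, `perWatchStats`, `count()`) are simplified stand-ins for the actual `TriggerService` internals.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Sketch of the fix: back the per-watch stats map with a ConcurrentHashMap
// so concurrent put/remove calls cannot corrupt its internal size
// bookkeeping. Names here are illustrative, not Watcher's real API.
public class PerWatchStatsSketch {
    // Before the fix this was a plain (non-thread-safe) map, whose size()
    // can become inconsistent under concurrent modification.
    private final Map<String, Object> perWatchStats = new ConcurrentHashMap<>();

    public void add(String watchId) {
        perWatchStats.put(watchId, new Object());
    }

    public void remove(String watchId) {
        perWatchStats.remove(watchId);
    }

    public int count() {
        return perWatchStats.size();
    }

    public static void main(String[] args) throws InterruptedException {
        PerWatchStatsSketch stats = new PerWatchStatsSketch();
        int threads = 8, perThread = 1000;
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                try {
                    start.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    stats.add(id + "-" + i); // distinct keys per thread
                }
            });
            workers[t].start();
        }
        start.countDown(); // release all writers at once
        for (Thread w : workers) {
            w.join();
        }
        // With a plain HashMap this count is occasionally wrong under
        // concurrent writes; with ConcurrentHashMap it is always correct.
        System.out.println(stats.count());
    }
}
```

With a plain `HashMap` the same workload can intermittently report a smaller count, which matches the roughly 1-in-50 test failures described below.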

@elasticmachine (Collaborator) commented:

Pinging @elastic/es-core-features

}

@AwaitsFix(bugUrl = "Supposedly fixed; https://github.com/elastic/x-pack-elasticsearch/issues/1915")
public void testLoadExistingWatchesUponStartup() throws Exception {
tvernum (Contributor, Author) commented:

This test would fail approximately 1 in 50 times due to this bug.
The final assertBusy would fail because the returned getWatchesCount() would be 1 less than expected (numWatches), due to the map's internal size bookkeeping becoming corrupt.

If you iterated through the map and explicitly counted the items, the total was correct, but size() returned an incorrect value.

This test didn't set an id() on the Watch, but ConcurrentHashMap
doesn't allow null keys.

Note: Per Watch.equals and Watch.hashCode, id is not allowed to be
null
final String id = randomAlphaOfLengthBetween(3, 12);
Watch watch = mock(Watch.class);
when(watch.trigger()).thenReturn(trigger);
when(watch.id()).thenReturn(id);
tvernum (Contributor, Author) commented:

ConcurrentHashMap requires a non-null key, so this test would otherwise fail.
Watch.hashCode and Watch.equals already assume id is non-null, so the requirement imposed by ConcurrentHashMap is fine; it was just the test that was wrong.
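The null-key difference between the two map implementations is easy to demonstrate in isolation; the helper below is a hypothetical illustration, not code from the PR:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Shows why the test had to stub a non-null id(): HashMap tolerates a
// null key, while ConcurrentHashMap rejects it with a NullPointerException.
public class NullKeyDemo {

    // Returns true if the given map accepts a null key.
    public static boolean acceptsNullKey(Map<String, String> map) {
        try {
            map.put(null, "stats");
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptsNullKey(new HashMap<>()));           // true
        System.out.println(acceptsNullKey(new ConcurrentHashMap<>())); // false
    }
}
```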

@jakelandis (Contributor) left a comment:

@tvernum nice find! LGTM

@tvernum tvernum merged commit e694473 into elastic:master Feb 19, 2019
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 20, 2019
…follow

* elastic/master: (37 commits)
  Enable test logging for TransformIntegrationTests#testSearchTransform.
  stronger wording for ilm+rollover in docs (elastic#39159)
  Mute SingleNodeTests (elastic#39156)
  AwaitsFix XPackUsageIT#testXPackCcrUsage.
  Resolve concurrency with watcher trigger service (elastic#39092)
  Fix median calculation in MedianAbsoluteDeviationAggregatorTests (elastic#38979)
  [DOCS] Edits the remote clusters documentation (elastic#38996)
  add version 6.6.2
  Revert "Mute failing test 20_mix_typless_typefull (elastic#38781)" (elastic#38912)
  Rebuild remote connections on profile changes (elastic#37678)
  Document 'max_size' parameter as shard size for rollover (elastic#38750)
  Add some missing toString() implementations (elastic#39124)
  Migrate Streamable to Writeable for cluster block package (elastic#37391)
  fix RethrottleTests retry (elastic#38978)
  Disable date parsing test in non english locale (elastic#39052)
  Remove BCryptTests (elastic#39098)
  [ML] Stop the ML memory tracker before closing node (elastic#39111)
  Allow retention lease operations under blocks (elastic#39089)
  ML refactor DatafeedsConfig(Update) so defaults are not populated in queries or aggs (elastic#38822)
  Fix retention leases sync on recovery test
  ...
jkakavas added a commit that referenced this pull request Feb 20, 2019
There is a strong indication that the test was originally failing
for the same reason as testLoadExistingWatchesUponStartup. This was
fixed in #39092 and the cause is explained in
https://github.com/elastic/elasticsearch/pull/39092/files#r257895150
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 20, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
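Point 2 above relies on the fact that assertBusy retries only on AssertionError, so converting an exception into fail(...) makes it retryable. The sketch below is a simplified stand-in for the real ESTestCase.assertBusy helper, written only to illustrate that mechanism:

```java
// Simplified stand-in for ESTestCase.assertBusy: it retries only when the
// body throws an AssertionError; any other exception escapes immediately.
// Wrapping an exception in fail(...) (which throws AssertionError) is what
// makes a transient failure retryable. Sketch only, not the real framework.
public class AssertBusySketch {

    interface CheckedRunnable {
        void run() throws Exception;
    }

    // Mirrors the test framework's fail(): throws an AssertionError.
    static void fail(String message) {
        throw new AssertionError(message);
    }

    static void assertBusy(CheckedRunnable body, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                body.run();
                return; // assertion passed
            } catch (AssertionError e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the final attempt
                }
                Thread.sleep(10); // back off, then retry
            }
            // Note: a non-AssertionError exception is NOT caught here,
            // so it would propagate out of assertBusy immediately.
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulates a transiently failing check (like getWatchHistoryEntry
        // throwing a ResponseException): the first two attempts convert the
        // exception into fail(...), the third succeeds.
        assertBusy(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                try {
                    throw new Exception("simulated ResponseException");
                } catch (Exception e) {
                    fail("transient failure: " + e.getMessage());
                }
            }
        }, 10);
        System.out.println(calls[0]); // 3
    }
}
```

Without the catch-and-fail wrapper, the simulated exception would abort the loop on the first attempt instead of being retried.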
tvernum added a commit that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
tvernum added a commit that referenced this pull request Feb 20, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from #39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 21, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: elastic#39092
tvernum added a commit that referenced this pull request Feb 21, 2019
The watcher trigger service could attempt to modify the perWatchStats
map simultaneously from multiple threads. This would cause the
internal state to become inconsistent, in particular the count()
method may return an incorrect value for the number of watches.

This change replaces the implementation of the map with a
ConcurrentHashMap so that its internal state remains consistent even
when accessed from multiple threads.

Backport of: #39092
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.
weizijun pushed a commit to weizijun/elasticsearch that referenced this pull request Feb 22, 2019
SmokeTestWatcherTestSuiteIT.testMonitorClusterHealth has failed a few
times with various causes (not all of which we have logs for).

This change enables the test again.

1. The fix from elastic#39092 should resolve any issues in assertWatchCount
2. In at least 1 case, getWatchHistoryEntry failed due to a
   ResponseException, which is not caught by assertBusy.
   This commit catches those and calls "fail" so that assertBusy will
   sleep and retry
3. Additional logging has been included to help diagnose any other
   failure causes.


Development

Successfully merging this pull request may close these issues.

BootStrapTests.testLoadExistingWatchesUponStartup fails

3 participants