Watcher: Fix race condition when reloading watches#33157
Merged
spinscale merged 3 commits intoelastic:masterfrom Aug 30, 2018
Merged
Watcher: Fix race condition when reloading watches#33157spinscale merged 3 commits intoelastic:masterfrom
spinscale merged 3 commits intoelastic:masterfrom
Conversation
The current watcher implementation had two issues on reload, that could lead to the existing watches not properly cleared out. One fix here was to ensure that when `TriggerService.start()` is called, we ensure in the trigger engine implementations that current watches are removed instead of adding to the existing ones in `TickerScheduleTriggerEngine.start()` The second fix is a bit more subtle, as the underlying issue is rooted more subtle due concurrent code. When `WatcherService.reload()` is called it calls in turn `WatcherService.reloadInner()`, which is synchronized. In the reload method we cleared out existing watches and executions. However there was still a small window of time, where the clearing could happen in relatively quick succession for two cluster states coming in, however due to timing issues the second clearing happened before the first starting of the trigger engine. This could lead due `TriggerEngine.start(Collection<Watch> watches)` being called twice with a different set of watches, without the existing ones being cleared out, resulting in the execution of watches on this node, that should not be executed. Also, there were two minor fixes 1. If the node is not a data node, we forgot to set the status to STARTING when watcher is being started. This should not be a big issue, because a non-data node does not spent a lot of time loading as there are no watches which need loading. 2. If a new cluster state came in during a reload, we had two checks in place to abort loading the current one. The first one before we load all the watches of the local node and the second before watcher is starting with those new watches. Turned out that the first check was not returning, which meant we always tried to load all the watches, and then would fail on the second check. This has been fixed here.
Collaborator
|
Pinging @elastic/es-core-infra |
hub-cap
approved these changes
Aug 29, 2018
Contributor
hub-cap
left a comment
There was a problem hiding this comment.
putAll considered evil :P Another good find!
Contributor
Author
|
I removed the 6.4.1 one label for this one, as the main issue cannot be triggered by the cluster state listener currently due to checking the watcher state before and only call |
spinscale
added a commit
that referenced
this pull request
Aug 30, 2018
This commit ensures that when `TriggerService.start()` is called, we ensure in the trigger engine implementations that current watches are removed instead of adding to the existing ones in `TickerScheduleTriggerEngine.start()` Two additional minor fixes, where the result remains the same but less code gets executed. 1. If the node is not a data node, we forgot to set the status to STARTING when watcher is being started. This should not be a big issue, because a non-data node does not spent a lot of time loading as there are no watches which need loading. 2. If a new cluster state came in during a reload, we had two checks in place to abort loading the current one. The first one before we load all the watches of the local node and the second before watcher is starting with those new watches. Turned out that the first check was not returning, which meant we always tried to load all the watches, and then would fail on the second check. This has been fixed here.
dnhatn
added a commit
that referenced
this pull request
Sep 1, 2018
* 6.x: Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test TEST: Disable soft-deletes in ParentChildTestCase TEST: Disable randomized soft-deletes settings Integrates soft-deletes into Elasticsearch (#33222) drop `index.shard.check_on_startup: fix` (#32279) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi [DOCS] Moves ml folder from x-pack/docs to docs (#33248) TEST: mute more SmokeTestWatcherWithSecurityIT tests [DOCS] Move rollup APIs to docs (#31450) [DOCS] Rename X-Pack Commands section (#33005) Fixes SecurityIntegTestCase so it always adds at least one alias (#33296) TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307) MINOR: Remove Dead Code from PathTrie (#33280) (#33306) Fix pom for build-tools (#33300) Lazy evaluate java9home (#33301) SQL: test coverage for JdbcResultSet (#32813) Work around to be able to generate eclipse projects (#33295) Different handling for security specific errors in the CLI. Fix for #33230 (#33255) [ML] Refactor delimited file structure detection (#33233) SQL: Support multi-index format as table identifier (#33278) Enable forbiddenapis server java9 (#33245) [MUTE] SmokeTestWatcherWithSecurityIT flaky tests Add region ISO code to GeoIP Ingest plugin (#31669) (#33276) Don't be strict for 6.x Update serialization versions for custom IndexMetaData backport Replace IndexMetaData.Custom with Map-based custom metadata (#32749) Painless: Fix Bindings Bug (#33274) SQL: prevent duplicate generation for repeated aggs (#33252) TEST: Mute testMonitorClusterHealth Fix serialization of empty field capabilities response (#33263) Fix nested _source retrieval with includes/excludes (#33180) [DOCS] TLS file resources are reloadable (#33258) Watcher: Ensure TriggerEngine start replaces existing watches (#33157) Ignore module-info in jar hell checks (#33011) Fix docs build after #33241 [DOC] Repository GCS ADC not supported (#33238) Upgrade to latest Gradle 4.10 (#32801) Fix/30904 cluster formation part2 (#32877) Move file-based discovery to core (#33241) HLRC: add client side RefreshPolicy (#33209) [Kerberos] Add unsupported languages for tests (#33253) Watcher: Reload properly on remote shard change (#33167) Fix classpath security checks for external tests. (#33066) [Rollup] Only allow aggregating on multiples of configured interval (#32052) Added deprecation warning for rescore in scroll queries (#33070) Apply settings filter to get cluster settings API (#33247) [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability (#32743) HLRC: create base timed request class (#33216) HLRC: Use Optional in validation logic (#33104) Painless: Add Bindings (#33042)
spinscale
added a commit
to spinscale/elasticsearch
that referenced
this pull request
Sep 3, 2018
This commit reverts most of elastic#33157 as it introduces another race condition and breaks a common case of watcher, when the first watch is added to the system and the index does not exist yet. This means, that the index will be created, which triggers a reload, but during this time the put watch operation that triggered this is not yet indexed, so that both processes finish roughly add the same time and should not overwrite each other but act complementary. This commit reverts the logic of cleaning out the ticker engine watches on start up, as this is done already when the execution is paused - which also gets paused on the cluster state listener again, as we can be sure here, that the watches index has not yet been created. This also adds a new test, that starts a one node cluster and emulates the case of a non existing watches index and a watch being added, which should result in proper execution. Closes elastic#33320
spinscale
added a commit
that referenced
this pull request
Sep 21, 2018
…3360) This commit reverts most of #33157 as it introduces another race condition and breaks a common case of watcher, when the first watch is added to the system and the index does not exist yet. This means, that the index will be created, which triggers a reload, but during this time the put watch operation that triggered this is not yet indexed, so that both processes finish roughly add the same time and should not overwrite each other but act complementary. This commit reverts the logic of cleaning out the ticker engine watches on start up, as this is done already when the execution is paused - which also gets paused on the cluster state listener again, as we can be sure here, that the watches index has not yet been created. This also adds a new test, that starts a one node cluster and emulates the case of a non existing watches index and a watch being added, which should result in proper execution. Closes #33320
spinscale
added a commit
that referenced
this pull request
Sep 21, 2018
…3360) This commit reverts most of #33157 as it introduces another race condition and breaks a common case of watcher, when the first watch is added to the system and the index does not exist yet. This means, that the index will be created, which triggers a reload, but during this time the put watch operation that triggered this is not yet indexed, so that both processes finish roughly add the same time and should not overwrite each other but act complementary. This commit reverts the logic of cleaning out the ticker engine watches on start up, as this is done already when the execution is paused - which also gets paused on the cluster state listener again, as we can be sure here, that the watches index has not yet been created. This also adds a new test, that starts a one node cluster and emulates the case of a non existing watches index and a watch being added, which should result in proper execution. Closes #33320
kcm
pushed a commit
that referenced
this pull request
Oct 30, 2018
…3360) This commit reverts most of #33157 as it introduces another race condition and breaks a common case of watcher, when the first watch is added to the system and the index does not exist yet. This means, that the index will be created, which triggers a reload, but during this time the put watch operation that triggered this is not yet indexed, so that both processes finish roughly add the same time and should not overwrite each other but act complementary. This commit reverts the logic of cleaning out the ticker engine watches on start up, as this is done already when the execution is paused - which also gets paused on the cluster state listener again, as we can be sure here, that the watches index has not yet been created. This also adds a new test, that starts a one node cluster and emulates the case of a non existing watches index and a watch being added, which should result in proper execution. Closes #33320
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The current watcher implementation had two issues on reload, that could
lead to the existing watches not properly cleared out.
One fix here was to ensure that when
TriggerService.start()is called,we ensure in the trigger engine implementations that current watches are
removed instead of adding to the existing ones in
TickerScheduleTriggerEngine.start()The second fix is a bit more subtle, as the underlying issue is rooted
more subtle due concurrent code.
When
WatcherService.reload()is called it calls in turnWatcherService.reloadInner(), which is synchronized. In the reloadmethod we cleared out existing watches and executions.
Also, there were two additional minor fixes
STARTING when watcher is being started. This should not be a big issue,
because a non-data node does not spent a lot of time loading as there
are no watches which need loading.
place to abort loading the current one. The first one before we load all
the watches of the local node and the second before watcher is starting
with those new watches. Turned out that the first check was not
returning, which meant we always tried to load all the watches, and then
would fail on the second check. This has been fixed here.