Watcher: Make start/stop cycle more predictable and synchronous#30118
Merged
spinscale merged 8 commits intoelastic:masterfrom May 3, 2018
Merged
Watcher: Make start/stop cycle more predictable and synchronous#30118spinscale merged 8 commits intoelastic:masterfrom
spinscale merged 8 commits intoelastic:masterfrom
Conversation
The current implementation starts/stops watcher using an executor. This can result in our of order operations. This commit reduces those executor calls to an absolute minimum in order to be able to do state changes within the cluster state listener method, which runs in sequence. When a state change occurs that forces the watcher service to pause (like no watcher index, no master node, no local shards), the service is now in a paused state. Pausing is a super lightweight operation, which marks the ExecutionService as paused and waits for the currently executing watches to finish in the background via an executor. The same applies for stopping, the potentially long running operation is outsourced in to an executor, as waiting for executed watches is decoupled from the current state. The only other long running operation is starting, where watches need to be loaded. This is also done via an executor, but has an additional protection by checking the cluster state version it was started with. If another cluster state version was trying to load the watches, then this loading will not take effect. This PR also cleans up some unused states, like the a simple boolean in the HistoryStore/TriggeredWatchStore marking it as started or stopped, as this can now be caught in the execution service. Another advantage of this approach is the fact, that now only triggered watches are not getting executed, while watches that are run via the Execute Watch API will still be executed regardless if watcher is stopped or not. Lastly the TickerScheduleTriggerEngine thread now only starts on data nodes.
Collaborator
|
Pinging @elastic/es-core-infra |
rjernst
approved these changes
Apr 25, 2018
Member
rjernst
left a comment
There was a problem hiding this comment.
This looks fine. One general thought is to move more into WatcherService from WatcherLifeCycleService, but I can do this in followups as we discussed.
| * @return the number of tasks that have been removed | ||
| */ | ||
| public synchronized int pauseExecution() { | ||
| public int pause() { |
Member
There was a problem hiding this comment.
Why is this no longer synchronized when the unPause() method is synchronized?
Contributor
Author
There was a problem hiding this comment.
removed synchronized from unPause(), no longer needed.
Contributor
Author
|
@elasticmachine retest this please |
spinscale
added a commit
that referenced
this pull request
May 3, 2018
The current implementation starts/stops watcher using an executor. This can result in our of order operations. This commit reduces those executor calls to an absolute minimum in order to be able to do state changes within the cluster state listener method, which runs in sequence. When a state change occurs that forces the watcher service to pause (like no watcher index, no master node, no local shards), the service is now in a paused state. Pausing is a super lightweight operation, which marks the ExecutionService as paused and waits for the currently executing watches to finish in the background via an executor. The same applies for stopping, the potentially long running operation is outsourced in to an executor, as waiting for executed watches is decoupled from the current state. The only other long running operation is starting, where watches need to be loaded. This is also done via an executor, but has an additional protection by checking the cluster state version it was started with. If another cluster state version was trying to load the watches, then this loading will not take effect. This PR also cleans up some unused states, like the a simple boolean in the HistoryStore/TriggeredWatchStore marking it as started or stopped, as this can now be caught in the execution service. Another advantage of this approach is the fact, that now only triggered watches are not getting executed, while watches that are run via the Execute Watch API will still be executed regardless if watcher is stopped or not. Lastly the TickerScheduleTriggerEngine thread now only starts on data nodes.
dnhatn
added a commit
that referenced
this pull request
May 4, 2018
* master: Set the new lucene version for 6.4.0 [ML][TEST] Clean up jobs in ModelPlotIT Upgrade to 7.4.0-snapshot-1ed95c097b (#30357) Watcher: Ensure trigger service pauses execution (#30363) [DOCS] Added coming qualifiers in changelog [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372) Security: reduce garbage during index resolution (#30180) Make RepositoriesMetaData contents unmodifiable (#30361) Change quad tree max levels to 29. Closes #21191 (#29663) Test: use trial license in qa tests with security [ML] Add integration test for model plots (#30359) SQL: Fix bug caused by empty composites (#30343) [ML] Account for gaps in data counts after job is reopened (#30294) InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121) Change signature of Get Repositories Response (#30333) Tests: Use different watch ids per test in smoke test (#30331) [Docs] Add term query with normalizer example Adds Eclipse config for xpack licence headers (#30299) Watcher: Make start/stop cycle more predictable and synchronous (#30118) [test] add debug logging for packaging test [DOCS] Removed X-Pack Breaking Changes [DOCS] Fixes link to TLS LDAP info Update versions for start_trial after backport (#30218) Packaging: Set elasticsearch user to have non-existent homedir (#29007) [DOCS] Fixes broken links to bootstrap user (#30349) Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641) Make licensing FIPS-140 compliant (#30251) [DOCS] Reorganizes authentication details in Stack Overview (#30280) Network: Remove http.enabled setting (#29601) Fix merging logic of Suggester Options (#29514) [DOCS] Adds LDAP realm configuration details (#30214) [DOCS] Adds native realm configuration details (#30215) ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316) [DOCS] Enables edit links for X-Pack pages (#30278) Packaging: Unmark systemd service file as a config file (#29004) SQL: Reduce number of ranges generated for comparisons (#30267) Tests: Simplify VersionUtils released version splitting (#30322) Cancelling a peer recovery on the source can leak a primary permit (#30318) Added changelog entry for deb prerelease version change (#30184) Convert server javadoc to html5 (#30279) Create default ES_TMPDIR on Windows (#30325) [Docs] Clarify `fuzzy_like_this` redirect (#30183) Post backport of #29658. Fix docs of the `_ignored` meta field. Remove MapperService#types(). (#29617) Remove useless version checks in REST tests. (#30165) Add a new `_ignored` meta field. (#29658) Move repository-azure fixture test to QA project (#30253) # Conflicts: # buildSrc/version.properties # server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
dnhatn
added a commit
that referenced
this pull request
May 8, 2018
* 6.x: Stop forking javac (#30462) Fix tribe tests Docs: Use task_id in examples of tasks (#30436) Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442) Packaging: Set elasticsearch user to have non-existent homedir (#29007) [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434) Avoid NPE in `more_like_this` when field has zero tokens (#30365) Build: Switch to building javadoc with html5 (#30440) Add a quick tour of the project to CONTRIBUTING (#30187) Add stricter geohash parsing (#30376) Reindex: Use request flavored methods (#30317) Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (#30432) Auto-expand replicas when adding or removing nodes (#30423) Silence IndexUpgradeIT test failures. (#30430) Fix line length violation in cache tests Add failing test for core cache deadlock [DOCS] convert forcemerge snippet Update forcemerge.asciidoc (#30113) Added zentity to the list of API extension plugins (#29143) Fix the search request default operation behavior doc (#29302) (#29405) Watcher: Mark watcher as started only after loading watches (#30403) Correct wording in log message (#30336) Do not fail snapshot when deleting a missing snapshotted file (#30332) AwaitsFix testCreateShrinkIndexToN DOCS: Correct mapping tags in put-template api DOCS: Fix broken link in the put index template api Add put index template api to high level rest client (#30400) [Docs] Add snippets for POS stop tags default value Remove entry inadvertently picked into changelog Move respect accept header on no handler to 6.3.1 Respect accept header on no handler (#30383) [Test] Add analysis-nori plugin to the vagrant tests [Rollup] Validate timezone in range queries (#30338) [Docs] Fix bad link [Docs] Fix end of section in the korean plugin docs add the Korean nori plugin to the change logs Expose the Lucene Korean analyzer module in a plugin (#30397) Docs: remove transport_client from CCS role example (#30263) Test: remove cluster permission from CCS user (#30262) Watcher: Remove unneeded index deletion in tests fix docs branch version fix lucene snapshot version Upgrade to 7.4.0-snapshot-1ed95c097b (#30357) [ML][TEST] Clean up jobs in ModelPlotIT Watcher: Ensure trigger service pauses execution (#30363) [DOCS] Fixes ordering of changelog sections [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372) Make RepositoriesMetaData contents unmodifiable (#30361) Change signature of Get Repositories Response (#30333) 6.x Backport: Terms query validate bug (#30319) InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121) Security: reduce garbage during index resolution (#30180) Test: use trial license in qa tests with security [ML] Add integration test for model plots (#30359) SQL: Fix bug caused by empty composites (#30343) [ML] Account for gaps in data counts after job is reopened (#30294) [ML] Refactor DataStreamDiagnostics to use array (#30129) Make licensing FIPS-140 compliant (#30251) Do not load global state when deleting a snapshot (#29278) Don't load global state when only restoring indices (#29239) Tests: Use different watch ids per test in smoke test (#30331) Watcher: Make start/stop cycle more predictable and synchronous (#30118) [Docs] Add term query with normalizer example Adds Eclipse config for xpack licence headers (#30299) Fix message content in users tool (#30293) [DOCS] Removed X-Pack breaking changes page [DOCS] Added security breaking change [DOCS] Fixes link to TLS LDAP info [DOCS] Merges X-Pack release notes into changelog (#30350) [DOCS] Fixes broken links to bootstrap user (#30349) [Docs] Remove errant changelog line Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641) [DOCS] Reorganizes authentication details in Stack Overview (#30280) Tests: Simplify VersionUtils released version splitting (#30322) Fix merging logic of Suggester Options (#29514) ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316) [DOCS] Adds LDAP realm configuration details (#30214) [DOCS] Adds native realm configuration details (#30215) Disable SSL on testing old BWC nodes (#30337) [DOCS] Enables edit links for X-Pack pages Cancelling a peer recovery on the source can leak a primary permit (#30318) SQL: Reduce number of ranges generated for comparisons (#30267) [DOCS] Adds links to changelog sections Convert server javadoc to html5 (#30279) REST Client: Add Request object flavored methods (#29623) Create default ES_TMPDIR on Windows (#30325) [Docs] Clarify `fuzzy_like_this` redirect (#30183) Fix docs of the `_ignored` meta field. Add a new `_ignored` meta field. (#29658) Move repository-azure fixture test to QA project (#30253)
Contributor
|
very cool @spinscale |
This was referenced Aug 21, 2018
jakelandis
added a commit
to jakelandis/elasticsearch
that referenced
this pull request
Jul 2, 2019
Watcher keeps track of which watches are currently running keyed by watcher name/id. If a watch is currently running it will not run the same watch and will result in a message : "Watch is already queued in thread pool" and a state: "not_executed_already_queued" When Watcher is stopped, it will stop watcher (rejecting any new watches), but allow the currently running watches to run to completion. Waiting for the currently running watches to complete is done async to the stopping of Watcher. Meaning that Watcher will report as fully stopped, but there is still a background thread waiting for all of the Watches to finish before it removes the watch from it's list of currently running Watches. The integration test start and stop watcher between each test. The goal to ensure a clean state between tests. However, since Watcher can report "yes - I am stopped", but there are still running Watches, the tests may bleed over into each other, especially on slow machines. This can result in errors related to "Watch is already queued in thread pool" and a state: "not_executed_already_queued", and is VERY difficult to reproduce. This commit changes the waiting for Watches on stop/pause from an aysnc waiting, back to a sync wait as it worked prior to elastic#30118. This help ensure that for testing testing scenario the stop much more predictable, such that after fully stopped, no Watches are running. This should have little impact if any on production code since Watcher isn't stopped/paused too often and when it stop/pause it has the same behavior is the same, it will just run on the calling thread, not a generic thread.
jakelandis
added a commit
to jakelandis/elasticsearch
that referenced
this pull request
Jul 2, 2019
Watcher keeps track of which watches are currently running keyed by watcher name/id. If a watch is currently running it will not run the same watch and will result in a message : "Watch is already queued in thread pool" and a state: "not_executed_already_queued" When Watcher is stopped, it will stop watcher (rejecting any new watches), but allow the currently running watches to run to completion. Waiting for the currently running watches to complete is done async to the stopping of Watcher. Meaning that Watcher will report as fully stopped, but there is still a background thread waiting for all of the Watches to finish before it removes the watch from it's list of currently running Watches. The integration test start and stop watcher between each test. The goal to ensure a clean state between tests. However, since Watcher can report "yes - I am stopped", but there are still running Watches, the tests may bleed over into each other, especially on slow machines. This can result in errors related to "Watch is already queued in thread pool" and a state: "not_executed_already_queued", and is VERY difficult to reproduce. This commit changes the waiting for Watches on stop/pause from an aysnc waiting, back to a sync wait as it worked prior to elastic#30118. This help ensure that for testing testing scenario the stop much more predictable, such that after fully stopped, no Watches are running. This should have little impact if any on production code since Watcher isn't stopped/paused too often and when it stop/pause it has the same behavior is the same, it will just run on the calling thread, not a generic thread.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The current implementation starts/stops watcher using an executor. This
can result in our of order operations.
This commit reduces those executor calls to an absolute minimum in order
to be able to do state changes within the cluster state listener method,
which runs in sequence.
When a state change occurs that forces the watcher service to pause
(like no watcher index, no master node, no local shards), the service is
now in a paused state.
Pausing is a super lightweight operation, which marks the
ExecutionService as paused and waits for the currently executing watches
to finish in the background via an executor. The same applies for
stopping, the potentially long running operation is outsourced in to an
executor, as waiting for executed watches is decoupled from the current
state.
The only other long running operation is starting, where watches need to
be loaded. This is also done via an executor, but has an additional
protection by checking the cluster state version it was started with. If
another cluster state version was trying to load the watches, then this
loading will not take effect.
This PR also cleans up some unused states, like the a simple boolean in
the HistoryStore/TriggeredWatchStore marking it as started or stopped,
as this can now be caught in the execution service.
Another advantage of this approach is the fact, that now only triggered
watches are not getting executed, while watches that are run via the
Execute Watch API will still be executed regardless if watcher is
stopped or not.
Lastly the TickerScheduleTriggerEngine thread now only starts on data nodes.