
Clear watch schedules when starting trigger engine#145325

Merged
michalborek merged 8 commits into elastic:main from michalborek:issue-131964
Apr 2, 2026

Conversation

@michalborek
Contributor

@michalborek michalborek commented Mar 31, 2026

When new nodes appear in the cluster, a watch may change the node it is running on.
If we don't reset the schedules for a given node, there is a chance that multiple nodes will execute the watch.

This could be observed in #131964, where some watcher actions were unexpectedly throttled (the same watch was run on two nodes and one of the executions was throttled).

Solving this problem uncovered another issue related to watchers. When the watcher service is being reloaded (e.g. due to watcher allocation changes) and a new watch is added at the same time, the WatcherIndexingListener directly modifies the TickerScheduleTriggerEngine, and the trigger engine's start method reverts this operation.

To fix that, recently added watches are treated separately from the rest and are re-added during the start operation, even if the fresh watch is not yet searchable.

Closes #131964
Closes #137562
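
The restart logic described above can be sketched as follows. The field names (`schedules`, `recentlyAddedSchedules`) follow the PR discussion, but the class shape, types, and method signatures here are simplified illustrations, not the actual engine code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of the trigger-engine restart behavior described above.
// Field names follow the PR discussion; everything else is illustrative.
class TriggerEngineSketch {
    // Active schedules for this node, keyed by watch id.
    final Map<String, String> schedules = new ConcurrentHashMap<>();
    // Watches added while the engine is reloading; re-applied on start.
    final Map<String, String> recentlyAddedSchedules = new ConcurrentHashMap<>();

    void add(String watchId, String schedule) {
        // A watch added mid-reload is remembered separately so that
        // start() does not silently drop it.
        recentlyAddedSchedules.put(watchId, schedule);
        schedules.put(watchId, schedule);
    }

    void start(Map<String, String> loadedSchedules) {
        // Clear stale entries first, so a watch reallocated to another
        // node is no longer triggered here. Previously only a merge ran,
        // leaving leftovers that could cause duplicate executions.
        schedules.clear();
        schedules.putAll(loadedSchedules);
        // Re-add watches created during the reload, even if they were
        // not yet searchable when loadedSchedules was built.
        schedules.putAll(recentlyAddedSchedules);
        recentlyAddedSchedules.clear();
    }
}
```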

@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @michalborek, I've created a changelog YAML for you.

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

This pull request addresses an issue in the Watcher trigger engine where watch schedules were not being properly cleared on restart. The TickerScheduleTriggerEngine.start() method now explicitly clears the existing schedules map before repopulating it with new schedules, instead of only using putAll() to merge entries. A corresponding test has been added to verify that previous watch schedules are cleaned up when the engine restarts with a new set of watches. Additionally, logging has been enhanced in PeriodThrottler to track throttling decisions. A changelog entry documents this bug fix.
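
The distinction the walkthrough draws, merging with putAll() versus clearing and repopulating, can be seen with a plain map. This is an illustrative sketch, not the engine's actual types:

```java
import java.util.HashMap;
import java.util.Map;

class PutAllVsClear {
    // putAll() merges: entries absent from the fresh snapshot survive,
    // which is how stale watch schedules lingered after a restart.
    static Map<String, String> mergeOnly(Map<String, String> current, Map<String, String> fresh) {
        current.putAll(fresh);
        return current;
    }

    // The fix: drop stale entries first, then repopulate from the snapshot.
    static Map<String, String> replaceAll(Map<String, String> current, Map<String, String> fresh) {
        current.clear();
        current.putAll(fresh);
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> fresh = Map.of("watchB", "5m");
        Map<String, String> merged = mergeOnly(new HashMap<>(Map.of("watchA", "1m")), fresh);
        Map<String, String> replaced = replaceAll(new HashMap<>(Map.of("watchA", "1m")), fresh);
        System.out.println("merge keeps stale watchA: " + merged.containsKey("watchA"));
        System.out.println("replace drops stale watchA: " + !replaced.containsKey("watchA"));
    }
}
```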

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
  • Linked Issues check — Passed. The PR implementation directly addresses the root cause identified in #131964: clearing watch schedules on trigger engine start prevents duplicate concurrent executions when node allocations change.
  • Out of Scope Changes check — Passed. All changes are scoped to the stated objective: clear() in TickerScheduleTriggerEngine, debug logging in PeriodThrottler, test coverage, and changelog entry.



@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

⏳ Starting custom recipe Update Documentation (Local Commit)...

@michalborek
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

✅ Actions performed

Full review triggered.

@elastic elastic deleted a comment from coderabbitai bot Mar 31, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


Inline comments:
In
`x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/watcher/actions/throttler/PeriodThrottler.java`:
- Around lines 9-10: Replace the Log4j imports and logger usage with the
Elasticsearch logging types. Remove the imports of
org.apache.logging.log4j.LogManager and org.apache.logging.log4j.Logger,
import org.elasticsearch.logging.LogManager and org.elasticsearch.logging.Logger
instead, then update the logger declaration (the static LOGGER variable created
via LogManager.getLogger(...) in PeriodThrottler) to use the Elasticsearch
LogManager/Logger types, so the class uses org.elasticsearch.logging.Logger
throughout.
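
The import swap the review asks for amounts to a two-line change plus the matching logger declaration. This diff is an illustration of the suggestion, not the actual patch from the PR:

```diff
-import org.apache.logging.log4j.LogManager;
-import org.apache.logging.log4j.Logger;
+import org.elasticsearch.logging.LogManager;
+import org.elasticsearch.logging.Logger;
```

The static `LOGGER` field created via `LogManager.getLogger(PeriodThrottler.class)` then resolves to the Elasticsearch logging facade without further changes at the call sites.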

📥 Commits

Reviewing files that changed from the base of the PR and between b1bfa5b and 30e3c49.

📒 Files selected for processing (4)
  • docs/changelog/145325.yaml
  • x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/watcher/actions/throttler/PeriodThrottler.java
  • x-pack/plugin/watcher/src/main/java/org/elasticsearch/xpack/watcher/trigger/schedule/engine/TickerScheduleTriggerEngine.java
  • x-pack/plugin/watcher/src/test/java/org/elasticsearch/xpack/watcher/trigger/schedule/engine/TickerScheduleEngineTests.java

@masseyke masseyke self-requested a review March 31, 2026 13:14
@elastic elastic deleted a comment from coderabbitai bot Mar 31, 2026
@michalborek michalborek marked this pull request as draft March 31, 2026 14:26
@elasticsearchmachine
Collaborator

Hi @michalborek, I've created a changelog YAML for you.

@michalborek michalborek marked this pull request as ready for review April 1, 2026 11:19
michalborek and others added 6 commits April 1, 2026 14:08
When new nodes appear in the cluster, the watcher allocation may change.
If we don't reset the schedules for a given node, there is a chance that
multiple nodes will execute the watch.
@elasticsearchmachine
Collaborator

Hi @michalborek, I've updated the changelog YAML for you.

Member

@masseyke masseyke left a comment


I think this is a big improvement. I think there might be another incredibly unlikely race condition or two (for example, remove doesn't look in recentlyAddedSchedules, but that would only be an issue if you called remove and add quickly, and I think only while the engine was also pausing and restarting -- if you do all 4 of those things in rapid succession, you've got to expect some weirdness). But I think this fixes the ones we've seen in production and in tests.

@michalborek michalborek merged commit a60205c into elastic:main Apr 2, 2026
35 checks passed
@michalborek michalborek deleted the issue-131964 branch April 2, 2026 08:00
@michalborek
Contributor Author

I think this is a big improvement. I think there might be another incredibly unlikely race condition or two (for example, remove doesn't look in recentlyAddedSchedules, but that would only be an issue if you called remove and add quickly, and I think only while the engine was also pausing and restarting -- if you do all 4 of those things in rapid succession, you've got to expect some weirdness). But I think this fixes the ones we've seen in production and in tests.

Yes, I agree. The symmetric to this one would be removing a watch while a reload happens. To fix this we'd need to have something like recentlyDeletedWatches...
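
The residual gap being discussed, remove() not consulting recentlyAddedSchedules, could be closed along these lines. This is a hypothetical sketch of the idea, not code from the PR:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: if remove() only touches the live schedules map,
// a watch still sitting in recentlyAddedSchedules would be resurrected
// by the next start().
class RemoveAwareEngine {
    final Map<String, String> schedules = new ConcurrentHashMap<>();
    final Map<String, String> recentlyAddedSchedules = new ConcurrentHashMap<>();

    void add(String watchId, String schedule) {
        recentlyAddedSchedules.put(watchId, schedule);
        schedules.put(watchId, schedule);
    }

    void remove(String watchId) {
        schedules.remove(watchId);
        // Without this line, start() would re-add a watch that was
        // deleted between add() and the reload completing.
        recentlyAddedSchedules.remove(watchId);
    }

    void start(Map<String, String> loaded) {
        schedules.clear();
        schedules.putAll(loaded);
        schedules.putAll(recentlyAddedSchedules);
        recentlyAddedSchedules.clear();
    }
}
```

A symmetric `recentlyDeletedWatches` set, as suggested above, would instead record deletions during the reload and filter them out in `start()`; the sketch here takes the simpler route of purging both maps in `remove()`.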

mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026
When new nodes appear in the cluster, the watcher allocation may change.
If we don't reset the schedules for a given node, there is a chance that
multiple nodes will execute the watch.

This fix revealed a race condition on watch insertion, which has been fixed by
adding a recentlyAddedSchedules collection that handles schedules added
during the watcher service reload.

Development

Successfully merging this pull request may close these issues:
  • Duplicate Watch Executions
  • [CI] FullClusterRestartIT testWatcherWithApiKey {cluster=UPGRADED} failing

3 participants