[Task Manager] time out work when it overruns in poller#74980
[Task Manager] time out work when it overruns in poller#74980gmmorris merged 3 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
|
@elasticmachine merge upstream |
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
pmuellr
left a comment
There was a problem hiding this comment.
LGTM, but left some questions, mainly about potential timing issues
| // The `work` phase includes the prework needed *before* executing a task | ||
| // (such as polling for new work, marking tasks as running etc.) but does not | ||
| // include the time of actually running the task | ||
| workTimeout: opts.config.poll_interval * opts.config.max_poll_inactivity_cycles, |
There was a problem hiding this comment.
Curious why you used a "count" as the config bit here, instead of a duration - my guess is the "count" makes this setting "move" if the poll interval changes, which I think is a nice pattern - I'd been thinking some of our other explicit timeouts could be "relative" like this as well - eg, throttle based on alert interval (eg, throttle for 10 intervals, not 10 minutes, or probably allow both).
There was a problem hiding this comment.
Yup, Mike asked the same. :)
It's exactly that - to make it relative to the polling interval.
It also reduces the scenarios you need to support, such as a work duration that's lower than polling interval, and the potential complexity that introduces.
| pollRequests$.next(some(three)); | ||
|
|
||
| advance(workTimeout); | ||
| await sleep(workTimeout); |
There was a problem hiding this comment.
Should we account for a bit of slop here? Seems like it shouldn't be possible for this to resolve before the workTimeout, but you know ... node, time, etc, heh. Maybe just add 100ms or so?
There was a problem hiding this comment.
I wanted to start here and address if it does introduce flakiness.
In this case it's using fake timers, which I've found to be quite accurate, because it isn't actually time based, but rather is based on the queued timeouts.
There was a problem hiding this comment.
ya, I'm guessing even if we did see some timing flakiness, it would be fairly obvious that was the problem, given the context, so WORKSFORME
| await promiseResult<H, Error>( | ||
| timeoutPromiseAfter<H, Error>( | ||
| work(...pullFromSet(set, getCapacity())), | ||
| workTimeout ?? pollInterval, |
There was a problem hiding this comment.
What is the case for workTimeout being undefined? Tests? I worry that if there a path in production code to have this as undefined, the pollInterval value used would be too close to a valid cycle - should it at least be some kind of multiple of pollInterval, maybe like 2 or something?
There was a problem hiding this comment.
hmm it was really just a default... but yeah, you're right that if someone sets the interval to be lower than the latency of talking to Elasticsearch this will behave badly 🤔
That said... same is true for the actual polling....
Not sure about the * 2 approach, as we have no idea what the interval will be and it might be set to something ridiculous like 30m, in which case, timeout will be 60m....
That said... if your poll is 30m you don't care about immediate results anyway, do you 🤔
There was a problem hiding this comment.
As the default config is set to 10 I think we can leave it as is for now and see how it behaves.
Unless someone specifically changes it, it'll always be far longer than the interval itself, and as this is here as a safeguard against issues in marking tasks as running (this time does not include task execution) I feel comfortable waiting to see how it behaves in the real world.
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work. This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
* master: Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) [Security Solution][Detections] Adds exception modal tests (elastic#74596) [Dashboard] Sample data link does not work (elastic#75262) [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905) [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166) convert processor labels to sentence case (elastic#75278) [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160) Clarify no documents error message when filtering by is_training (elastic#75227) [Lens] Fix crash when two layers xychart switches to pie (elastic#75267) [Observability Homepage] Fix console error because of side effect (elastic#75258) [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146) [ML] Functional tests - re-activate DFA test suites (elastic#75257) GS providers improvements (elastic#75174) [Visualize] First version of by-value visualize editor (elastic#72256)
…emove-header * saved-objects/version-on-create: (59 commits) remove version when loading sample data omit version from SO import/export Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) [Security Solution][Detections] Adds exception modal tests (elastic#74596) Revert "Revert "added missing core docs"" Revert "Revert "added version to saved object bulk creation"" [Dashboard] Sample data link does not work (elastic#75262) [Dashboard First] Unlink from Library Action With ReferenceOrValueEmbeddable (elastic#74905) [Form lib] Fix issue where serializer on fields are called on every change (elastic#75166) convert processor labels to sentence case (elastic#75278) [Monaco] Refactor the way XJSON grammar checker gets registered (elastic#75160) Clarify no documents error message when filtering by is_training (elastic#75227) [Lens] Fix crash when two layers xychart switches to pie (elastic#75267) [Observability Homepage] Fix console error because of side effect (elastic#75258) [Usage Collection] Add `legacy=true` option to the /api/stats request in the docs (elastic#75146) [ML] Functional tests - re-activate DFA test suites (elastic#75257) GS providers improvements (elastic#75174) ...
) If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work. This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
* master: (112 commits) [Ingest Manager] Fix agent config rollout rate limit to use constants (elastic#75364) Update Node.js to version 10.22.0 (elastic#75254) [ML] Anomaly Explorer / Single Metric Viewer: Fix error reporting for annotations. (elastic#74953) [Discover] Fix histogram cloud tests (elastic#75268) Uiactions to navigate to visualize or maps (elastic#74121) Use prefix search invis editor field/agg combo box (elastic#75290) Fix docs in trigger alerting UI (elastic#75363) [SIEM] Fixes search bar Cypress test (elastic#74833) Add libnss3.so to Dockerfile template (reporting) (elastic#75370) [Discover] Create field_button and add popovers to sidebar (elastic#73226) [Reporting] Network Policy: Do not throw from the intercept handler (elastic#75105) [Reporting] Increase capture.timeouts.openUrl to 1 minute (elastic#75207) Allow routes to specify the idle socket timeout in addition to the payload timeout (elastic#73730) [src/dev/build] remove node-version from snapshots (elastic#75303) [ENDPOINT] Reintroduced tabs to endpoint management and migrated pages to use common security components (elastic#74886) [Canvas] Remove dependency on legacy expressions APIs (elastic#74885) Skip failing test in CI (elastic#75266) [Task Manager] time out work when it overruns in poller (elastic#74980) [Drilldowns] misc improvements & fixes (elastic#75276) Small README note on bumping memory for builds (elastic#75247) ...
Summary
This PR addresses a potential failure case discovered during the investigation of this issue: #74785
If the work performed by the poller hangs, meaning the promise fails to resolve/reject, then the poller can get stuck in a mode where it just waits for ever and no longer polls for fresh work.
This PR introduces a timeout after which the poller will automatically reject the work, freeing the poller to restart pulling fresh work.
Checklist
Delete any items that are not applicable to this PR.
For maintainers