[Task manager] Adds support for limited concurrency tasks#90365
[Task manager] Adds support for limited concurrency tasks#90365gmmorris merged 22 commits intoelastic:masterfrom
Conversation
* master: (244 commits) [maps] Top hits per entity--change to title to use recent, minor edits (elastic#89254) [DOCS] Update installation details (elastic#90354) RFC for automatically generated typescript API documentation for every plugins public services, types, and functionality (elastic#86704) Elastic Maps Server config is `host` not `hostname` (elastic#90234) Use doc link services in index pattern management (elastic#89937) [Fleet] Managed Agent Policy (elastic#88688) [Workplace Search] Fix Source Settings bug (elastic#90242) [Enterprise Search] Refactor MockRouter test helper to not store payload (elastic#90206) Use doc link service in more Stack Monitoring pages (elastic#89050) [App Search] Relevance Tuning logic - actions and selectors only, no listeners (elastic#89313) Remove UI filters from UI (elastic#89793) Use newfeed.service config for all newsfeeds (elastic#90252) skip flaky suite (elastic#85086) Add readme to geo containment alert covering test alert setup (elastic#89625) [APM] Enabling yesterday option when 24 hours is selected (elastic#90017) Test user for maps tests under import geoJSON tests (elastic#86015) [Lens] Hide column in table (elastic#88680) [Security Solution][Detections] Reduce detection engine reliance on _source (elastic#89371) [Discover] Minor cleanup (elastic#90260) [Search Session][Management] Rename "cancel" button and delete "Reload" button (elastic#90015) ...
💔 Build Failed
Failed CI StepsTest FailuresKibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it filters out running tasksStandard OutStack TraceKibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it filters out invalid tasks that arent SavedObjectsStandard OutStack TraceKibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it returns task objectsStandard OutStack Traceand 2 more failures, only showing the first 3. Metrics [docs]
History
To update your PR or re-run it, just comment with: |
* master: (55 commits) [APM-UI][E2E] use githubNotify step (elastic#90514) [APM] Export ProcessorEvent type (elastic#90540) [Lens] Retain column config (elastic#90048) [Data Table] Add unit tests (elastic#90173) Migrate most plugins to synchronous lifecycle (elastic#89562) skip flaky suite (elastic#90555) skip flaky suite (elastic#64473) [actions] improve email action doc (elastic#90020) [Fleet] Support Fleet server system indices (elastic#89372) skip flaky suite (elastic#90552) Bump immer dependencies (elastic#90267) Unrevert "Migrations v2: don't auto-create indices + FTR/esArchiver support (elastic#85778)" (elastic#89992) [Search Sessions] Use sync config (elastic#90138) chore(NA): add safe guard to remove bazelisk from yarn global at bootstrap (elastic#90538) [test] Await retry.waitFor (elastic#90456) chore(NA): integrate build buddy with our bazel setup and remote cache for ci (elastic#90116) Skip failing suite (elastic#90526) [Fleet] Fix incorrect conversion of string to numeric values in agent YAML (elastic#90371) [Docs] Update reporting troubleshooting for arm rhel/centos (elastic#90385) chore(NA): build bazel projects all at once in the distributable build process (elastic#90328) ...
| // The clusterMonitoring task occupies 2 workers, so if the system has 10 worker slots, | ||
| // 5 clusterMonitoring tasks could run concurrently per Kibana instance. This value is | ||
| // overridden by the `override_num_workers` config value, if specified. | ||
| numWorkers: 2, |
There was a problem hiding this comment.
This hasn't been supported in ages but the docs were missed.
| }; | ||
| }; | ||
|
|
||
| private async markAvailableTasksAsClaimed({ |
There was a problem hiding this comment.
This was extracted from TaskStore and moved here in order to keep the Store focused on ES APIs and keep domain specific business logic separated.
| /** | ||
| * Fetches tasks from the index, which are owned by the current Kibana instance | ||
| */ | ||
| private async sweepForClaimedTasks( |
There was a problem hiding this comment.
This too was extracted from Task Store.
| maxConcurrency: schema.maybe( | ||
| schema.number({ | ||
| min: 0, | ||
| }) | ||
| ), |
There was a problem hiding this comment.
I chose to keep 0 as an option in case a TaskType wanted to disable the ability to enable their task in a specific Kibana instance via configuration.
This came up in the original Reporting discussions around preventing tasks from running in certain circumstances (for example, an Alerting only Kibana instance that can't run any other task type).
It seemed straight forward enough that enabling this here made sense to me- but open for discussion if anyone thinks this should have a min of 1.
There was a problem hiding this comment.
Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?
|
Pinging @elastic/kibana-alerting-services (Team:Alerting Services) |
ymao1
left a comment
There was a problem hiding this comment.
Code review looks good! Is there a way you suggest to manually validate (other than seeing the functional tests run?)
| maxConcurrency: schema.maybe( | ||
| schema.number({ | ||
| min: 0, | ||
| }) | ||
| ), |
There was a problem hiding this comment.
Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?
You could add this config to the Alerting tasks and see that only one alert of each type can run in parallel 🤔 |
That's a god question 🤔 |
Added in Task Claiming constructor, so this will get logged as part of Task Manager's Start phase, which is guaranteed to happen after all Task Type definitions are provided. |
ymao1
left a comment
There was a problem hiding this comment.
LGTM! I created a bunch of alerts with actions and then set the max concurrency to 1 and saw that only 1 alert/action was picked up at a time.
|
I have tested this change on top of #64853 and it works as expected :) I confirm that with |
Amazing, thanks! Once I get another review from the team this will be merged. 🎉 |
* master: (99 commits) [Fleet] Use Fleet Server indices in the search bar (elastic#90835) [Search Sessions] added an info flyout to session management (elastic#90559) [ILM] Revisit searchable snapshot field after new redesign (elastic#90793) [Alerting] License Errors on Alert List View (elastic#89920) RFC Improve saved object migrations algorithm (elastic#84333) [Lens] (Accessibility) Fix focus on drag and drop actions (elastic#90561) Use new shortcut links to Fleet discuss forums. (elastic#90786) Do not generate an ephemeral encryption key in production. (elastic#81511) [Fleet] Use staging registry for snapshot builds (elastic#90327) Actually deleting x-pack/tsconfig.refs.json (elastic#90898) Add deprecation warning to all Beats CM pages. (elastic#90741) skip flaky suite (elastic#90136) Revert "Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"" (elastic#90889) remove ref to removed tsconfig file [core.logging] Uses host timezone as default (elastic#90368) [Maps] remove maps_file_upload plugin and fold public folder into file_upload plugin (elastic#90292) Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)" [dev-utils/ci-stats] support disabling ship errors (elastic#90851) Prefix with / (elastic#90836) [Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244) ...
pmuellr
left a comment
There was a problem hiding this comment.
I just did a quick pass after seeing the two approvals, but was curious. LGTM.
Added a note about experimenting with shuffling the concurrency groups at startup, to perhaps avoid some version conflicts w/multiple Kibanas.
* master: (44 commits) [APM] Add experimental support for Data Streams (elastic#89650) [Search Session] Control "Kibana / Search Sessions" management section by privileges (elastic#90818) [Lens] Median as default function (elastic#90952) Implement custom global header banner (elastic#87438) [Fleet] Reduce permissions. (elastic#90302) Update dependency @elastic/charts to v24.5.1 (elastic#89822) [Create index pattern] Can't create single character index without wildcard (elastic#90919) [ts/build_ts_refs] add support for --clean flag (elastic#91060) Don't clean when running e2e tests (elastic#91057) Fixes track_total_hits in the body not having an effect when using search strategy (elastic#91068) [Security Solution][Detections] Adds list plugin Saved Objects to Security feature privilege (elastic#90895) Removing the code plugin entirely for 8.0 (elastic#77940) chore(NA): move the instruction to remove yarn global bazelisk package into the first place on install bazel tools (elastic#91026) [jest/ci] remove max-old-space-size override to use 4gb default (elastic#91020) [Fleet] Restrict integration changes for managed policies (elastic#90675) [CI] Fix auto-backport condditions so that it doesn't trigger for other labels (elastic#91042) [DOCS] Uses variable to refer to query profiler (elastic#90976) [App Search] Relevance Tuning logic listeners (elastic#89461) [Metrics UI] Fix saving/loading saved views from URL (elastic#90216) Limit cardinality of transaction.name (elastic#90955) ...
* master: [Alerting][Docs] adds documentation on NTP based synchronization (elastic#90747)
|
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]
History
To update your PR or re-run it, just comment with: |

Summary
closes #54916
Adds support for limited concurrency on a Task Type.
Usage
When defining a Task Type a developer may specify the maximum number of concurrent tasks of this type that can be run on each Kibana server.
For example:
Given this
reportingTaskdefinition, each Kibana instance in the cluster can run up to 2 tasks of this type at a time.Technical Notes
This max concurrency is achieved by limiting the number of tasks claimed by each Task Manager at a time.
This means that when Task Manager is about to claim available tasks it runs several batches of back-to-back queries:
max_docsequal to the available workers.max_docsto equal either the available workers, or the max concurrency (which ever is lower).As these queries are ran in sequence, it could be possible for the first query to starve subsequent queries by using up all the available workers, and as a result it could cause certain types to full behind (effectively the type targeted by the first query would have a higher priority than the subsequent ones).
To address this we shift the order of these queries on each cycle, so that whichever type was queried first in one round, will be queried last in the next and so forth. This means at time we might pick up a task that expired sooner after a task that expired later, but this should be corrected by the next cycle.
Checklist
Delete any items that are not applicable to this PR.
For maintainers