Skip to content

[Task manager] Adds support for limited concurrency tasks#90365

Merged
gmmorris merged 22 commits intoelastic:masterfrom
gmmorris:task-manager/support-limited-concurrency
Feb 11, 2021
Merged

[Task manager] Adds support for limited concurrency tasks#90365
gmmorris merged 22 commits intoelastic:masterfrom
gmmorris:task-manager/support-limited-concurrency

Conversation

@gmmorris
Copy link
Copy Markdown
Contributor

@gmmorris gmmorris commented Feb 4, 2021

Summary

closes #54916

Adds support for limited concurrency on a Task Type.

Usage

When defining a Task Type a developer may specify the maximum number of concurrent tasks of this type that can be run on each Kibana server.

For example:

    taskManager.registerTaskDefinitions({
      reportingTask: {
        title: 'Reporting Task With Max Concurrency of 2',
        maxConcurrency: 2,
        timeout: '60s',
        description:
          'A Reporting task that can only be ran concurrently up to 2 instances *on each* Kibana instance.',
        createTaskRunner: () => ({
          async run() {
            // run a report why won't ya?
          },
        }),
      },
    });

Given this reportingTask definition, each Kibana instance in the cluster can run up to 2 tasks of this type at a time.

Technical Notes

This max concurrency is achieved by limiting the number of tasks claimed by each Task Manager at a time.
This means that when Task Manager is about to claim available tasks it runs several batches of back-to-back queries:

  1. One round claims tasks of all types that have an unlimited concurrency, using a max_docs equal to the available workers.
  2. One round per type of limited concurrency that targets that specific type, adjusting the max_docs to equal either the available workers, or the max concurrency (which ever is lower).

As these queries are ran in sequence, it could be possible for the first query to starve subsequent queries by using up all the available workers, and as a result it could cause certain types to full behind (effectively the type targeted by the first query would have a higher priority than the subsequent ones).
To address this we shift the order of these queries on each cycle, so that whichever type was queried first in one round, will be queried last in the next and so forth. This means at time we might pick up a task that expired sooner after a task that expired later, but this should be corrected by the next cycle.

Checklist

Delete any items that are not applicable to this PR.

For maintainers

* master: (244 commits)
  [maps] Top hits per entity--change to title to use recent, minor edits (elastic#89254)
  [DOCS] Update installation details (elastic#90354)
  RFC for automatically generated typescript API documentation for every plugins public services, types, and functionality (elastic#86704)
  Elastic Maps Server config is `host` not `hostname` (elastic#90234)
  Use doc link services in index pattern management (elastic#89937)
  [Fleet] Managed Agent Policy (elastic#88688)
  [Workplace Search] Fix Source Settings bug  (elastic#90242)
  [Enterprise Search] Refactor MockRouter test helper to not store payload (elastic#90206)
  Use doc link service in more Stack Monitoring pages (elastic#89050)
  [App Search] Relevance Tuning logic - actions and selectors only, no listeners (elastic#89313)
  Remove UI filters from UI (elastic#89793)
  Use newfeed.service config for all newsfeeds (elastic#90252)
  skip flaky suite (elastic#85086)
  Add readme to geo containment alert covering test alert setup (elastic#89625)
  [APM] Enabling yesterday option when 24 hours is selected (elastic#90017)
  Test user for maps tests under import geoJSON tests (elastic#86015)
  [Lens] Hide column in table (elastic#88680)
  [Security Solution][Detections] Reduce detection engine reliance on _source (elastic#89371)
  [Discover] Minor cleanup (elastic#90260)
  [Search Session][Management] Rename "cancel" button and delete "Reload" button (elastic#90015)
  ...
@kibanamachine
Copy link
Copy Markdown
Contributor

kibanamachine commented Feb 5, 2021

💔 Build Failed

Failed CI Steps


Test Failures

Kibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it filters out running tasks

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

Error: expect(received).toContainEqual(expected) // deep equality

Expected value: {"bool": {"must": [{"term": {"task.ownerId": "5a2053f4-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}
Received array: [{"term": {"type": "task"}}, {"bool": {"must": [{"bool": {"must": [{"term": {"task.ownerId": "5a2053f4-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}, {"bool": {"filter": [{"bool": {"should": [[Object], [Object], [Object]]}}]}}]}}]
    at Object.<anonymous> (/dev/shm/workspace/kibana/x-pack/plugins/task_manager/server/queries/task_claiming.test.ts:821:33)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at _callCircusTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:212:5)
    at _runTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:149:3)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:63:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at run (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:25:3)
    at runAndTransformResultsToJestFormat (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:176:21)
    at jestAdapter (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:109:19)
    at runTestInternal (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:380:16)
    at runTest (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:472:34)
    at Object.worker (/dev/shm/workspace/kibana/node_modules/jest-runner/build/testWorker.js:133:12)

Kibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it filters out invalid tasks that arent SavedObjects

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

Error: expect(received).toContainEqual(expected) // deep equality

Expected value: {"bool": {"must": [{"term": {"task.ownerId": "5a2053f5-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}
Received array: [{"term": {"type": "task"}}, {"bool": {"must": [{"bool": {"must": [{"term": {"task.ownerId": "5a2053f5-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}, {"bool": {"filter": [{"bool": {"should": [[Object], [Object], [Object]]}}]}}]}}]
    at Object.<anonymous> (/dev/shm/workspace/kibana/x-pack/plugins/task_manager/server/queries/task_claiming.test.ts:919:33)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at _callCircusTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:212:5)
    at _runTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:149:3)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:63:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at run (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:25:3)
    at runAndTransformResultsToJestFormat (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:176:21)
    at jestAdapter (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:109:19)
    at runTestInternal (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:380:16)
    at runTest (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:472:34)
    at Object.worker (/dev/shm/workspace/kibana/node_modules/jest-runner/build/testWorker.js:133:12)

Kibana Pipeline / xpackJest / Jest Tests.x-pack/plugins/task_manager/server/queries.TaskClaiming claimAvailableTasks it returns task objects

Link to Jenkins

Standard Out

Failed Tests Reporter:
  - Test has not failed recently on tracked branches


Stack Trace

Error: expect(received).toContainEqual(expected) // deep equality

Expected value: {"bool": {"must": [{"term": {"task.ownerId": "5a2053f6-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}
Received array: [{"term": {"type": "task"}}, {"bool": {"must": [{"bool": {"must": [{"term": {"task.ownerId": "5a2053f6-2f09-11e9-a525-478ed346e6fd"}}, {"term": {"task.status": "claiming"}}]}}, {"bool": {"filter": [{"bool": {"should": [[Object], [Object], [Object]]}}]}}]}}]
    at Object.<anonymous> (/dev/shm/workspace/kibana/x-pack/plugins/task_manager/server/queries/task_claiming.test.ts:1017:33)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at _callCircusTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:212:5)
    at _runTest (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:149:3)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:63:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at _runTestsForDescribeBlock (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:57:9)
    at run (/dev/shm/workspace/kibana/node_modules/jest-circus/build/run.js:25:3)
    at runAndTransformResultsToJestFormat (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapterInit.js:176:21)
    at jestAdapter (/dev/shm/workspace/kibana/node_modules/jest-circus/build/legacy-code-todo-rewrite/jestAdapter.js:109:19)
    at runTestInternal (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:380:16)
    at runTest (/dev/shm/workspace/kibana/node_modules/jest-runner/build/runTest.js:472:34)
    at Object.worker (/dev/shm/workspace/kibana/node_modules/jest-runner/build/testWorker.js:133:12)

and 2 more failures, only showing the first 3.

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

* master: (55 commits)
  [APM-UI][E2E] use githubNotify step (elastic#90514)
  [APM] Export ProcessorEvent type (elastic#90540)
  [Lens] Retain column config (elastic#90048)
  [Data Table] Add unit tests (elastic#90173)
  Migrate most plugins to synchronous lifecycle (elastic#89562)
  skip flaky suite (elastic#90555)
  skip flaky suite (elastic#64473)
  [actions] improve email action doc (elastic#90020)
  [Fleet] Support Fleet server system indices (elastic#89372)
  skip flaky suite (elastic#90552)
  Bump immer dependencies (elastic#90267)
  Unrevert "Migrations v2: don't auto-create indices + FTR/esArchiver support (elastic#85778)" (elastic#89992)
  [Search Sessions] Use sync config (elastic#90138)
  chore(NA): add safe guard to remove bazelisk from yarn global at bootstrap (elastic#90538)
  [test] Await retry.waitFor (elastic#90456)
  chore(NA): integrate build buddy with our bazel setup and remote cache for ci (elastic#90116)
  Skip failing suite (elastic#90526)
  [Fleet] Fix incorrect conversion of string to numeric values in agent YAML (elastic#90371)
  [Docs] Update reporting troubleshooting for arm rhel/centos (elastic#90385)
  chore(NA): build bazel projects all at once in the distributable build process (elastic#90328)
  ...
Comment on lines -88 to -91
// The clusterMonitoring task occupies 2 workers, so if the system has 10 worker slots,
// 5 clusterMonitoring tasks could run concurrently per Kibana instance. This value is
// overridden by the `override_num_workers` config value, if specified.
numWorkers: 2,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hasn't been supported in ages but the docs were missed.

};
};

private async markAvailableTasksAsClaimed({
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was extracted from TaskStore and moved here in order to keep the Store focused on ES APIs and keep domain specific business logic separated.

/**
* Fetches tasks from the index, which are owned by the current Kibana instance
*/
private async sweepForClaimedTasks(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too was extracted from Task Store.

Comment on lines +135 to +139
maxConcurrency: schema.maybe(
schema.number({
min: 0,
})
),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to keep 0 as an option in case a TaskType wanted to disable the ability to enable their task in a specific Kibana instance via configuration.
This came up in the original Reporting discussions around preventing tasks from running in certain circumstances (for example, an Alerting only Kibana instance that can't run any other task type).

It seemed straight forward enough that enabling this here made sense to me- but open for discussion if anyone thinks this should have a min of 1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 🙏

@gmmorris gmmorris changed the title Task manager/support limited concurrency [Task manager] Adds support for limited concurrency tasks Feb 8, 2021
@gmmorris gmmorris added Feature:Task Manager release_note:enhancement Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.12.0 v8.0.0 labels Feb 8, 2021
@gmmorris gmmorris marked this pull request as ready for review February 8, 2021 16:16
@gmmorris gmmorris requested a review from a team as a code owner February 8, 2021 16:16
@elasticmachine
Copy link
Copy Markdown
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris gmmorris requested a review from tsullivan February 9, 2021 09:45
Copy link
Copy Markdown
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review looks good! Is there a way you suggest to manually validate (other than seeing the functional tests run?)

Comment on lines +135 to +139
maxConcurrency: schema.maybe(
schema.number({
min: 0,
})
),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?

@gmmorris
Copy link
Copy Markdown
Contributor Author

gmmorris commented Feb 9, 2021

Is there a way you suggest to manually validate (other than seeing the functional tests run?)
Sadly no, as no Task type implements this other than the types in the tests.

You could add this config to the Alerting tasks and see that only one alert of each type can run in parallel 🤔

@gmmorris
Copy link
Copy Markdown
Contributor Author

gmmorris commented Feb 9, 2021

@ymao1

Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?

That's a god question 🤔
When would you expect to see this message? We don't reject these tasks but rather we don't pick them up.
Should this be logged on each cycle? Presumably not.
Perhaps once when TM's starts querying? At the end of the setup stage?

@gmmorris
Copy link
Copy Markdown
Contributor Author

gmmorris commented Feb 9, 2021

@ymao1

Is there a good log message that indicates that this configuration might be the reason for a task not running? For debugging purposes?

That's a god question 🤔
When would you expect to see this message? We don't reject these tasks but rather we don't pick them up.
Should this be logged on each cycle? Presumably not.
Perhaps once when TM's starts querying? At the end of the setup stage?

Added in Task Claiming constructor, so this will get logged as part of Task Manager's Start phase, which is guaranteed to happen after all Task Type definitions are provided.

Copy link
Copy Markdown
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I created a bunch of alerts with actions and then set the max concurrency to 1 and saw that only 1 alert/action was picked up at a time.

@tsullivan
Copy link
Copy Markdown
Member

I have tested this change on top of #64853 and it works as expected :)

I confirm that with maxConcurrency: 1 only a single job executes at a time, even if more are pending:
image

@tsullivan tsullivan mentioned this pull request Feb 10, 2021
7 tasks
@gmmorris
Copy link
Copy Markdown
Contributor Author

I have tested this change on top of #64853 and it works as expected :)

I confirm that with maxConcurrency: 1 only a single job executes at a time, even if more are pending:
image

Amazing, thanks!

Once I get another review from the team this will be merged. 🎉

* master: (99 commits)
  [Fleet] Use Fleet Server indices in the search bar (elastic#90835)
  [Search Sessions] added an info flyout to session management (elastic#90559)
  [ILM] Revisit searchable snapshot field after new redesign (elastic#90793)
  [Alerting] License Errors on Alert List View (elastic#89920)
  RFC Improve saved object migrations algorithm (elastic#84333)
  [Lens] (Accessibility) Fix focus on drag and drop actions (elastic#90561)
  Use new shortcut links to Fleet discuss forums. (elastic#90786)
  Do not generate an ephemeral encryption key in production. (elastic#81511)
  [Fleet] Use staging registry for snapshot builds (elastic#90327)
  Actually deleting x-pack/tsconfig.refs.json (elastic#90898)
  Add deprecation warning to all Beats CM pages. (elastic#90741)
  skip flaky suite (elastic#90136)
  Revert "Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"" (elastic#90889)
  remove ref to removed tsconfig file
  [core.logging] Uses host timezone as default (elastic#90368)
  [Maps] remove maps_file_upload plugin and fold public folder into file_upload plugin (elastic#90292)
  Revert "[Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)"
  [dev-utils/ci-stats] support disabling ship errors (elastic#90851)
  Prefix with / (elastic#90836)
  [Metrics UI] Add Metrics Anomaly Alert Type (elastic#89244)
  ...
Copy link
Copy Markdown
Contributor

@YulNaumenko YulNaumenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a quick pass after seeing the two approvals, but was curious. LGTM.

Added a note about experimenting with shuffling the concurrency groups at startup, to perhaps avoid some version conflicts w/multiple Kibanas.

* master: (44 commits)
  [APM] Add experimental support for Data Streams (elastic#89650)
  [Search Session] Control "Kibana / Search Sessions" management section by privileges (elastic#90818)
  [Lens] Median as default function (elastic#90952)
  Implement custom global header banner (elastic#87438)
  [Fleet] Reduce permissions. (elastic#90302)
  Update dependency @elastic/charts to v24.5.1 (elastic#89822)
  [Create index pattern] Can't create single character index without wildcard (elastic#90919)
  [ts/build_ts_refs] add support for --clean flag (elastic#91060)
  Don't clean when running e2e tests (elastic#91057)
  Fixes track_total_hits in the body not having an effect when using search strategy (elastic#91068)
  [Security Solution][Detections] Adds list plugin Saved Objects to Security feature privilege (elastic#90895)
  Removing the code plugin entirely for 8.0 (elastic#77940)
  chore(NA): move the instruction to remove yarn global bazelisk package into the first place on install bazel tools (elastic#91026)
  [jest/ci] remove max-old-space-size override to use 4gb default (elastic#91020)
  [Fleet] Restrict integration changes for managed policies (elastic#90675)
  [CI] Fix auto-backport condditions so that it doesn't trigger for other labels (elastic#91042)
  [DOCS] Uses variable to refer to query profiler (elastic#90976)
  [App Search] Relevance Tuning logic listeners (elastic#89461)
  [Metrics UI] Fix saving/loading saved views from URL (elastic#90216)
  Limit cardinality of transaction.name (elastic#90955)
  ...
* master:
  [Alerting][Docs] adds documentation on NTP based synchronization (elastic#90747)
@gmmorris
Copy link
Copy Markdown
Contributor Author

@elasticmachine merge upstream

@kibanamachine
Copy link
Copy Markdown
Contributor

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@gmmorris gmmorris merged commit 619db36 into elastic:master Feb 11, 2021
gmmorris added a commit to gmmorris/kibana that referenced this pull request Feb 11, 2021
)

Adds support for limited concurrency on a Task Type.
gmmorris added a commit that referenced this pull request Feb 11, 2021
…91141)

Adds support for limited concurrency on a Task Type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Feature:Task Manager release_note:enhancement Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.12.0 v8.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Task Manager] Support for limited concurrency Task Types

7 participants