Skip to content

Improve Task Manager instrumentation#99160

Merged
ymao1 merged 16 commits intoelastic:masterfrom
dgieselaar:instrument-tm
Jun 4, 2021
Merged

Improve Task Manager instrumentation#99160
ymao1 merged 16 commits intoelastic:masterfrom
dgieselaar:instrument-tm

Conversation

@dgieselaar
Copy link
Copy Markdown
Contributor

@dgieselaar dgieselaar commented May 4, 2021

In #90955, I dialed down the cardinality of Task Manager transaction names due to the performance impact of breakdown metrics. Now that some performance issues are fixed in the APM Agent, and the need for better observability of the Task Manager is evident, it makes sense to revert those changes and add additional instrumentation.

The following changes were made:

  • Disparate transaction types for run, markAvailableTasksAsClaimed, markTaskAsRunning
  • Use the task type as the transaction name for run transaction types
  • Wrap task manager/action operations in custom spans, to group ES calls together
  • Store the traceparent in the task object, to enable distributed tracing for tasks (ie, if a task schedules another task, it will show up in the same trace)

Additionally, I've set refresh to false in some situations, as by default the SO client will use wait_for, and wait on a shard refresh. That could unnecessarily delay task completion, and consume resources that could otherwise be freed up. moved to #99919

For the screenshots, I disabled http/https instrumentation to prevent the http spans from showing up, and I locally worked around an instrumentation issue in the Node.js agent that messes up span relationships (see elastic/apm-agent-nodejs#1964 for a description of the problem).

Before:

image

After:

image

image

image

@dgieselaar dgieselaar requested a review from gmmorris May 4, 2021 08:33
@dgieselaar dgieselaar marked this pull request as ready for review May 4, 2021 10:00
@dgieselaar dgieselaar requested a review from a team as a code owner May 4, 2021 10:00
@dgieselaar dgieselaar added release_note:skip Skip the PR/issue when compiling release notes v7.14.0 labels May 4, 2021
Copy link
Copy Markdown
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Left a comment about adding some tests. Do you think you can provide instructions for generating the charts you have in your PR locally? Is that possible? I'd love to see it all working.

@gmmorris
Copy link
Copy Markdown
Contributor

Hi @elastic/kibana-alerting-services

Did anyone discuss the state of this PR with @dgieselaar ?
Will Dario be continuing this when he's back from leave, or are we going to do something with it ourselves? 🤔

@ymao1
Copy link
Copy Markdown
Contributor

ymao1 commented May 24, 2021

Hi @elastic/kibana-alerting-services

Did anyone discuss the state of this PR with @dgieselaar ?
Will Dario be continuing this when he's back from leave, or are we going to do something with it ourselves? 🤔

Sorry, I dropped the ball on that. I'm happy to add unit tests for the apm calls but I still don't know how to get APM running to see the traces. I can look through the docs

@ymao1
Copy link
Copy Markdown
Contributor

ymao1 commented May 24, 2021

@elasticmachine merge upstream

Copy link
Copy Markdown
Contributor

@ymao1 ymao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@smith
Copy link
Copy Markdown
Contributor

smith commented May 27, 2021

@elasticmachine merge upstream

@smith smith added the auto-backport Deprecated - use backport:version if exact versions are needed label May 27, 2021
Comment on lines +260 to +264
/**
* The serialized traceparent string of the current APM transaction or span.
*/
traceparent?: string;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a note in the Readme.md about using this field to trace the execution of tasks in apm.
Other plugins might want to utilise this.

return await this.store.schedule(modifiedTask);
return await this.store.schedule({
...modifiedTask,
traceparent: agent.currentTraceparent ?? '',
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does an empty string tell APM that there is no trace here? Or is this going to cause all tasks that don't have a traceparent to appear in APM as related because they all get grouped under an "empty string" transaction? 😬

@gmmorris
Copy link
Copy Markdown
Contributor

gmmorris commented Jun 2, 2021

@elasticmachine merge upstream

Copy link
Copy Markdown
Contributor

@gmmorris gmmorris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but added a couple of questions.
I'd appreciate if another member of @elastic/kibana-alerting-services took a look as @ymao1 as not just reviewer here but also contributer. :)

@chrisronline chrisronline self-requested a review June 2, 2021 11:41
@ymao1 ymao1 added the Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// label Jun 2, 2021
try {
this.task = this.definition.createTaskRunner(modifiedContext);
const result = await this.task.run();
const result = await withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, why is this.task! necessary in this PR, but not in master? (on the above line)

Copy link
Copy Markdown
Contributor

@chrisronline chrisronline left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Pretty cool to see how useful this integration is and will continue to be for us and users!

@ymao1
Copy link
Copy Markdown
Contributor

ymao1 commented Jun 4, 2021

@elasticmachine merge upstream

@kibanamachine
Copy link
Copy Markdown
Contributor

💚 Build Succeeded

Metrics [docs]

Unknown metric groups

API count

id before after diff
taskManager 43 44 +1

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@ymao1 ymao1 merged commit c79a29a into elastic:master Jun 4, 2021
kibanamachine added a commit to kibanamachine/kibana that referenced this pull request Jun 4, 2021
* Instrument task manager

* Don't refresh after SO updates

* Update snapshot test, remove beforeRun instrumentation

* Revert "Don't refresh after SO updates"

This reverts commit 54f848d.

* Fix task_store unit test

* Adding tests and updating ConcreteTaskInstance interface with traceparent

* Reverting unnecessary changes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Ying Mao <ying.mao@elastic.co>
@kibanamachine
Copy link
Copy Markdown
Contributor

💚 Backport successful

Status Branch Result
7.x

This backport PR will be merged automatically after passing CI.

kibanamachine added a commit that referenced this pull request Jun 4, 2021
* Instrument task manager

* Don't refresh after SO updates

* Update snapshot test, remove beforeRun instrumentation

* Revert "Don't refresh after SO updates"

This reverts commit 54f848d.

* Fix task_store unit test

* Adding tests and updating ConcreteTaskInstance interface with traceparent

* Reverting unnecessary changes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Ying Mao <ying.mao@elastic.co>

Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co>
Co-authored-by: Ying Mao <ying.mao@elastic.co>
gmmorris added a commit to gmmorris/kibana that referenced this pull request Jun 7, 2021
* master: (90 commits)
  Fix UI breaks on providing long search keyword in 'Search Box' (elastic#101385)
  Adds css class to EuiDescriptionListDescription in order to break word on exception details card (elastic#101481)
  [Lens] Increase timings for drag and drop tests (elastic#101380)
  [Lens] Fix editor react error on configuration panel (elastic#101367)
  [Fleet] Move integrations to a separate app (elastic#99848)
  Fix incorrect message displayed on importing Timeline Templates (elastic#101288)
  [Cases] RBAC (elastic#95058)
  [APM] Visual improvements for new APM layout with left navigation (elastic#101360)
  [master] More precise alerts matching (elastic#99820)
  [Lens] Value in legend (elastic#101353)
  Revert "[Reporting] ILM policy for managing reporting indices (elastic#100130)" (elastic#101358)
  [Discover] Fix header row of data grid in Firefox (elastic#101374)
  Add link to advanced setting in Discover (elastic#101154)
  Url service locators (elastic#101045)
  [Timelion] Update the removal message to mention the exact version (elastic#100994)
  [Security Solution][Detection Engine] Test cases for alias failure test cases where we don't copy aliases correctly (elastic#101437)
  [Event Log] Adding `type_id` to saved object array in event log (elastic#100939)
  [Reporting] Add `location.url` info to console message logs (elastic#101427)
  [Security Solutions][Detection Engine] Fixes timestamp bugs within source indexes when the formats are not ISO8601 format (elastic#101349)
  Improve Task Manager instrumentation (elastic#99160)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Deprecated - use backport:version if exact versions are needed Feature:Actions Feature:Task Manager release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Platform ResponseOps team (formerly the Cases and Alerting teams) t// v7.14.0 v8.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants