Improve Task Manager instrumentation#99160
Conversation
ymao1
left a comment
There was a problem hiding this comment.
Looks good! Left a comment about adding some tests. Do you think you can provide instructions for generating the charts you have in your PR locally? Is that possible? I'd love to see it all working.
|
Hi @elastic/kibana-alerting-services Did anyone discuss the state of this PR with @dgieselaar ? |
Sorry, I dropped the ball on that. I'm happy to add unit tests for the apm calls but I still don't know how to get APM running to see the traces. I can look through the docs |
|
@elasticmachine merge upstream |
|
@elasticmachine merge upstream |
| /** | ||
| * The serialized traceparent string of the current APM transaction or span. | ||
| */ | ||
| traceparent?: string; | ||
|
|
There was a problem hiding this comment.
I think we should add a note in the Readme.md about using this field to trace the execution of tasks in apm.
Other plugins might want to utilise this.
| return await this.store.schedule(modifiedTask); | ||
| return await this.store.schedule({ | ||
| ...modifiedTask, | ||
| traceparent: agent.currentTraceparent ?? '', |
There was a problem hiding this comment.
Does an empty string tell APM that there is no trace here? Or is this going to cause all tasks that don't have a traceparent to appear in APM as related because they all get grouped under an "empty string" transaction? 😬
|
@elasticmachine merge upstream |
| try { | ||
| this.task = this.definition.createTaskRunner(modifiedContext); | ||
| const result = await this.task.run(); | ||
| const result = await withSpan({ name: 'run', type: 'task manager' }, () => this.task!.run()); |
There was a problem hiding this comment.
Just curious, why is this.task! necessary in this PR, but not in master? (on the above line)
chrisronline
left a comment
There was a problem hiding this comment.
LGTM!
Pretty cool to see how useful this integration is and will continue to be for us and users!
|
@elasticmachine merge upstream |
💚 Build SucceededMetrics [docs]History
To update your PR or re-run it, just comment with: |
* Instrument task manager * Don't refresh after SO updates * Update snapshot test, remove beforeRun instrumentation * Revert "Don't refresh after SO updates" This reverts commit 54f848d. * Fix task_store unit test * Adding tests and updating ConcreteTaskInstance interface with traceparent * Reverting unnecessary changes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Ying Mao <ying.mao@elastic.co>
💚 Backport successful
This backport PR will be merged automatically after passing CI. |
* Instrument task manager * Don't refresh after SO updates * Update snapshot test, remove beforeRun instrumentation * Revert "Don't refresh after SO updates" This reverts commit 54f848d. * Fix task_store unit test * Adding tests and updating ConcreteTaskInstance interface with traceparent * Reverting unnecessary changes Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com> Co-authored-by: Ying Mao <ying.mao@elastic.co> Co-authored-by: Dario Gieselaar <dario.gieselaar@elastic.co> Co-authored-by: Ying Mao <ying.mao@elastic.co>
* master: (90 commits) Fix UI breaks on providing long search keyword in 'Search Box' (elastic#101385) Adds css class to EuiDescriptionListDescription in order to break word on exception details card (elastic#101481) [Lens] Increase timings for drag and drop tests (elastic#101380) [Lens] Fix editor react error on configuration panel (elastic#101367) [Fleet] Move integrations to a separate app (elastic#99848) Fix incorrect message displayed on importing Timeline Templates (elastic#101288) [Cases] RBAC (elastic#95058) [APM] Visual improvements for new APM layout with left navigation (elastic#101360) [master] More precise alerts matching (elastic#99820) [Lens] Value in legend (elastic#101353) Revert "[Reporting] ILM policy for managing reporting indices (elastic#100130)" (elastic#101358) [Discover] Fix header row of data grid in Firefox (elastic#101374) Add link to advanced setting in Discover (elastic#101154) Url service locators (elastic#101045) [Timelion] Update the removal message to mention the exact version (elastic#100994) [Security Solution][Detection Engine] Test cases for alias failure test cases where we don't copy aliases correctly (elastic#101437) [Event Log] Adding `type_id` to saved object array in event log (elastic#100939) [Reporting] Add `location.url` info to console message logs (elastic#101427) [Security Solutions][Detection Engine] Fixes timestamp bugs within source indexes when the formats are not ISO8601 format (elastic#101349) Improve Task Manager instrumentation (elastic#99160) ...
In #90955, I dialed down the cardinality of Task Manager transaction names due to the performance impact of breakdown metrics. Now that some performance issues are fixed in the APM Agent, and the need for better observability of the Task Manager is evident, it makes sense to revert those changes and add additional instrumentation.
The following changes were made:
run,markAvailableTasksAsClaimed,markTaskAsRunningruntransaction typesAdditionally, I've setmoved to #99919refreshtofalsein some situations, as by default the SO client will usewait_for, and wait on a shard refresh. That could unnecessarily delay task completion, and consume resources that could otherwise be freed up.For the screenshots, I disabled http/https instrumentation to prevent the http spans from showing up, and I locally worked around an instrumentation issue in the Node.js agent that messes up span relationships (see elastic/apm-agent-nodejs#1964 for a description of the problem).
Before:
After: