[APM] Collect telemetry about data/API performance#51612
[APM] Collect telemetry about data/API performance#51612dgieselaar merged 3 commits intoelastic:masterfrom
Conversation
07a0d9b to
0d39b5a
Compare
💔 Build Failed
|
|
Spoke to the telemetry folks, we should wrap this in a task that stores data to saved objects (similar to what we had), and we should investigate more aggressive pre-processing on our side (e.g., min/max/avg/percentiles) |
0d39b5a to
d83b37e
Compare
💔 Build Failed
|
d83b37e to
a2d96cc
Compare
💔 Build Failed
|
a2d96cc to
caee49b
Compare
|
Pinging @elastic/apm-ui (Team:apm) |
x-pack/legacy/plugins/apm/scripts/upload-telemetry-data/download-telemetry-mapping.ts
Outdated
Show resolved
Hide resolved
TinaHeiligers
left a comment
There was a problem hiding this comment.
I'm concerned over the current implementation, specifically about:
- writing directly to the
xpack-phone-homeindex - using a dynamic mapping for the
apmdata that's being overwritten in thexpack-phone-homeindex which is, on purpose, not mapped dynamically. - At the moment, the
xpack-home-phonetemplate is updated when new mappings are added - Adding yet another collector
I'd like some other members of the Pulse team to take a look before specifically approving/requesting changes and, while I realize the target is for 7.7, I hope you understand the concerns.
There was a problem hiding this comment.
Dynamic mappings aren't set within the current telemetry service. The properties have to be defined explicitly.
There was a problem hiding this comment.
As far as I understand these mappings are for the SavedObjects. It should not cause any effect on the telemetry service. But it's true I think it's considered an antipattern the dynamic: true mappings in the SavedObjects as well (maybe the @elastic/kibana-platform team can confirm).
@dgieselaar mentioned it was dynamic only during the development phase. Just please don't forget to change this to the actual mapping when done :)
There was a problem hiding this comment.
The reason for having dynamic:true was given in a Slack message:
...the dynamic mapping is temporary. Once the data that we are collecting is finalized I was planning on adding a static mapping
There was a problem hiding this comment.
The mapping for saved objects is no longer dynamic, thanks!
x-pack/legacy/plugins/apm/scripts/upload-telemetry-data/index.ts
Outdated
Show resolved
Hide resolved
afharo
left a comment
There was a problem hiding this comment.
Overall, it looks great! I only added some FYIs and nits.
The only main thing I'd personally don't do here is the upload-telemetry-data script. Unless there's a reason to have it I can't think of at the moment.
There was a problem hiding this comment.
| export function isAgentName(agentName: string): boolean { | |
| export function isAgentName(agentName: string): agentName is AgentName { |
There was a problem hiding this comment.
| ): agentName is AgentName { | |
| ): agentName is 'js-base' | 'rum-js' { |
There was a problem hiding this comment.
As far as I understand these mappings are for the SavedObjects. It should not cause any effect on the telemetry service. But it's true I think it's considered an antipattern the dynamic: true mappings in the SavedObjects as well (maybe the @elastic/kibana-platform team can confirm).
@dgieselaar mentioned it was dynamic only during the development phase. Just please don't forget to change this to the actual mapping when done :)
x-pack/legacy/plugins/apm/scripts/upload-telemetry-data/index.ts
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
I love this taskManager approach to get all the stats ready to consume from the savedObject at any point when the usage collector requires it.
Especially taking into account all the requests we do to retrieve this telemetry. That's a good amount of information! Well done! 👏
Just FYI, we usually report telemetry every 24h so, maybe running the collection logic every minute is a bit too much? ;)
There was a problem hiding this comment.
changed this to 24h - another artifact of "dev mode" 😅
There was a problem hiding this comment.
btw, let me know if it should be anything else than 24h
There was a problem hiding this comment.
Changed it to 720m (12 hours, the task manager only supports seconds and minutes)
There was a problem hiding this comment.
There is a risk we might delete existing telemetry because of
https://github.com/elastic/kibana/blob/caee49b307bf9a6c8c1c146b1a7c35fd044fd894/x-pack/legacy/plugins/apm/server/lib/apm_telemetry/collect_data_telemetry/index.ts#L64-L68
But the chances should be low so I wouldn't worry too much about it (just a heads up) ;)
There was a problem hiding this comment.
It's amazing to have this! This will help us much to create the mapping in the telemetry cluster! Thanks for being so thoughtful and strict with the types! ❤️
x-pack/plugins/apm/kibana.json
Outdated
There was a problem hiding this comment.
the recommendation is to import this plugin as optional:
https://github.com/elastic/kibana/blob/master/src/plugins/usage_collection/README.md#new-platform
From a Slack message:
|
5b0c188 to
2197df8
Compare
65f17d2 to
55355a6
Compare
There was a problem hiding this comment.
thank you! this is so awesome will be so useful when we add the mapping in the telemetry cluster. Please refer to this file when requesting it :)
There was a problem hiding this comment.
I wonder if the removal of the "null_value" might cause any issues?
Feel free to dismiss this if this change is intended :)
There was a problem hiding this comment.
good catch - I don't think we need a fallback anymore but I'll put it there to prevent any backwards compat issues.
There was a problem hiding this comment.
Maybe add a description to this script about what it does
There was a problem hiding this comment.
aye, I could do it in this file, in dev_docs or in a README.md file in this folder, any preferences?
There was a problem hiding this comment.
Why is githubToken not required? If the repo is private I'd expect this to require a token.
There was a problem hiding this comment.
Yeah, I'll just fail the process in that case. Can't remember why I made it optional.
There was a problem hiding this comment.
Perhaps it would be cleaner with safeLoad done separately?
Also: shouldn't elasticsearch.hosts be in the explicit type?
There was a problem hiding this comment.
I'll split it up a bit. elasticsearch.hosts is already in the type because it has a default value, but it doesn't hurt to have it in both, I'll add it there as well (it looks odd).
There was a problem hiding this comment.
I think SERVICE_AGENT_NAME and SERVICE_AGENT_VERSION should be called AGENT_NAME and AGENT_VERSION since they are no longer on the "service" context (they used to be afair)
There was a problem hiding this comment.
ah yeah was wondering why that was the case. I'll update
x-pack/plugins/apm/server/plugin.ts
Outdated
There was a problem hiding this comment.
Why no longer in the context of 'apm'?
There was a problem hiding this comment.
it's already scoped, so scoping it again leads to [apm] [apm] output
642543f to
cd8d753
Compare
cd8d753 to
72b7ae9
Compare
| serviceMapEnabled: Joi.boolean().default(true), | ||
|
|
||
| // telemetry | ||
| telemetryCollectionEnabled: Joi.boolean().default(true) |
There was a problem hiding this comment.
do we want a specific flag to disable APM-only telemetry (while the rest of the telemetry is enabled)? Wouldn't that complicate the analysis about the usage? aka: how can I tell APM is used but not reporting telemetry because it's disabled?
We could do that by checking the application_usage.apm stats... but I'd like @alexfrancoeur thoughts on this.
There was a problem hiding this comment.
Would be great to get some additional feedback 👍🏻the reason why I've added it because we want an escape hatch in case the queries are too expensive, without requiring our users to disable telemetry altogether in those cases.
There was a problem hiding this comment.
@afharo If you don't mind, I'll merge this in given the time constraints, but let me know if there's anything we need to change.
💚 Build SucceededHistory
To update your PR or re-run it, just comment with: |
spalger
left a comment
There was a problem hiding this comment.
yarn.lock config changes LGTM
|
This was causing type check failures in master once it was merged so it was reverted in 6de7f2a |
* master: (34 commits) [APM] add service map config options to legacy plugin (elastic#61002) [App Arch] migrate legacy CSS to new platform (core_plugins/kibana_react) (elastic#59882) Migrated styles for "share" plugin to new platform (elastic#59981) [ML] Module setup with dynamic model memory estimation (elastic#60656) Drilldowns (elastic#59632) Upgrade mocha dev-dependency from 6.2.2 to 7.1.1 (elastic#60779) [SIEM] Overview: Recent cases widget (elastic#60993) [ML] Functional tests - stabilize df analytics clone tests (elastic#60497) [SIEM] Updates process and TLS tables to use ECS 1.5 fields (elastic#60854) Migrate doc view part of discover (elastic#58094) Revert "[APM] Collect telemetry about data/API performance (elastic#51612)" fix(NA): log rotation watchers usage (elastic#60956) [SIEM] [CASES] Build lego blocks case details view (elastic#60864) Create Painless Lab app (elastic#57538) [SIEM] Move Timeline Template field to first step of rule creation (elastic#60840) [Reporting/New Platform Migration] Use a new config service on server-side (elastic#55882) [Alerting] allow email action to not require auth (elastic#60839) [Maps] Default ES document layer scaling type to clusters and show scaling UI in the create wizard (elastic#60668) [APM] Collect telemetry about data/API performance (elastic#51612) Implement Kibana Login Selector (elastic#53010) ...
…lastic#51612)"" This reverts commit 6de7f2a.
…ic#61030) * Revert "Revert "[APM] Collect telemetry about data/API performance (elastic#51612)"" This reverts commit 6de7f2a. * Update transaction mock data to reflect the type
Closes #50757.
An example of collected telemetry:
{ "counts": { "error": { "1d": 162227, "all": 6809905 }, "metric": { "1d": 975978, "all": 48222846 }, "span": { "1d": 10592884, "all": 483896330 }, "transaction": { "1d": 1950203, "all": 92890244 }, "onboarding": { "1d": 0, "all": 7 }, "sourcemap": { "1d": 0, "all": 0 }, "agent_configuration": { "all": 3 }, "max_error_groups_per_service": { "1d": 1838 }, "max_transaction_groups_per_service": { "1d": 25 }, "traces": { "1d": 1483642 }, "services": { "1d": 8 } }, "retainment": { "error": { "ms": 4445241078 }, "metric": { "ms": 4445248747 }, "span": { "ms": 4445259937 }, "transaction": { "ms": 4445275561 }, "onboarding": { "ms": 4445288673 } }, "tasks": { "processor_events": { "took": { "ms": 32468 } }, "agent_configuration": { "took": { "ms": 19 } }, "services": { "took": { "ms": 3339 } }, "versions": { "took": { "ms": 102 } }, "groupings": { "took": { "ms": 3462 } }, "integrations": { "took": { "ms": 51 } }, "agents": { "took": { "ms": 12967 } }, "indices_stats": { "took": { "ms": 71 } } }, "has_any_services": true, "services_per_agent": { "java": 1, "js-base": 1, "rum-js": 0, "dotnet": 1, "go": 2, "nodejs": 1, "python": 1, "ruby": 1 }, "versions": { "apm_server": { "major": 8, "minor": 0, "patch": 0 } }, "integrations": { "ml": { "all_jobs_count": 1 } }, "agents": { "java": { "agent": { "version": [ "1.13.1-SNAPSHOT" ] }, "service": { "framework": { "name": [], "version": [], "composite": [] }, "language": { "name": [ "Java" ], "version": [ "10.0.2" ], "composite": [ "Java/10.0.2" ] }, "runtime": { "name": [ "Java" ], "version": [ "10.0.2" ], "composite": [ "Java/10.0.2" ] } } }, "js-base": { "agent": { "version": [ "4.9.0" ] }, "service": { "framework": { "name": [], "version": [], "composite": [] }, "language": { "name": [ "javascript" ], "version": [], "composite": [] }, "runtime": { "name": [], "version": [], "composite": [] } } }, "rum-js": { "agent": { "version": [] }, "service": { "framework": { "name": [], "version": [], "composite": [] }, "language": { "name": [], "version": [], "composite": [] }, "runtime": { "name": [], "version": [], "composite": [] } } }, "dotnet": { "agent": { "version": [ "1.3.1" ] }, "service": { "framework": { "name": [ "ASP.NET Core" ], "version": [ "2.2.0.0" ], "composite": [ "ASP.NET Core/2.2.0.0" ] }, "language": { "name": [ "C#" ], "version": [], "composite": [] }, "runtime": { "name": [ ".NET Core" ], "version": [ "2.2.8" ], "composite": [ ".NET Core/2.2.8" ] } } }, "go": { "agent": { "version": [ "1.7.0", "1.7.1" ] }, "service": { "framework": { "name": [ "gin" ], "version": [ "v1.4.0" ], "composite": [ "gin/v1.4.0" ] }, "language": { "name": [ "go" ], "version": [ "go1.13.8", "go1.14" ], "composite": [ "go/go1.13.8", "go/go1.14" ] }, "runtime": { "name": [ "gc" ], "version": [ "go1.13.8", "go1.14" ], "composite": [ "gc/go1.13.8", "gc/go1.14" ] } } }, "nodejs": { "agent": { "version": [ "3.5.0" ] }, "service": { "framework": { "name": [ "express" ], "version": [ "4.17.1" ], "composite": [ "express/4.17.1" ] }, "language": { "name": [ "javascript" ], "version": [], "composite": [] }, "runtime": { "name": [ "node" ], "version": [ "12.16.1" ], "composite": [ "node/12.16.1" ] } } }, "python": { "agent": { "version": [ "5.4.3" ] }, "service": { "framework": { "name": [ "django" ], "version": [ "2.1.13" ], "composite": [ "django/2.1.13" ] }, "language": { "name": [ "python" ], "version": [ "3.6.10" ], "composite": [ "python/3.6.10" ] }, "runtime": { "name": [ "CPython" ], "version": [ "3.6.10" ], "composite": [ "CPython/3.6.10" ] } } }, "ruby": { "agent": { "version": [ "3.5.0" ] }, "service": { "framework": { "name": [ "Ruby on Rails" ], "version": [ "5.2.4.1" ], "composite": [ "Ruby on Rails/5.2.4.1" ] }, "language": { "name": [ "ruby" ], "version": [ "2.6.5" ], "composite": [ "ruby/2.6.5" ] }, "runtime": { "name": [ "ruby" ], "version": [ "2.6.5" ], "composite": [ "ruby/2.6.5" ] } } } }, "indices": { "shards": { "total": 24 }, "all": { "total": { "docs": { "count": 1135679324 }, "store": { "size_in_bytes": 570622947437 } } } } }