Is your feature request related to a problem? Please describe.
APM delivered Agent Configuration in Kibana as beta in 7.3, intentionally deferring some aspects for later phases. The goal of this issue is to pick up where we left off and deliver a more compelling feature towards GA. We are aiming for 7.5.
Two lines of work stand out:
- Provide support for more settings.
- Provide feedback to the user in Kibana.
Describe the solution you'd like
On the first point, ideally we add support for 2-3 more fields. Some reasonable candidates are:
- ACTIVE/RECORDING (depends on [agents] definition of ACTIVE/DISABLED_INSTRUMENTATION #92)
- CAPTURE_BODY
- METRICS_INTERVAL
- IGNORE_URLS
- TRANSACTION_MAX_SPANS
- SPAN_FRAMES_MIN_DURATION
On the second point, a good start would be to show in Kibana whether some configuration was applied by some agent or not. For this, we could use the Etags sent from the agents to know the last configuration value they successfully applied. More precisely:
- Kibana will be generating the Etags (either as content hashes or UIDs), instead of APM Server.
- APM Server will simply pass the Etags from agents to Kibana back and forth.
- Kibana will add a boolean field to the elasticsearch documents, initially false, and flip it to true when it receives an Etag from any one agent matching the Etag in the document.
- That boolean indicates to the UI that some agent is up to date. When the user changes some setting in Kibana, it resets the boolean.
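The flow above could be sketched roughly as follows. This is only an illustration of the idea, not Kibana's actual code: the document shape, field names (`applied_by_agent`), and function names are assumptions, and the Etag here is shown as a content hash (one of the two options mentioned above).

```typescript
// Hypothetical sketch: Kibana generates the Etag as a content hash of the
// settings, and flips a boolean when an agent reports a matching Etag back.
import { createHash } from "crypto";

interface AgentConfigDoc {
  settings: Record<string, string>;
  etag: string;
  applied_by_agent: boolean; // false until a matching Etag comes back
}

// Content-hash variant: any change to a setting yields a new Etag,
// which implicitly "resets" the applied flag for the new document.
function computeEtag(settings: Record<string, string>): string {
  const canonical = JSON.stringify(
    Object.keys(settings).sort().map((k) => [k, settings[k]])
  );
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}

// Called when APM Server forwards the Etag an agent reported on its poll.
function markApplied(doc: AgentConfigDoc, agentEtag: string): AgentConfigDoc {
  return agentEtag === doc.etag ? { ...doc, applied_by_agent: true } : doc;
}
```

With a UID-based Etag instead of a content hash, `computeEtag` would just generate a fresh identifier on every save; the `markApplied` comparison stays the same.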
This approach entails a feedback delay of up to 2 times the POLL_INTERVAL (up to one interval for the agent to fetch the new configuration, and up to another before its next poll reports the matching Etag back), which is acceptable.
One downside is that if some agents succeed and others fail, those failures would be silently ignored. This is (hopefully!) an unlikely scenario.
Another downside is that it is not possible for Server/Kibana to distinguish failure from missing (agent never querying) unless we keep track of how many agents are around. This would add significant complexity. A workaround is through documentation, warning users that if they don't see feedback within 2xPOLL_INTERVAL seconds it is probably because something went wrong.
Another option is that agents send the ephemeral_id upstream, and Kibana shows a count of agents that applied the last configuration based on the ephemeral ids (or maybe even show the ids themselves). This requires agents to calculate/generate their ephemeral_id. Alternatively, agents can use their IP address.
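If agent check-ins were indexed with their Etag and ephemeral_id, counting the agents on the latest configuration could be a simple cardinality aggregation. The index and field names below (`etag`, `agent.ephemeral_id`) are assumptions for illustration, not an existing schema:

```typescript
// Hypothetical Elasticsearch query body: count distinct ephemeral_ids that
// reported the current Etag, i.e. agents known to run the latest config.
const appliedAgentsQuery = (currentEtag: string) => ({
  size: 0, // we only need the aggregation, not the hits
  query: { term: { etag: currentEtag } },
  aggs: {
    agents_applied: { cardinality: { field: "agent.ephemeral_id" } },
  },
});
```

Showing the ids themselves would instead use a `terms` aggregation on the same field.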
Describe alternatives you've considered
We could come up with a way to aggregate data coming from different agents. For instance:
- If all agents "so far" report success, show a "success" visual indicator in the UI.
- If all agents "so far" report failure, show a "failed" visual indicator in the UI.
- If some report success and some report failure, show both indicators.
- We could further help users dig into which ones failed via the aforementioned ephemeral_id, IP address, or similar.
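The aggregation rule above amounts to a small decision function. The report shape and names here are hypothetical, since this alternative deliberately has no schema yet:

```typescript
// Illustrative sketch of the per-config indicator described above.
type AgentReport = { ephemeral_id: string; applied: boolean };

type Indicator = "success" | "failed" | "mixed" | "unknown";

function configIndicator(reports: AgentReport[]): Indicator {
  if (reports.length === 0) return "unknown"; // no agent has checked in yet
  const ok = reports.filter((r) => r.applied).length;
  if (ok === reports.length) return "success";
  if (ok === 0) return "failed";
  return "mixed"; // show both indicators
}

// For the "mixed"/"failed" cases: which agents to surface to the user.
function failingAgents(reports: AgentReport[]): string[] {
  return reports.filter((r) => !r.applied).map((r) => r.ephemeral_id);
}
```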
This would require agents to send data to apm-server (probably to the same endpoint) and define a new schema. A new schema would also allow us to show more information than a boolean, e.g. timestamp of config application, error messages, etc.
I suggest not introducing a new data model until we know more about how the feature is used, what problems users run into, how/why exactly agents might fail to apply configuration, etc.
We also need to decide if we want to support RUM in GA, later, or never. User feedback is probably not practical for the RUM agent.
Note that at the moment Agent Configuration in Kibana is unusable for customers with a Distributed Tracing setup, as the only available setting is SAMPLING_RATE, which in case of DT will be dictated by the RUM agent (almost always).
RUM status is tracked in elastic/apm-agent-rum-js#253
Finally, during the first design we briefly touched on auditing. Some sort of audit log could be achieved, e.g. by creating one elasticsearch document per update and adding information about the user that made the change. However, if we do not plan to add a UI for it, it might be enough to simply log configuration updates in Kibana.
Implementation issues
Kibana
Server
Agents
See background in #4 and #76