metrics: deprecate agent_bootstrap_seconds #43348
This commit deprecates the metric `agent_bootstrap_seconds`.

With the ongoing modularization of the legacy daemon initialization logic - the metric will eventually be deleted.

Most of the long-running bootstrap tasks are/will be replaced with hive jobs. Therefore it's recommended to use the metric `cilium_hive_jobs_oneshot_last_run_duration_seconds` of the respective job instead. If this isn't enough we have to introduce specific metrics in the scope of the respective modules.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
/test
gandro
left a comment
ACK for "Hubble Metrics" as they are unaffected.
I do understand that cilium_agent_bootstrap_seconds is now broken anyhow, so no point in artificially keeping it alive.
But I do worry a bit that the recommended alternative is very low-level for users to understand and not really a stable API. Jobs can and will change between releases, whereas the "phases" in agent_bootstrap_seconds are more conceptual.
I think an ideal alternative metric would keep the concept of various phases, like "local node information received", "endpoint regenerated" etc and expose them as metrics on which users can set up alerts.
Yep, some of these phases were already deleted while extracting functionality. The metric isn't "broken" yet (aside from the missing phases/labels), but eventually we will delete it because no code will be left in […]. IMO most of these phases were/are quite generic/meaningless anyway ([…]). But I agree that it would be great to eventually come up with some better and more specific metrics for the critical and time-consuming phases 👍

And having the low-level / technical metrics at least helps to see & decide where more specific/higher-level metrics are needed. For the same reason I'm thinking about adding a metric for the overall hive start time (all start hooks).
This commit deprecates the metric `agent_bootstrap_seconds`.

With the ongoing modularization of the legacy daemon initialization logic - the metric will eventually be deleted.

Most of the long-running bootstrap tasks are/will be replaced with hive jobs. Therefore it's recommended to use the metric `cilium_hive_jobs_oneshot_last_run_duration_seconds` of the respective job instead. If this isn't enough we have to introduce specific metrics in the scope of the respective modules.

Short-running tasks are/will be handled directly in hive lifecycle start hooks and reported in the log if they exceed a given max time (`--hive-log-threshold` (default 100ms)).

Related PR that updates the hive job metrics documentation: #43347