metrics: deprecate agent_bootstrap_seconds #43348

Merged
qmonnet merged 1 commit into cilium:main from mhofstetter:pr/mhofstetter/deprecate-bootstrap-stats on Dec 16, 2025
@mhofstetter
Member

@mhofstetter mhofstetter commented Dec 15, 2025

This commit deprecates the metric `agent_bootstrap_seconds`.

With the ongoing modularization of the legacy daemon initialization logic, the metric will eventually be deleted.

Most of the long-running bootstrap tasks are or will be replaced with hive jobs. It is therefore recommended to use the metric `cilium_hive_jobs_oneshot_last_run_duration_seconds` of the respective job instead. If this isn't enough, we will have to introduce specific metrics in the scope of the respective modules.
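As a rough illustration of reading the replacement metric (not part of this change), the sketch below scans Prometheus text-format output for `cilium_hive_jobs_oneshot_last_run_duration_seconds` and collects one duration per label set. The sample input, its `job` label, the job names, and the helper `oneShotDurations` are hypothetical, and the parser assumes samples carry no trailing timestamp.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// metricName is the replacement metric recommended in this PR.
const metricName = "cilium_hive_jobs_oneshot_last_run_duration_seconds"

// sample is made-up scrape output; the label name and job names are
// assumptions for illustration, not taken from a real agent.
const sample = `# TYPE cilium_hive_jobs_oneshot_last_run_duration_seconds gauge
cilium_hive_jobs_oneshot_last_run_duration_seconds{job="example-restore"} 0.25
cilium_hive_jobs_oneshot_last_run_duration_seconds{job="example-sync"} 1.5
cilium_other_metric 3
`

// oneShotDurations returns the value of every sample of metricName,
// keyed by the raw label set exactly as it appears in the input.
func oneShotDurations(exposition string) map[string]float64 {
	out := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricName) {
			continue // comment lines and unrelated metrics
		}
		rest := line[len(metricName):]
		// Reject metrics that merely share the prefix.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		// Assumption: no timestamp, so the value follows the last space.
		i := strings.LastIndex(rest, " ")
		if i < 0 {
			continue
		}
		v, err := strconv.ParseFloat(rest[i+1:], 64)
		if err != nil {
			continue
		}
		out[strings.TrimSpace(rest[:i])] = v
	}
	return out
}

func main() {
	for labels, seconds := range oneShotDurations(sample) {
		fmt.Printf("%s took %.3fs\n", labels, seconds)
	}
}
```

Keying on the raw label set keeps the sketch independent of the metric's actual label names, which should be checked against the hive job metrics documentation.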

Short-running tasks are or will be handled directly in hive lifecycle start hooks and are reported in the log if they exceed a given maximum duration (`--hive-log-threshold`, default 100ms).

Related PR that updates the hive job metrics documentation: #43347

@mhofstetter mhofstetter added the following labels on Dec 15, 2025:
- area/documentation: Impacts the documentation, including textual changes, sphinx, or other doc generation code.
- area/metrics: Impacts statistics / metrics gathering, e.g. via Prometheus.
- release-note/misc: This PR makes changes that have no direct user impact.
- area/modularization: Relates to code modularization and maintenance.
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/deprecate-bootstrap-stats branch from 350f7c2 to 2a2cc3b Compare December 15, 2025 15:17
This commit deprecates the metric `agent_bootstrap_seconds`.

With the ongoing modularization of the legacy daemon initialization logic - the
metric will eventually be deleted.

Most of the long-running bootstrap tasks are/will be replaced with hive jobs.
Therefore it's recommended to use the metric `cilium_hive_jobs_oneshot_last_run_duration_seconds`
of the respective job instead. If this isn't enough we have to introduce specific
metrics in the scope of the respective modules.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/deprecate-bootstrap-stats branch from 2a2cc3b to 5503f04 Compare December 15, 2025 15:20
@mhofstetter
Member Author

/test

@mhofstetter mhofstetter marked this pull request as ready for review December 15, 2025 15:25
@mhofstetter mhofstetter requested review from a team as code owners December 15, 2025 15:25
Member

@gandro gandro left a comment

ACK for "Hubble Metrics" as they are unaffected.

I do understand that cilium_agent_bootstrap_seconds is now broken anyhow, so no point in artificially keeping it alive.

But I do worry a bit that the recommended alternative is very low-level for users to understand and not really a stable API. Jobs can and will change between releases, whereas the "phases" in `agent_bootstrap_seconds` are more conceptual.

I think an ideal alternative metric would keep the concept of various phases, like "local node information received", "endpoint regenerated", etc., and expose them as metrics on which users can set up alerts.

@mhofstetter
Member Author

> I do understand that cilium_agent_bootstrap_seconds is now broken anyhow, so no point in artificially keeping it alive.
>
> But I do worry a bit that the recommended alternative is very low-level for users to understand and not really a stable API. Jobs can and will change between releases, whereas the "phases" in agent_bootstrap_seconds are more conceptual.
>
> I think an ideal alternative metric would keep the concept of various phases, like "local node information received", "endpoint regenerated" etc and expose them as metrics on which users can set up alerts.

Yep, some of these phases were already deleted while extracting functionality. The metric isn't "broken" yet (aside from the missing phases/labels), but eventually we will delete it because no code will be left in `configureDaemon`.

IMO most of these phases were/are quite generic or meaningless anyway (`overall`, `earlyInit`, `k8sInit`, `bpfBase`, `daemonInit`, ...). Even `cleanup` doesn't say which "subsystem" is restored (e.g. endpoints, identity, ...). That's why it probably wasn't, and still isn't, worth migrating them 1:1.

But I agree that it would be great to eventually come up with some better and more specific metrics for the critical and time-consuming phases 👍

@mhofstetter
Member Author

And having the low-level, technical metrics at least helps us see and decide where more specific or higher-level metrics are needed.

For the same reason, I'm thinking about adding a metric for the overall hive start time (all start hooks).

@qmonnet qmonnet added this pull request to the merge queue Dec 16, 2025
Merged via the queue into cilium:main with commit 1966853 Dec 16, 2025
83 checks passed
@mhofstetter mhofstetter deleted the pr/mhofstetter/deprecate-bootstrap-stats branch December 16, 2025 10:38
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
4 participants