hive: add metrics for hive jobs (total, errors, duration) #42762
Merged
pippolo84 merged 1 commit into cilium:main on Nov 16, 2025
Force-pushed from 5298b24 to e5d4ba7.
As part of ongoing modularization, much of the old daemon initialization logic has been redesigned as Hive lifecycle hooks or Hive jobs. As a result, parts of the agent bootstrap metrics (`cilium_agent_bootstrap_seconds`) are gradually being lost.

Therefore, this commit adds metrics to hive jobs.

* `runs_total` (counter): Total number of runs
* `runs_failed` (counter): Number of failed runs (returned error)
* `oneshot_last_run_duration_seconds` (gauge): Duration of the last run of a oneshot job in seconds
* `observer_last_run_duration_seconds` (gauge): Duration of the last run of an observer job in seconds
* `timer_last_run_duration_seconds` (gauge): Duration of the last run of a timer job in seconds
* `observer_run_duration_seconds` (histogram): Duration of a run of an observer job in seconds
* `timer_run_duration_seconds` (histogram): Duration of a run of a timer job in seconds

IMO it does not make much sense to create a histogram for the oneshot jobs (even if retries were configured).

The metrics carry the labels `module_id` (hive cell) and `job_name`.
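To make the semantics of the counters and the last-run gauge concrete, here is a minimal sketch of how a single job run could be recorded. It is not the cilium/hive implementation: the `recorder`, `jobStats` and `jobKey` types, the `run` and `get` helpers, and `errBoom` are hypothetical names invented for illustration, and plain in-memory fields stand in for real metric objects.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errBoom is a hypothetical error used to simulate a failing run.
var errBoom = errors.New("boom")

// jobKey mirrors the two labels named in the description.
type jobKey struct {
	ModuleID string // label module_id (hive cell)
	JobName  string // label job_name
}

// jobStats is a hypothetical in-memory stand-in for the per-job metrics.
type jobStats struct {
	RunsTotal       uint64        // analogous to runs_total
	RunsFailed      uint64        // analogous to runs_failed
	LastRunDuration time.Duration // analogous to *_last_run_duration_seconds
}

// recorder is an illustrative helper, not part of cilium/hive.
type recorder struct {
	mu    sync.Mutex
	stats map[jobKey]*jobStats
}

func newRecorder() *recorder {
	return &recorder{stats: make(map[jobKey]*jobStats)}
}

// run executes fn once, then records the outcome and the elapsed time
// under the given labels. The error from fn is returned unchanged.
func (r *recorder) run(moduleID, jobName string, fn func() error) error {
	start := time.Now()
	err := fn()
	elapsed := time.Since(start)

	r.mu.Lock()
	defer r.mu.Unlock()
	key := jobKey{ModuleID: moduleID, JobName: jobName}
	s, ok := r.stats[key]
	if !ok {
		s = &jobStats{}
		r.stats[key] = s
	}
	s.RunsTotal++
	if err != nil {
		s.RunsFailed++
	}
	s.LastRunDuration = elapsed // overwritten on every run, like a gauge
	return err
}

// get returns a copy of the stats for one job (zero values if unknown).
func (r *recorder) get(moduleID, jobName string) jobStats {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.stats[jobKey{ModuleID: moduleID, JobName: jobName}]; ok {
		return *s
	}
	return jobStats{}
}

func main() {
	rec := newRecorder()
	_ = rec.run("nat-stats", "nat-stats", func() error { return nil })
	_ = rec.run("nat-stats", "nat-stats", func() error { return errBoom })
	s := rec.get("nat-stats", "nat-stats")
	fmt.Printf("runs_total=%d runs_failed=%d\n", s.RunsTotal, s.RunsFailed)
}
```

In this sketch a failed run still counts towards the total and still updates the last-run duration, which matches the description of `runs_failed` as a subset of `runs_total`.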
Example:

```
root@kind-worker:/home/cilium# cilium-dbg shell metrics hive_jobs
Metric   Labels   Value
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=auth-gc-identity-events module_id=auth   0.000010
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   0.000000
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=device-change-device-change-tracker module_id=bgp-control-plane   0.000062
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   0.000045
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=nat-map-next4 module_id=ct-nat-map-gc   0.000008
cilium_hive_jobs_observer_last_run_duration_seconds   job_name=nat-map-next6 module_id=ct-nat-map-gc   0.000011
cilium_hive_jobs_observer_run_duration_seconds   job_name=auth-gc-identity-events module_id=auth   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds   job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds   job_name=device-change-device-change-tracker module_id=bgp-control-plane   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds   job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds   job_name=nat-map-next4 module_id=ct-nat-map-gc   250µs / 450µs / 495µs
cilium_hive_jobs_observer_run_duration_seconds   job_name=nat-map-next6 module_id=ct-nat-map-gc   250µs / 450µs / 495µs
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=certloader-server-tls module_id=hubble   0.000886
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=cleanup module_id=maps-cleanup   8.345183
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=clustermesh-nodemanager-notifier module_id=clustermesh   0.000001
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=cni-deletion-queue module_id=endpoint-api   4.245811
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=enable-gc module_id=ct-nat-map-gc   8.473397
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=endpoint-cleanup module_id=stale-endpoint-cleanup   8.347281
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=hubble module_id=hubble   0.001395
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=ipset-init-finalizer module_id=ipset   0.007239
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=legacy-start module_id=daemon   4.125838
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=per-endpoint-route-initializer module_id=loader   8.514638
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=proxy-bootstrapper module_id=dns-proxy   1.651428
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=proxy-ports-restore module_id=l7-proxy   0.000208
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=release-local-identities module_id=identity-restoration   38.462400
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=start-reconciler module_id=loadbalancer-reconciler   0.501552
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=unlock-lockfile module_id=endpoint-api   4.119883
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=update-config-metric module_id=enabled-features   0.000158
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog   8.347350
cilium_hive_jobs_oneshot_last_run_duration_seconds   job_name=wait-for-endpoint-restore module_id=namemanager   8.520708
cilium_hive_jobs_runs_total   job_name=auth-gc-identity-events module_id=auth   4.000000
cilium_hive_jobs_runs_total   job_name=certloader-server-tls module_id=hubble   1.000000
cilium_hive_jobs_runs_total   job_name=cleanup module_id=maps-cleanup   1.000000
cilium_hive_jobs_runs_total   job_name=clustermesh-nodemanager-notifier module_id=clustermesh   1.000000
cilium_hive_jobs_runs_total   job_name=cni-deletion-queue module_id=endpoint-api   1.000000
cilium_hive_jobs_runs_total   job_name=default-gateway-route-change-tracker module_id=bgp-control-plane   42.000000
cilium_hive_jobs_runs_total   job_name=device-change-device-change-tracker module_id=bgp-control-plane   8.000000
cilium_hive_jobs_runs_total   job_name=enable-gc module_id=ct-nat-map-gc   1.000000
cilium_hive_jobs_runs_total   job_name=endpoint-cleanup module_id=stale-endpoint-cleanup   1.000000
cilium_hive_jobs_runs_total   job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog   1.000000
cilium_hive_jobs_runs_total   job_name=hubble module_id=hubble   1.000000
cilium_hive_jobs_runs_total   job_name=ipset-init-finalizer module_id=ipset   1.000000
cilium_hive_jobs_runs_total   job_name=k8s-secrets-resource-events-cilium-secrets module_id=envoy-proxy   1.000000
cilium_hive_jobs_runs_total   job_name=legacy-start module_id=daemon   1.000000
cilium_hive_jobs_runs_total   job_name=nat-map-next4 module_id=ct-nat-map-gc   117.000000
cilium_hive_jobs_runs_total   job_name=nat-map-next6 module_id=ct-nat-map-gc   26.000000
cilium_hive_jobs_runs_total   job_name=nat-stats module_id=nat-stats   2.000000
cilium_hive_jobs_runs_total   job_name=per-endpoint-route-initializer module_id=loader   1.000000
cilium_hive_jobs_runs_total   job_name=pressure-metric-throttle module_id=bwmap   1.000000
cilium_hive_jobs_runs_total   job_name=proxy-bootstrapper module_id=dns-proxy   1.000000
cilium_hive_jobs_runs_total   job_name=proxy-ports-checkpoint module_id=l7-proxy   1.000000
cilium_hive_jobs_runs_total   job_name=proxy-ports-restore module_id=l7-proxy   1.000000
cilium_hive_jobs_runs_total   job_name=release-local-identities module_id=identity-restoration   1.000000
cilium_hive_jobs_runs_total   job_name=start-reconciler module_id=loadbalancer-reconciler   1.000000
cilium_hive_jobs_runs_total   job_name=sync module_id=link-cache   2.000000
cilium_hive_jobs_runs_total   job_name=sync-userspace-and-datapath module_id=utime   1.000000
cilium_hive_jobs_runs_total   job_name=unlock-lockfile module_id=endpoint-api   1.000000
cilium_hive_jobs_runs_total   job_name=update-config-metric module_id=enabled-features   1.000000
cilium_hive_jobs_runs_total   job_name=wait-for-endpoint-restore module_id=ep-bpf-prog-watchdog   1.000000
cilium_hive_jobs_runs_total   job_name=wait-for-endpoint-restore module_id=namemanager   1.000000
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog   0.000570
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=nat-stats module_id=nat-stats   0.002096
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=pressure-metric-throttle module_id=bwmap   0.000002
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=proxy-ports-checkpoint module_id=l7-proxy   0.000721
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=sync module_id=link-cache   0.000352
cilium_hive_jobs_timer_last_run_duration_seconds   job_name=sync-userspace-and-datapath module_id=utime   0.000201
cilium_hive_jobs_timer_run_duration_seconds   job_name=ep-bpf-prog-watchdog module_id=ep-bpf-prog-watchdog   750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds   job_name=nat-stats module_id=nat-stats   1.75ms / 2.35ms / 2.485ms
cilium_hive_jobs_timer_run_duration_seconds   job_name=pressure-metric-throttle module_id=bwmap   250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds   job_name=proxy-ports-checkpoint module_id=l7-proxy   750µs / 950µs / 995µs
cilium_hive_jobs_timer_run_duration_seconds   job_name=sync module_id=link-cache   250µs / 450µs / 495µs
cilium_hive_jobs_timer_run_duration_seconds   job_name=sync-userspace-and-datapath module_id=utime   250µs / 450µs / 495µs
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
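The histogram rows in the example output above show three values per job, such as `250µs / 450µs / 495µs`. These look like p50 / p90 / p99 estimates obtained by linear interpolation inside a histogram bucket; that reading is an assumption, as are the bucket upper bounds of 0.5 ms, 1 ms and 2.5 ms used below, which are inferred from the output rather than stated in the source. The `bucket` type and `quantile` function are hypothetical helpers. Under those assumptions, a short sketch reproduces the `nat-stats` row (two runs, both landing between 1 ms and 2.5 ms):

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: Count observations were
// less than or equal to UpperBound (in seconds).
type bucket struct {
	UpperBound float64
	Count      uint64
}

// quantile estimates the q-quantile (0 <= q <= 1) by linear interpolation
// inside the bucket that contains the target rank. Buckets must be sorted
// by ascending upper bound and carry cumulative counts. The lower bound of
// the first bucket is assumed to be 0, and a +Inf bucket is not modelled.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 {
		return 0
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return 0
	}
	rank := q * float64(total)
	lower, prev := 0.0, uint64(0)
	for _, b := range buckets {
		if float64(b.Count) >= rank {
			inBucket := float64(b.Count - prev)
			if inBucket == 0 {
				return b.UpperBound
			}
			// Assume observations are spread evenly across the bucket.
			return lower + (b.UpperBound-lower)*(rank-float64(prev))/inBucket
		}
		lower, prev = b.UpperBound, b.Count
	}
	return buckets[len(buckets)-1].UpperBound
}

func main() {
	// Assumed bucket bounds; both observations fall into (1ms, 2.5ms].
	natStats := []bucket{{0.0005, 0}, {0.001, 0}, {0.0025, 2}}
	fmt.Printf("%.3fms / %.3fms / %.3fms\n",
		quantile(0.50, natStats)*1e3,
		quantile(0.90, natStats)*1e3,
		quantile(0.99, natStats)*1e3)
	// prints "1.750ms / 2.350ms / 2.485ms"
}
```

If this reading is right, it also explains why so many rows show the identical triple `250µs / 450µs / 495µs`: every run of those jobs would have finished within the smallest assumed bucket, so the estimate reflects the bucket width rather than the actual run times.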
Force-pushed from e5d4ba7 to 1c5feb8.
Member, Author: /test
mhofstetter added a commit to mhofstetter/cilium that referenced this pull request on Dec 15, 2025
This commit updates the documentation about hive job metrics.

* Initially introduced and documented with cilium#26077
* Deleted when cilium/hive has been extracted with cilium#32020
* Re-introduced via integration with cilium#42762 (not aware that they existed in the past)

Therefore this commit updates the docs to reflect the current implementation.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
This was referenced Dec 15, 2025
github-merge-queue bot pushed a commit that referenced this pull request on Dec 16, 2025
This commit updates the documentation about hive job metrics.

* Initially introduced and documented with #26077
* Deleted when cilium/hive has been extracted with #32020
* Re-introduced via integration with #42762 (not aware that they existed in the past)

Therefore this commit updates the docs to reflect the current implementation.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
As part of ongoing modularization, much of the old daemon initialization logic
has been redesigned as Hive lifecycle hooks or Hive jobs (for the asynchronous parts). As a result, parts
of the agent bootstrap metrics (`cilium_agent_bootstrap_seconds`) are gradually being lost.

Therefore, this commit adds metrics to hive jobs.

* `cilium_hive_jobs_runs_total` (counter): Total number of runs
* `cilium_hive_jobs_runs_failed` (counter): Number of failed runs (returned error)
* `cilium_hive_jobs_oneshot_last_run_duration_seconds` (gauge): Duration of the last run of a oneshot job in seconds (measured for the run that finished, whether successfully or with an error)
* `cilium_hive_jobs_observer_last_run_duration_seconds` (gauge): Duration of the last run of an observer job in seconds
* `cilium_hive_jobs_timer_last_run_duration_seconds` (gauge): Duration of the last run of a timer job in seconds
* `cilium_hive_jobs_observer_run_duration_seconds` (histogram): Duration of a run of an observer job in seconds
* `cilium_hive_jobs_timer_run_duration_seconds` (histogram): Duration of a run of a timer job in seconds

IMO it does not make much sense to create a histogram for the oneshot jobs
(even if retries were configured).

The metrics carry the labels `module_id` (hive cell) and `job_name`.

Example: see the `cilium-dbg shell metrics hive_jobs` output in the commit message above.
Note: this doesn't cover jobs that are created from an injected `job.Registry`. Covering them would require decorating the registry, which currently isn't possible because the API uses unexported types.
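To illustrate what decorating a registry means here, the following sketch wraps a registry so that every job created through it is counted when it runs. The `Registry` and `Job` interfaces are hypothetical and deliberately simplified; they are not the real `job.Registry` API, and `plainRegistry`, `plainJob` and `countingRegistry` are invented names.

```go
package main

import "fmt"

// Job and Registry are hypothetical, simplified stand-ins. The real
// job.Registry has a different shape.
type Job interface {
	Run() error
}

type Registry interface {
	NewJob(name string, fn func() error) Job
}

// plainRegistry is the undecorated implementation.
type plainRegistry struct{}

type plainJob struct{ fn func() error }

func (plainRegistry) NewJob(name string, fn func() error) Job { return plainJob{fn: fn} }
func (j plainJob) Run() error                                 { return j.fn() }

// countingRegistry decorates another Registry: it forwards job creation
// to the inner registry but wraps the job function so each run is counted.
// Not safe for concurrent use; kept minimal for illustration.
type countingRegistry struct {
	inner Registry
	runs  map[string]int
}

func (r *countingRegistry) NewJob(name string, fn func() error) Job {
	wrapped := func() error {
		r.runs[name]++
		return fn()
	}
	return r.inner.NewJob(name, wrapped)
}

func main() {
	reg := &countingRegistry{inner: plainRegistry{}, runs: map[string]int{}}
	job := reg.NewJob("sync", func() error { return nil })
	_ = job.Run()
	_ = job.Run()
	fmt.Println("runs of sync:", reg.runs["sync"])
}
```

The decorator compiles only because every type in the `NewJob` signature can be named by the wrapping code. In Go, a method whose parameters use an unexported type from another package cannot be declared outside that package, so an interface with such a method cannot be wrapped this way from the outside, which is the kind of obstacle the note describes.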