endpoint/restore: introduce metrics#43748
Merged
julianwiedmann merged 1 commit into cilium:main on Jan 16, 2026
Conversation
Member
Author
/test
In the past, the metric `agent_bootstrap_seconds` with label `scope=restore` was used to publish some endpoint restoration related metrics. With the modularization of the legacy daemon init logic, some of the endpoint restoration logic no longer reported its duration to that metric.

In addition, the label `scope=restore` was fairly generic, because other restoration logic used the same label (e.g. identity restoration). The metric also didn't take into account that the actual regeneration of the restored endpoints is executed asynchronously.

To shed more light on the endpoint restoration process, this commit introduces two endpoint restoration specific metrics.

* `endpoint_restoration_endpoints` - with labels `phase` & `outcome`
* `endpoint_restoration_duration_seconds` - with label `phase`

The following phases of the endpoint restoration process report the metrics.

* `read_from_disk`: Reads old endpoints from the state dir
* `restoration`: Restores old endpoints, including validation & IP re-allocation
* `prepare_regeneration`: Triggers asynchronous regeneration
* `initial_policy_computation`: Duration until the initial policy for all restored endpoints is computed
* `regeneration`: Duration until all restored endpoints are regenerated

e.g.
```
root@kind-control-plane:/home/cilium# cilium-dbg shell metrics endpoint_restoration
Metric                                        Labels                                               Value
cilium_endpoint_restoration_duration_seconds  phase=initial_policy_computation                     0.006262
cilium_endpoint_restoration_duration_seconds  phase=prepare_regeneration                           2.506187
cilium_endpoint_restoration_duration_seconds  phase=read_from_disk                                 0.002105
cilium_endpoint_restoration_duration_seconds  phase=regeneration                                   2.270423
cilium_endpoint_restoration_duration_seconds  phase=restoration                                    0.101983
cilium_endpoint_restoration_endpoints         outcome=failed phase=read_from_disk                  0.000000
cilium_endpoint_restoration_endpoints         outcome=failed phase=regeneration                    0.000000
cilium_endpoint_restoration_endpoints         outcome=failed phase=restoration                     1.000000
cilium_endpoint_restoration_endpoints         outcome=skipped phase=restoration                    1.000000
cilium_endpoint_restoration_endpoints         outcome=successful phase=initial_policy_computation  4.000000
cilium_endpoint_restoration_endpoints         outcome=successful phase=prepare_regeneration        4.000000
cilium_endpoint_restoration_endpoints         outcome=successful phase=read_from_disk              6.000000
cilium_endpoint_restoration_endpoints         outcome=successful phase=regeneration                4.000000
cilium_endpoint_restoration_endpoints         outcome=successful phase=restoration                 4.000000
cilium_endpoint_restoration_endpoints         outcome=total phase=initial_policy_computation       4.000000
cilium_endpoint_restoration_endpoints         outcome=total phase=prepare_regeneration             4.000000
cilium_endpoint_restoration_endpoints         outcome=total phase=read_from_disk                   6.000000
cilium_endpoint_restoration_endpoints         outcome=total phase=regeneration                     4.000000
cilium_endpoint_restoration_endpoints         outcome=total phase=restoration                      6.000000
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
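In the sample output above, the `outcome=total` value of each phase happens to equal the sum of the `successful`, `failed` and `skipped` values (e.g. 4 + 1 + 1 = 6 for `restoration`). The commit message does not state this as a guarantee, so the following is only a minimal sketch of how one might check it against captured output. It assumes the whitespace-separated `Metric Labels Value` layout shown above; `countsByPhase` and `consistent` are hypothetical helper names, not part of Cilium.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is the `restoration` subset of the output shown above.
const sample = `cilium_endpoint_restoration_endpoints outcome=failed phase=restoration 1.000000
cilium_endpoint_restoration_endpoints outcome=skipped phase=restoration 1.000000
cilium_endpoint_restoration_endpoints outcome=successful phase=restoration 4.000000
cilium_endpoint_restoration_endpoints outcome=total phase=restoration 6.000000`

// countsByPhase (hypothetical helper) parses endpoint-count lines into
// phase -> outcome -> value. Lines of any other shape are ignored.
func countsByPhase(out string) (map[string]map[string]float64, error) {
	res := map[string]map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		// Expect: metric name, two labels, value.
		if len(f) != 4 || f[0] != "cilium_endpoint_restoration_endpoints" {
			continue
		}
		labels := map[string]string{}
		for _, kv := range f[1:3] {
			if k, v, ok := strings.Cut(kv, "="); ok {
				labels[k] = v
			}
		}
		val, err := strconv.ParseFloat(f[3], 64)
		if err != nil {
			return nil, err
		}
		phase := labels["phase"]
		if res[phase] == nil {
			res[phase] = map[string]float64{}
		}
		res[phase][labels["outcome"]] = val
	}
	return res, sc.Err()
}

// consistent reports whether total == successful + failed + skipped.
// Outcomes that were not reported for a phase count as zero.
func consistent(c map[string]float64) bool {
	return c["total"] == c["successful"]+c["failed"]+c["skipped"]
}

func main() {
	counts, err := countsByPhase(sample)
	if err != nil {
		panic(err)
	}
	for phase, c := range counts {
		fmt.Printf("%s: total=%v consistent=%v\n", phase, c["total"], consistent(c))
	}
}
```

Feeding the full captured output instead of `sample` works the same way, since the duration lines and the header have a different number of columns and are skipped.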
Member
Author
|
rebased on top of |
Member
Author
/test
Member
Author
I suggest backporting this to v1.19 for better analysis.
qmonnet
approved these changes
Jan 15, 2026
chancez
approved these changes
Jan 16, 2026
In the past, the metric `agent_bootstrap_seconds` with label `scope=restore` was used to publish some endpoint restoration related metrics. With the modularization of the legacy daemon init logic, some of the endpoint restoration logic no longer reported its duration to that metric.

In addition, the label `scope=restore` was fairly generic, because other restoration logic used the same label (e.g. identity restoration). The metric also didn't take into account that the actual regeneration of the restored endpoints is executed asynchronously.

To shed more light on the endpoint restoration process, this commit introduces two endpoint restoration specific metrics.

* `endpoint_restoration_endpoints` - with labels `phase` & `outcome`
* `endpoint_restoration_duration_seconds` - with label `phase`

The following phases of the endpoint restoration process report the metrics.

* `read_from_disk`: Reads old endpoints from the state dir
* `restoration`: Restores old endpoints, including validation & IP re-allocation
* `prepare_regeneration`: Triggers asynchronous regeneration (and node endpoint restoration)
* `initial_policy_computation`: Initial policy computation for all restored endpoints
* `regeneration`: Regeneration of restored endpoints
Related PR: #43348 (deprecation of metric `agent_bootstrap_seconds`)

cc @gandro
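To illustrate the label model described above (a per-phase duration plus per-phase, per-outcome endpoint counts), here is a minimal stdlib-only sketch. It is not the code from this PR: `restorationMetrics`, `endpointKey` and `observePhase` are hypothetical names, the in-memory maps stand in for the real metrics registry, and treating the values as "set once per phase" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// endpointKey models the label pair of endpoint_restoration_endpoints.
type endpointKey struct{ phase, outcome string }

// restorationMetrics is a hypothetical in-memory stand-in for the two metrics.
type restorationMetrics struct {
	endpoints map[endpointKey]float64 // endpoint_restoration_endpoints{phase,outcome}
	duration  map[string]float64      // endpoint_restoration_duration_seconds{phase}
}

func newRestorationMetrics() *restorationMetrics {
	return &restorationMetrics{
		endpoints: map[endpointKey]float64{},
		duration:  map[string]float64{},
	}
}

// observePhase runs fn, records how long it took under the given phase,
// and records the endpoint counts fn reports, plus their sum as "total".
func (m *restorationMetrics) observePhase(phase string, fn func() (successful, failed, skipped int)) {
	start := time.Now()
	s, f, k := fn()
	m.duration[phase] = time.Since(start).Seconds()
	m.endpoints[endpointKey{phase, "successful"}] = float64(s)
	m.endpoints[endpointKey{phase, "failed"}] = float64(f)
	m.endpoints[endpointKey{phase, "skipped"}] = float64(k)
	m.endpoints[endpointKey{phase, "total"}] = float64(s + f + k)
}

func main() {
	m := newRestorationMetrics()
	// Counts mirror the sample output from the commit message.
	m.observePhase("read_from_disk", func() (int, int, int) { return 6, 0, 0 })
	m.observePhase("restoration", func() (int, int, int) { return 4, 1, 1 })

	for _, phase := range []string{"read_from_disk", "restoration"} {
		fmt.Printf("phase=%s total=%v successful=%v\n",
			phase,
			m.endpoints[endpointKey{phase, "total"}],
			m.endpoints[endpointKey{phase, "successful"}])
	}
}
```

Keeping `phase` as a label on both metrics means the asynchronous phases (`initial_policy_computation`, `regeneration`) can be reported whenever they complete, independently of the synchronous ones.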