endpoint/restore: introduce metrics #43748

Merged
julianwiedmann merged 1 commit into cilium:main from mhofstetter:pr/mhofstetter/endpoint-restoration-metrics
Jan 16, 2026
Conversation

@mhofstetter mhofstetter commented Jan 14, 2026

In the past, the metric agent_bootstrap_seconds with label scope=restore was used to publish some endpoint restoration related metrics.

With the modularization of the legacy daemon init logic, some of the endpoint restoration logic no longer reported its duration to that metric.

In addition, the label scope=restore was too generic: other restoration logic (e.g. identity restoration) used the same label. The metric also did not account for the fact that the actual regeneration of the restored endpoints is executed asynchronously.

To shed more light on the endpoint restoration process, this commit introduces two metrics specific to endpoint restoration.

  • endpoint_restoration_endpoints - with labels phase and outcome
  • endpoint_restoration_duration_seconds - with label phase

The following phases of the endpoint restoration process report these metrics.

  • read_from_disk: Reads old endpoints from the state directory
  • restoration: Restores old endpoints, including validation and IP re-allocation
  • prepare_regeneration: Trigger asynchronous regeneration (and node endpoint restoration)
  • initial_policy_computation: Initial policy computation for all restored endpoints
  • regeneration: Regeneration of restored endpoints
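
The relationship between the two metrics and their labels can be pictured with a small in-memory model. This is only a sketch, not Cilium's implementation (which is written in Go): the `RestorationMetrics` class and its `timed` and `count` methods are hypothetical names chosen for illustration. It assumes one duration value per phase and one endpoint count per (phase, outcome) pair, which is what the label sets above suggest.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Phase names as listed in the PR description.
PHASES = (
    "read_from_disk",
    "restoration",
    "prepare_regeneration",
    "initial_policy_computation",
    "regeneration",
)


class RestorationMetrics:
    """Hypothetical in-memory stand-in for the two metrics (illustration only)."""

    def __init__(self):
        self.duration_seconds = {}         # phase -> seconds
        self.endpoints = defaultdict(int)  # (phase, outcome) -> endpoint count

    def _check(self, phase):
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")

    @contextmanager
    def timed(self, phase):
        """Record the wall-clock duration of one phase."""
        self._check(phase)
        start = time.monotonic()
        try:
            yield
        finally:
            self.duration_seconds[phase] = time.monotonic() - start

    def count(self, phase, outcome, n=1):
        """Add n endpoints with the given outcome to a phase."""
        self._check(phase)
        self.endpoints[(phase, outcome)] += n


metrics = RestorationMetrics()
with metrics.timed("read_from_disk"):
    # Pretend six endpoints were read successfully from the state directory.
    metrics.count("read_from_disk", "total", 6)
    metrics.count("read_from_disk", "successful", 6)
```

Keying the counts by (phase, outcome) mirrors how a labelled metric produces one series per label combination.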

e.g.

root@kind-control-plane:/home/cilium# cilium-dbg shell metrics endpoint_restoration
Metric                                         Labels                                                Value
cilium_endpoint_restoration_duration_seconds   phase=initial_policy_computation                      0.006262
cilium_endpoint_restoration_duration_seconds   phase=prepare_regeneration                            2.506187
cilium_endpoint_restoration_duration_seconds   phase=read_from_disk                                  0.002105
cilium_endpoint_restoration_duration_seconds   phase=regeneration                                    2.270423
cilium_endpoint_restoration_duration_seconds   phase=restoration                                     0.101983
cilium_endpoint_restoration_endpoints          outcome=failed phase=read_from_disk                   0.000000
cilium_endpoint_restoration_endpoints          outcome=failed phase=regeneration                     0.000000
cilium_endpoint_restoration_endpoints          outcome=failed phase=restoration                      1.000000
cilium_endpoint_restoration_endpoints          outcome=skipped phase=restoration                     1.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=initial_policy_computation   4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=prepare_regeneration         4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=read_from_disk               6.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=regeneration                 4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=restoration                  4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=initial_policy_computation        4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=prepare_regeneration              4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=read_from_disk                    6.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=regeneration                      4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=restoration                       6.000000
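
In the sample output, the restoration phase appears to account for every endpoint: 4 successful + 1 failed + 1 skipped = 6 total. The sketch below parses rows in the layout shown above and checks that arithmetic. The `parse_metrics` and `endpoints` helpers are hypothetical, and the relation `total = successful + failed + skipped` is an observation from this one sample, not a documented guarantee.

```python
from collections import defaultdict

# Excerpt of the sample output (restoration phase only).
SAMPLE = """\
Metric                                         Labels                                                Value
cilium_endpoint_restoration_endpoints          outcome=failed phase=restoration                      1.000000
cilium_endpoint_restoration_endpoints          outcome=skipped phase=restoration                     1.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=restoration                  4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=restoration                       6.000000
"""


def parse_metrics(text):
    """Parse 'metric  k=v [k=v ...]  value' rows into {metric: {labels: value}}."""
    parsed = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or "=" not in fields[1]:
            continue  # skip the header, shell prompt, and blank lines
        name, *labels, value = fields
        key = frozenset(tuple(label.split("=", 1)) for label in labels)
        parsed[name][key] = float(value)
    return parsed


def endpoints(parsed, phase, outcome):
    """Look up the endpoint count for one (phase, outcome) pair, defaulting to 0."""
    key = frozenset({("phase", phase), ("outcome", outcome)})
    return parsed["cilium_endpoint_restoration_endpoints"].get(key, 0.0)


parsed = parse_metrics(SAMPLE)
accounted = sum(
    endpoints(parsed, "restoration", outcome)
    for outcome in ("successful", "failed", "skipped")
)
print(accounted == endpoints(parsed, "restoration", "total"))  # prints True
```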

Related PR: #43348 (deprecation of metric agent_bootstrap_seconds) cc @gandro

@mhofstetter mhofstetter added kind/enhancement This would improve or streamline existing functionality. area/metrics Impacts statistics / metrics gathering, eg via Prometheus. release-note/misc This PR makes changes that have no direct user impact. area/modularization Relates to code modularization and maintenance. labels Jan 14, 2026
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/endpoint-restoration-metrics branch 3 times, most recently from fd118d1 to 96f1eb7 Compare January 14, 2026 12:43
@mhofstetter mhofstetter changed the title endpoint/restore: introduce hive metrics endpoint/restore: introduce metrics Jan 14, 2026
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/endpoint-restoration-metrics branch from 96f1eb7 to 22c5e7d Compare January 14, 2026 12:47
@mhofstetter
Member Author

/test

@mhofstetter mhofstetter added dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs and removed dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs labels Jan 14, 2026
@mhofstetter mhofstetter marked this pull request as ready for review January 14, 2026 13:21
@mhofstetter mhofstetter requested review from a team as code owners January 14, 2026 13:21
Member

@gandro gandro left a comment

Thank you!

In the past, the metric `agent_bootstrap_seconds` with label
`scope=restore` was used to publish some endpoint restoration
related metrics.

With the modularization of the legacy daemon init logic, some of the
endpoint restoration logic no longer reported its duration to that metric.

In addition, the label `scope=restore` was too generic: other restoration
logic (e.g. identity restoration) used the same label. The metric also did
not account for the fact that the actual regeneration of the restored
endpoints is executed asynchronously.

To shed more light on the endpoint restoration process, this
commit introduces two metrics specific to endpoint restoration.

* `endpoint_restoration_endpoints` - with labels `phase` and `outcome`
* `endpoint_restoration_duration_seconds` - with label `phase`

The following phases of the endpoint restoration process report these metrics.

* `read_from_disk`: Reads old endpoints from the state directory
* `restoration`: Restores old endpoints, including validation and IP re-allocation
* `prepare_regeneration`: Trigger asynchronous regeneration
* `initial_policy_computation`: Duration until the initial policy for all restored endpoints is computed
* `regeneration`: Duration until all restored endpoints are regenerated

e.g.

```
root@kind-control-plane:/home/cilium# cilium-dbg shell metrics endpoint_restoration
Metric                                         Labels                                                Value
cilium_endpoint_restoration_duration_seconds   phase=initial_policy_computation                      0.006262
cilium_endpoint_restoration_duration_seconds   phase=prepare_regeneration                            2.506187
cilium_endpoint_restoration_duration_seconds   phase=read_from_disk                                  0.002105
cilium_endpoint_restoration_duration_seconds   phase=regeneration                                    2.270423
cilium_endpoint_restoration_duration_seconds   phase=restoration                                     0.101983
cilium_endpoint_restoration_endpoints          outcome=failed phase=read_from_disk                   0.000000
cilium_endpoint_restoration_endpoints          outcome=failed phase=regeneration                     0.000000
cilium_endpoint_restoration_endpoints          outcome=failed phase=restoration                      1.000000
cilium_endpoint_restoration_endpoints          outcome=skipped phase=restoration                     1.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=initial_policy_computation   4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=prepare_regeneration         4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=read_from_disk               6.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=regeneration                 4.000000
cilium_endpoint_restoration_endpoints          outcome=successful phase=restoration                  4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=initial_policy_computation        4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=prepare_regeneration              4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=read_from_disk                    6.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=regeneration                      4.000000
cilium_endpoint_restoration_endpoints          outcome=total phase=restoration                       6.000000
```
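
For readers more familiar with scraped Prometheus series than with the `cilium-dbg shell` table layout, the rows above can be rewritten in the Prometheus text exposition format (`name{label="value",...} value`). The `exposition_line` helper is hypothetical, and the sketch assumes the agent exposes these series under the same `cilium_` prefix shown in the output. It does not escape special characters in label values, which the exposition format requires in general.

```python
def exposition_line(name, labels, value):
    """Format one sample as a Prometheus text exposition line (no label escaping)."""
    rendered = ",".join(f'{key}="{labels[key]}"' for key in sorted(labels))
    return f"{name}{{{rendered}}} {value:g}"


line = exposition_line(
    "cilium_endpoint_restoration_endpoints",
    {"phase": "restoration", "outcome": "failed"},
    1.0,
)
print(line)
# prints: cilium_endpoint_restoration_endpoints{outcome="failed",phase="restoration"} 1
```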

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
@mhofstetter mhofstetter force-pushed the pr/mhofstetter/endpoint-restoration-metrics branch from 22c5e7d to 56c67cf Compare January 15, 2026 14:58
@mhofstetter
Member Author

Rebased on top of main to resolve conflicts.

@mhofstetter
Member Author

/test

@mhofstetter
Member Author

I suggest backporting this to v1.19 for better analysis.

@mhofstetter mhofstetter added the needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch label Jan 15, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jan 16, 2026
@julianwiedmann julianwiedmann added this pull request to the merge queue Jan 16, 2026
Merged via the queue into cilium:main with commit ae1e694 Jan 16, 2026
76 checks passed
@giorio94 giorio94 mentioned this pull request Jan 19, 2026
8 tasks
@giorio94 giorio94 added backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. and removed needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Jan 19, 2026
@mhofstetter mhofstetter deleted the pr/mhofstetter/endpoint-restoration-metrics branch January 19, 2026 12:44
@github-actions github-actions bot added backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. and removed backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. labels Jan 19, 2026
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
Labels

area/metrics Impacts statistics / metrics gathering, eg via Prometheus. area/modularization Relates to code modularization and maintenance. backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.

7 participants