node/manager: prevent MeshNodeSync from pruning cluster (local) nodes by 0xch4z · Pull Request #41039 · cilium/cilium

0xch4z · 2025-08-08T19:39:25Z

Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes.

Currently, with clustermesh, when all remote clusters have finished initial synchronization, MeshNodeSync is called, which considers all mesh and non-mesh nodes for pruning. If this happens before we finish listing initial cluster (local) nodes, we could end up considering the restored cluster nodes as stale for pruning. This also causes some miscounting for metrics, where for example we decrement nodes_all_num (in NodeDeleted for the synthetic delete event) for nodes we never accounted for in the first place.

This patch modifies MeshNodeSync to only consider pruning mesh nodes.

Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes.

bimmlerd

looks like a nice find! One code suggestion from my side, but cc @giorio94 who may have more insight

pkg/node/manager/manager.go

maintainer-s-little-helper · 2025-08-12T02:50:06Z

Commits 8bbad0e, 9da9811 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

giorio94

Thanks @0xch4z for finding and fixing this! The changes look good to me, just a couple of nits inline about the test.

pkg/node/manager/manager_test.go

maintainer-s-little-helper · 2025-08-12T14:05:47Z

Commit 14f71bd does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

0xch4z · 2025-08-12T14:09:32Z

Could we also label needs-backport/1.17 🙏

0xch4z · 2025-08-12T14:23:49Z

Apologies for the rebase, commit message failed the linter: https://github.com/cilium/cilium/actions/runs/16897661036/job/47882751543

bimmlerd · 2025-08-12T14:23:54Z

/test

bimmlerd · 2025-08-12T14:25:59Z

Actually, please squash your commits into one (or maybe two, if that makes more sense given the metric changes) before we merge this 🙏

bimmlerd · 2025-08-12T14:28:09Z

Could we also label needs-backport/1.17 🙏

We don't really backport further than one minor according to the https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria, unless there's a strong reason for it. Vendors do provide support for older branches though, and if you do the work of performing the backport yourself there's a chance it gets accepted anyway.

Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes. Currently, with clustermesh, when all remote clusters have finished initial synchronization, `MeshNodeSync` is called, which considers all mesh and non-mesh nodes for pruning. If this happens before we finish listing initial cluster (local) nodes, we could end up considering the restored cluster nodes as stale for pruning. This also causes some miscounting for metrics, where for example we decrement `nodes_all_num` (in `NodeDeleted` for the synthetic delete event) for nodes we never accounted for in the first place. This patch modifies MeshNodeSync to only consider pruning mesh nodes. Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>

DeleteNodes is called with synthetic delete events for nodes that have been deleted between NodeManager restarts. Now, we check that the given node is not a stale node being pruned (that we haven't accounted for) before we decrement num_nodes_all. Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>

0xch4z · 2025-08-12T14:46:08Z

Squashed into 2 commits, separating the metrics fix

bimmlerd · 2025-08-12T14:53:30Z

/test

0xch4z · 2025-08-18T14:08:52Z

reran CI and got everything but the mergeability check passing

bimmlerd · 2025-08-18T14:19:14Z

Sorry - I forgot to remove the don't-merge/discussion label I put on to ensure you'd squash, my bad!

0xch4z · 2025-08-18T15:12:47Z

no worries, thank you!

joamaki · 2025-09-05T10:19:29Z

pkg/node/manager/manager.go

-		if n.Cluster == m.conf.ClusterName || includeMeshed {
-			m.NodeDeleted(*n)
-			delete(m.restoredNodes, id)
+	for _, n := range toDelete {


Could the fact that we're calling this after releasing the mutex be causing #41543? E.g. we call pruneClusterNodes before local node has been added and thus plan to synthesize the delete for it and then when we release the mutex the local node gets added and then we get here and immediately delete it?

0xch4z requested a review from a team as a code owner August 8, 2025 19:39

0xch4z requested a review from bimmlerd August 8, 2025 19:39

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 8, 2025

github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Aug 8, 2025

bimmlerd requested changes Aug 11, 2025

View reviewed changes

pkg/node/manager/manager.go Outdated Show resolved Hide resolved

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025

0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from 9da9811 to 4292f64 Compare August 12, 2025 02:53

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025

giorio94 approved these changes Aug 12, 2025

View reviewed changes

pkg/node/manager/manager_test.go Outdated Show resolved Hide resolved

pkg/node/manager/manager_test.go Outdated Show resolved Hide resolved

giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 12, 2025

maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 12, 2025

giorio94 added affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch labels Aug 12, 2025

bimmlerd approved these changes Aug 12, 2025

View reviewed changes

maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025

0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from 14f71bd to b257b21 Compare August 12, 2025 14:06

maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025

0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from b257b21 to be70a20 Compare August 12, 2025 14:22

bimmlerd added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 12, 2025

0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from be70a20 to f9ef0f4 Compare August 12, 2025 14:42

bimmlerd removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 18, 2025

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 18, 2025

giorio94 enabled auto-merge August 18, 2025 14:30

giorio94 added this pull request to the merge queue Aug 18, 2025

Merged via the queue into cilium:main with commit d2dfc60 Aug 18, 2025
68 of 70 checks passed

joamaki mentioned this pull request Aug 19, 2025

v1.18 Backports 2025-08-19 #41267

Merged

19 tasks

joamaki added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 19, 2025

github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 21, 2025

joamaki mentioned this pull request Sep 5, 2025

CI: cilium-health checks failing to probe local node #41543

Closed

joamaki reviewed Sep 5, 2025

View reviewed changes

cilium-release-bot bot added this to cilium v1.19.0 Feb 3, 2026

cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026

Conversation

0xch4z commented Aug 8, 2025

Uh oh!

bimmlerd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

maintainer-s-little-helper bot commented Aug 12, 2025

Uh oh!

giorio94 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

maintainer-s-little-helper bot commented Aug 12, 2025

Uh oh!

0xch4z commented Aug 12, 2025

Uh oh!

0xch4z commented Aug 12, 2025

Uh oh!

bimmlerd commented Aug 12, 2025

Uh oh!

bimmlerd commented Aug 12, 2025

Uh oh!

bimmlerd commented Aug 12, 2025

Uh oh!

0xch4z commented Aug 12, 2025

Uh oh!

bimmlerd commented Aug 12, 2025

Uh oh!

0xch4z commented Aug 18, 2025

Uh oh!

bimmlerd commented Aug 18, 2025

Uh oh!

0xch4z commented Aug 18, 2025

Uh oh!

Uh oh!

joamaki Sep 5, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants