Skip to content

node/manager: prevent MeshNodeSync from pruning cluster (local) nodes#41039

Merged
giorio94 merged 2 commits intocilium:mainfrom
0xch4z:pr/ck/nodemanager-prunenodes-fix
Aug 18, 2025
Merged

node/manager: prevent MeshNodeSync from pruning cluster (local) nodes#41039
giorio94 merged 2 commits intocilium:mainfrom
0xch4z:pr/ck/nodemanager-prunenodes-fix

Conversation

@0xch4z
Copy link
Copy Markdown
Contributor

@0xch4z 0xch4z commented Aug 8, 2025

Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes.

Currently, with clustermesh, when all remote clusters have finished initial synchronization, MeshNodeSync is called, which considers all mesh and non-mesh nodes for pruning. If this happens before we finish listing initial cluster (local) nodes, we could end up considering the restored cluster nodes as stale for pruning. This also causes some miscounting for metrics, where for example we decrement nodes_all_num (in NodeDeleted for the synthetic delete event) for nodes we never accounted for in the first place.

This patch modifies MeshNodeSync to only consider pruning mesh nodes.

Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes.

@0xch4z 0xch4z requested a review from a team as a code owner August 8, 2025 19:39
@0xch4z 0xch4z requested a review from bimmlerd August 8, 2025 19:39
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Aug 8, 2025
@github-actions github-actions bot added the kind/community-contribution This was a contribution made by a community member. label Aug 8, 2025
Copy link
Copy Markdown
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a nice find! One code suggestion from my side, but cc @giorio94 who may have more insight

@maintainer-s-little-helper
Copy link
Copy Markdown

Commits 8bbad0e, 9da9811 do not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025
@0xch4z 0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from 9da9811 to 4292f64 Compare August 12, 2025 02:53
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025
Copy link
Copy Markdown
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @0xch4z for finding and fixing this! The changes look good to me, just a couple of nits inline about the test.

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 12, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Aug 12, 2025
@giorio94 giorio94 added affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch labels Aug 12, 2025
@maintainer-s-little-helper
Copy link
Copy Markdown

Commit 14f71bd does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025
@0xch4z 0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from 14f71bd to b257b21 Compare August 12, 2025 14:06
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Aug 12, 2025
@0xch4z
Copy link
Copy Markdown
Contributor Author

0xch4z commented Aug 12, 2025

Could we also label needs-backport/1.17 🙏

@0xch4z 0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from b257b21 to be70a20 Compare August 12, 2025 14:22
@0xch4z
Copy link
Copy Markdown
Contributor Author

0xch4z commented Aug 12, 2025

Apologies for the rebase, commit message failed the linter: https://github.com/cilium/cilium/actions/runs/16897661036/job/47882751543

@bimmlerd
Copy link
Copy Markdown
Member

/test

@bimmlerd
Copy link
Copy Markdown
Member

Actually, please squash your commits into one (or maybe two, if that makes more sense given the metric changes) before we merge this 🙏

@bimmlerd bimmlerd added the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 12, 2025
@bimmlerd
Copy link
Copy Markdown
Member

Could we also label needs-backport/1.17 🙏

We don't really backport further than one minor according to the https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria, unless there's a strong reason for it. Vendors do provide support for older branches though, and if you do the work of performing the backport yourself there's a chance it gets accepted anyway.

Fixes an issue in NodeManager where restored cluster nodes can be pruned
before the initial node listing completes.

Currently, with clustermesh, when all remote clusters have finished
initial synchronization, `MeshNodeSync` is called, which considers all
mesh and non-mesh nodes for pruning. If this happens before we finish
listing initial cluster (local) nodes, we could end up considering the
restored cluster nodes as stale for pruning. This also causes some
miscounting for metrics, where for example we decrement `nodes_all_num`
(in `NodeDeleted` for the synthetic delete event) for nodes we never
accounted for in the first place.

This patch modifies MeshNodeSync to only consider pruning mesh nodes.

Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>
DeleteNodes is called with synthetic delete events for nodes that have
been deleted between NodeManager restarts. Now, we check that the given
node is not a stale node being pruned (that we haven't accounted for)
before we decrement num_nodes_all.

Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>
@0xch4z 0xch4z force-pushed the pr/ck/nodemanager-prunenodes-fix branch from be70a20 to f9ef0f4 Compare August 12, 2025 14:42
@0xch4z
Copy link
Copy Markdown
Contributor Author

0xch4z commented Aug 12, 2025

Squashed into 2 commits, separating the metrics fix

@bimmlerd
Copy link
Copy Markdown
Member

/test

@0xch4z
Copy link
Copy Markdown
Contributor Author

0xch4z commented Aug 18, 2025

reran CI and got everything but the mergeability check passing

@bimmlerd bimmlerd removed the dont-merge/discussion A discussion is ongoing and should be resolved before merging, regardless of reviews & tests status. label Aug 18, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 18, 2025
@bimmlerd
Copy link
Copy Markdown
Member

Sorry - I forgot to remove the don't-merge/discussion label I put on to ensure you'd squash, my bad!

@giorio94 giorio94 enabled auto-merge August 18, 2025 14:30
@0xch4z
Copy link
Copy Markdown
Contributor Author

0xch4z commented Aug 18, 2025

no worries, thank you!

@giorio94 giorio94 added this pull request to the merge queue Aug 18, 2025
Merged via the queue into cilium:main with commit d2dfc60 Aug 18, 2025
68 of 70 checks passed
@joamaki joamaki mentioned this pull request Aug 19, 2025
19 tasks
@joamaki joamaki added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Aug 19, 2025
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Aug 21, 2025
if n.Cluster == m.conf.ClusterName || includeMeshed {
m.NodeDeleted(*n)
delete(m.restoredNodes, id)
for _, n := range toDelete {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the fact that we're calling this after releasing the mutex be causing #41543? E.g. we call pruneClusterNodes before local node has been added and thus plan to synthesize the delete for it and then when we release the mutex the local node gets added and then we get here and immediately delete it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. kind/bug This is a bug in the Cilium logic. kind/community-contribution This was a contribution made by a community member. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

5 participants