node/manager: prevent MeshNodeSync from pruning cluster (local) nodes#41039
node/manager: prevent MeshNodeSync from pruning cluster (local) nodes#41039giorio94 merged 2 commits intocilium:mainfrom
Conversation
|
Commits 8bbad0e, 9da9811 do not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
9da9811 to
4292f64
Compare
|
Commit 14f71bd does not match "(?m)^Signed-off-by:". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
14f71bd to
b257b21
Compare
|
Could we also label |
b257b21 to
be70a20
Compare
|
Apologies for the rebase, commit message failed the linter: https://github.com/cilium/cilium/actions/runs/16897661036/job/47882751543 |
|
/test |
|
Actually, please squash your commits into one (or maybe two, if that makes more sense given the metric changes) before we merge this 🙏 |
We don't really backport further than one minor according to the https://docs.cilium.io/en/stable/contributing/release/backports/#backport-criteria, unless there's a strong reason for it. Vendors do provide support for older branches though, and if you do the work of performing the backport yourself there's a chance it gets accepted anyway. |
Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes. Currently, with clustermesh, when all remote clusters have finished initial synchronization, `MeshNodeSync` is called, which considers all mesh and non-mesh nodes for pruning. If this happens before we finish listing initial cluster (local) nodes, we could end up considering the restored cluster nodes as stale for pruning. This also causes some miscounting for metrics, where for example we decrement `nodes_all_num` (in `NodeDeleted` for the synthetic delete event) for nodes we never accounted for in the first place. This patch modifies MeshNodeSync to only consider pruning mesh nodes. Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>
DeleteNodes is called with synthetic delete events for nodes that have been deleted between NodeManager restarts. Now, we check that the given node is not a stale node being pruned (that we haven't accounted for) before we decrement num_nodes_all. Signed-off-by: Charlie Kenney <charles.kenney@isovalent.com>
be70a20 to
f9ef0f4
Compare
|
Squashed into 2 commits, separating the metrics fix |
|
/test |
|
reran CI and got everything but the mergeability check passing |
|
Sorry - I forgot to remove the |
|
no worries, thank you! |
| if n.Cluster == m.conf.ClusterName || includeMeshed { | ||
| m.NodeDeleted(*n) | ||
| delete(m.restoredNodes, id) | ||
| for _, n := range toDelete { |
There was a problem hiding this comment.
Could the fact that we're calling this after releasing the mutex be causing #41543? E.g. we call pruneClusterNodes before local node has been added and thus plan to synthesize the delete for it and then when we release the mutex the local node gets added and then we get here and immediately delete it?
Fixes an issue in NodeManager where restored cluster nodes can be pruned before the initial node listing completes.
Currently, with clustermesh, when all remote clusters have finished initial synchronization,
MeshNodeSyncis called, which considers all mesh and non-mesh nodes for pruning. If this happens before we finish listing initial cluster (local) nodes, we could end up considering the restored cluster nodes as stale for pruning. This also causes some miscounting for metrics, where for example we decrementnodes_all_num(inNodeDeletedfor the synthetic delete event) for nodes we never accounted for in the first place.This patch modifies MeshNodeSync to only consider pruning mesh nodes.