Cilium should reconcile missed node delete events #29822

@rgo3

Description

Cilium can miss NodeDelete events from Kubernetes, e.g. when nodes are deleted during an upgrade while the Cilium agent is down. This leaks resources (e.g. XFRM states/policies, node IDs) if a Cilium subsystem that subscribes to node events doesn't implement reconciliation or garbage collection. #26298 (comment) offers a handful of steps to reproduce the bug.

Some subsystems already implement garbage collection to a certain degree for themselves, e.g. the WireGuard agent. However, any feature that depends on the linuxNodeHandler for cleanup suffers from the described issue.

This issue should track the design and development of a reconciler that ensures the Cilium agent knows about the same set of nodes that currently exist in the Kubernetes cluster. The reconciler would trigger the cleanup of node-specific resources and unify garbage collection across all subsystems that subscribe to the NodeManager.

An implementation approach could be based on the Generic Reconciler that was merged into the Cilium code base for 1.16.
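
To make the proposal concrete, here is a minimal sketch of the set-difference step such a reconciler would need: compare the nodes the agent knows about with the nodes currently in the cluster, and run cleanup for the ones that have disappeared. It does not use the Generic Reconciler API or any existing Cilium type. `NodeCleaner`, `staleNodes`, `reconcile`, and `logCleaner` are hypothetical names invented for illustration, and the node sets are plain maps rather than real NodeManager state.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeCleaner is a hypothetical hook (not a Cilium interface) that a
// subsystem holding per-node resources would implement.
type NodeCleaner interface {
	CleanupNode(name string) error
}

// staleNodes returns the names present in known (the agent's local view)
// but absent from desired (the nodes currently in the cluster).
// The result is sorted so that cleanup order is deterministic.
func staleNodes(known, desired map[string]struct{}) []string {
	var stale []string
	for name := range known {
		if _, ok := desired[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

// reconcile runs every cleaner for each stale node. A node is dropped from
// known only if all cleaners succeeded, so a failed cleanup is retried on
// the next pass instead of being forgotten.
func reconcile(known, desired map[string]struct{}, cleaners []NodeCleaner) []error {
	var errs []error
	for _, name := range staleNodes(known, desired) {
		failed := false
		for _, c := range cleaners {
			if err := c.CleanupNode(name); err != nil {
				errs = append(errs, fmt.Errorf("cleanup of node %q: %w", name, err))
				failed = true
			}
		}
		if !failed {
			delete(known, name)
		}
	}
	return errs
}

// logCleaner is a stand-in cleaner that only records which nodes it saw.
type logCleaner struct{ cleaned []string }

func (l *logCleaner) CleanupNode(name string) error {
	l.cleaned = append(l.cleaned, name)
	return nil
}

func main() {
	// The agent still remembers node-b and node-c, whose delete events it missed.
	known := map[string]struct{}{"node-a": {}, "node-b": {}, "node-c": {}}
	desired := map[string]struct{}{"node-a": {}}

	lc := &logCleaner{}
	errs := reconcile(known, desired, []NodeCleaner{lc})
	fmt.Println(lc.cleaned, len(known), len(errs)) // prints: [node-b node-c] 1 0
}
```

Running a pass like this at agent startup and then periodically would cover the case where the agent was down when the delete happened. How the desired set is obtained and how often to reconcile are left open here, since the issue does not specify them.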

Metadata

Labels

area/agent: Cilium agent related.
kind/bug: This is a bug in the Cilium logic.
pinned: These issues are not marked stale by our issue bot.
