setDown() tears down wrong pod's veth in aws-cni chaining when deterministic pod names (e.g. StatefulSet) cause veth reuse #44463
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.19.0 and lower than v1.20.0
What happened?
setDown() brings down the wrong veth interface during endpoint deletion in aws-cni chaining mode
When Cilium runs in aws-cni chaining mode and a StatefulSet pod is rapidly recreated on the same node, setDown() in pkg/endpoint/endpoint.go brings down the new pod's veth interface instead of the old one.
The root cause: setDown() resolves the host-side interface by name via safenetlink.LinkByName(e.HostInterface()). It does not validate that the resolved interface's ifIndex matches the one stored on the endpoint at creation time. When VPC CNI recycles the interface name for the replacement pod (same name because SHA1(namespace + podname) is deterministic for StatefulSets), setDown() targets the new pod's interface.
The ifIndex is already stored on the endpoint during creation (populated in plugins/cilium-cni/chaining/generic-veth/generic-veth.go), but is never consulted during setDown() or Unload().
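For context on why the name repeats: VPC CNI derives the host-side veth name deterministically from the pod identity. A minimal sketch of that derivation follows — the prefix-plus-11-hex-chars shape matches the `eniebfb91e3bd0` names seen in the logs, but the exact separator and helper name used inside amazon-vpc-cni-k8s are assumptions here:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostVethName sketches how VPC CNI derives the host-side interface name:
// prefix + first 11 hex characters of SHA-1 over the pod identity. Because
// StatefulSet pod names are stable, every incarnation of the same pod gets
// the same interface name on the node.
func hostVethName(prefix, namespace, podName string) string {
	sum := sha1.Sum([]byte(namespace + "." + podName))
	return prefix + hex.EncodeToString(sum[:])[:11]
}

func main() {
	// Same pod identity before and after deletion → same veth name.
	fmt.Println(hostVethName("eni", "repro-ns", "repro-0"))
	fmt.Println(hostVethName("eni", "repro-ns", "repro-0") ==
		hostVethName("eni", "repro-ns", "repro-0")) // true
}
```

This is why a name-only lookup in setDown() is ambiguous in chaining mode: the name alone cannot distinguish the old pod's device from its replacement; only the ifIndex can.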
```go
// pkg/endpoint/endpoint.go — current implementation
func (e *Endpoint) setDown() error {
	link, err := safenetlink.LinkByName(e.HostInterface())
	if errors.As(err, &netlink.LinkNotFoundError{}) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("setting interface %s down: %w", e.HostInterface(), err)
	}
	// e.ifIndex is available but never compared to link.Attrs().Index
	return netlink.LinkSetDown(link)
}
```

Impact: The affected pod is Running with Ready: True but has completely dead networking. Cilium reports the endpoint as state: ready, overallHealth: OK. No component detects the failure. The interface remains DOWN permanently unless manually corrected.
Expected behavior: setDown() should verify that the interface it found still belongs to this endpoint by comparing link.Attrs().Index against the stored e.ifIndex. If they differ, the interface was recycled and setDown() should be a no-op.
How can we reproduce the issue?
Prerequisites
- EKS cluster with Cilium in aws-cni chaining mode (Cilium 1.19.0)
- VPC CNI with the default veth prefix `eni` (tested on 1.16.4, but the behavior applies to any version)
Steps
Note: This reproduction uses a sidecar with a PreStop hook that ignores SIGTERM, a deliberately misbehaving workload that widens the race window and matches the real-world conditions where this was discovered. While the misbehavior can be mitigated at the application level, Cilium should be resilient to it: an otherwise safe SIGKILL should not leave the pod's networking corrupted, and workloads should be able to recover on their own.
- Deploy a StatefulSet with `podManagementPolicy: Parallel` pinned to a single node:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
  namespace: repro-ns
spec:
  serviceName: repro
  replicas: 4
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: repro
  template:
    metadata:
      labels:
        app: repro
    spec:
      nodeName: <pick-a-node>
      terminationGracePeriodSeconds: 45
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "while true; do sleep 5; done"]
      - name: slow-sidecar
        image: busybox
        command: ["sh", "-c", "trap '' TERM; while true; do sleep 1; done"]
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 60"]
```

- Wait for all pods to be Running. Record interface states:
```sh
ip -o link show | grep eni | awk '{print $2, $9}'
# All should show state UP
```

- Start `ip monitor link` on the node (via the Cilium agent pod):

```sh
kubectl exec -n kube-system <cilium-pod> -- ip monitor link > /tmp/link-monitor.log &
```

- Delete all pods in the test namespace (normal delete, no force required):

```sh
kubectl delete pods -n <namespace> --all
```

The sidecar's PreStop hook (sleep 60) exceeds terminationGracePeriodSeconds (45s).
Kubelet force-kills the container after the grace period expires, then runs CNI DEL.
Meanwhile, the Parallel StatefulSet controller has already created replacement pods
and their CNI ADD has completed — so CNI DEL for the old pods finds the new interfaces.
- After ~50 seconds (grace period + cleanup), check interface states:

```sh
ip -o link show | grep eni | awk '{print $2, $9}'
# All interfaces that hit the race will show state DOWN despite the new pods being Running
```

The `ip monitor link` output will show:

```
<old-ifindex>: eniXXX state DOWN   ← VPC CNI deletes old interface
Deleted <old-ifindex>: eniXXX
<new-ifindex>: eniXXX state DOWN   ← VPC CNI creates new interface
<new-ifindex>: eniXXX state UP     ← VPC CNI brings it UP
<new-ifindex>: eniXXX state DOWN   ← Cilium setDown() kills it
```
Cilium Version
Client: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64
Daemon: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64
Kernel Version
Linux 6.12.55-74.119.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
Kubernetes Version
Server Version: v1.33.7-eks-ac2d5a0
Regression
This was NOT possible before v1.16.0. The setDown() function was introduced in PR #32167 (merged April 30, 2024, commit 6d80a756db). Versions v1.15.x and earlier do not have this code path and are not affected.
In standalone mode (non-chaining), setDown() is safe because Cilium generates unique host-side interface names (lxc<hash(endpoint_id)>). The issue is specific to chaining mode where the external CNI (VPC CNI) controls interface naming and can produce deterministic, reusable names.
Sysdump
No response
Relevant log output
```
# Cilium agent: new endpoints created, then old endpoints deleted 1s later
time=2026-02-21T08:01:24.289Z msg="Create endpoint request" interface=eniebfb91e3bd0 k8sPodName=veth-race-test/veth-race-test-0 k8sUID=2eb2173e-1065-4b37-9117-e41e9044ec47
time=2026-02-21T08:01:24.916Z msg="Successful endpoint creation" endpointID=1540 ipv4=10.0.197.54
time=2026-02-21T08:01:25.355Z msg="Delete endpoint by containerID request" endpointID=361 containerID=d59022ad31e2 k8sPodName=veth-race-test-0
time=2026-02-21T08:01:25.378Z msg="Removed endpoint" endpointID=361 ipv4=10.0.216.0
# ^^^ setDown() runs during this delete, finds eniebfb91e3bd0 (now ifindex=60, the NEW interface), brings it DOWN

# Cilium CNI plugin: CNI DEL retries 23s later get 404 (endpoint already gone)
time=2026-02-21T08:01:48.674Z level=WARN msg="Errors encountered while deleting endpoint" containerID=d59022ad31e2 error="[DELETE /endpoint][404] deleteEndpointNotFound"

# Old interface (ifindex=56) deleted by VPC CNI, new interface (ifindex=60) created and brought UP, then killed:
56: eniebfb91e3bd0@NONE: <BROADCAST,MULTICAST> mtu 9001 state DOWN
Deleted 56: eniebfb91e3bd0@NONE: state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> state UP
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN   # ← setDown() kills the NEW interface

# Interface state after the race:
# BEFORE: eniebfb91e3bd0 ifindex=56 state=UP
# AFTER:  eniebfb91e3bd0 ifindex=60 state=DOWN (different ifindex = different device, same name)
```

Anything else?
Potential fix: Add ifIndex validation to setDown() before calling LinkSetDown. The ifIndex is already stored on the endpoint at creation time; it just needs to be compared against link.Attrs().Index, with setDown() becoming a no-op when they differ.
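The guard can be sketched with a toy model (illustrative types only, not Cilium's real Endpoint struct or the netlink API):

```go
package main

import "fmt"

// Toy model of the race: interfaces are looked up by name, but the kernel
// ifIndex uniquely identifies the device. These types are illustrative.
type link struct {
	name    string
	ifIndex int
	up      bool
}

// linkByName models safenetlink.LinkByName: a name-based lookup only.
func linkByName(links []*link, name string) *link {
	for _, l := range links {
		if l.name == name {
			return l
		}
	}
	return nil
}

// setDownGuarded models the proposed fix: only act if the resolved device's
// ifIndex still matches the one recorded on the endpoint at creation time.
func setDownGuarded(links []*link, name string, storedIfIndex int) {
	l := linkByName(links, name)
	if l == nil {
		return // LinkNotFoundError: interface already gone, nothing to do
	}
	if l.ifIndex != storedIfIndex {
		return // name was recycled for a new pod: treat as a no-op
	}
	l.up = false
}

func main() {
	// As in the report: the old pod's veth (ifindex=56) is gone and the name
	// now belongs to the new pod's veth (ifindex=60).
	links := []*link{{name: "eniebfb91e3bd0", ifIndex: 60, up: true}}

	// Endpoint deletion for the OLD pod recorded ifIndex=56 at creation time,
	// so the guarded setDown() leaves the new interface alone.
	setDownGuarded(links, "eniebfb91e3bd0", 56)
	fmt.Println(links[0].up) // true
}
```

With the guard in place, the stale CNI DEL becomes harmless: the mismatched ifIndex proves the device it resolved is not the one the endpoint owned.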
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct