setDown() tears down wrong pod's veth in aws-cni chaining when deterministic pod names (e.g. StatefulSet) cause veth reuse #44463

@yr1453

Description


Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.19.0 and lower than v1.20.0

What happened?

setDown() brings down the wrong veth interface during endpoint deletion in aws-cni chaining mode

When Cilium runs in aws-cni chaining mode and a StatefulSet pod is rapidly recreated on the same node, setDown() in pkg/endpoint/endpoint.go brings down the new pod's veth interface instead of the old one.

The root cause: setDown() resolves the host-side interface by name via safenetlink.LinkByName(e.HostInterface()). It does not validate that the resolved interface's ifIndex matches the one stored on the endpoint at creation time. When VPC CNI recycles the interface name for the replacement pod (same name because SHA1(namespace + podname) is deterministic for StatefulSets), setDown() targets the new pod's interface.
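To illustrate why StatefulSet pods collide on the host-side name, here is a minimal sketch of deterministic veth naming. The exact hash input and truncation length are assumptions modeled on VPC CNI's eni<hash> scheme, not a verbatim copy of its code:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostVethName mimics (as an assumption) the VPC CNI scheme: a fixed prefix
// plus a truncated SHA1 of the namespace and pod name. Because StatefulSet
// pod names are stable (repro-0, repro-1, ...), the replacement pod gets
// exactly the same host-side interface name as the pod it replaces.
func hostVethName(prefix, namespace, podName string) string {
	h := sha1.New()
	h.Write([]byte(fmt.Sprintf("%s.%s", namespace, podName)))
	return prefix + hex.EncodeToString(h.Sum(nil))[:11]
}

func main() {
	oldName := hostVethName("eni", "repro-ns", "repro-0") // original pod
	newName := hostVethName("eni", "repro-ns", "repro-0") // replacement pod
	fmt.Println(oldName == newName)                       // true: name is recycled
}
```

A name lookup alone therefore cannot distinguish the old device from its replacement; only the ifindex (which the kernel never reuses for a live device with the same name at the same time) can.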

The ifIndex is already stored on the endpoint during creation (populated in plugins/cilium-cni/chaining/generic-veth/generic-veth.go), but is never consulted during setDown() or Unload().

// pkg/endpoint/endpoint.go — current implementation
func (e *Endpoint) setDown() error {
    link, err := safenetlink.LinkByName(e.HostInterface())
    if errors.As(err, &netlink.LinkNotFoundError{}) {
        return nil
    }
    if err != nil {
        return fmt.Errorf("setting interface %s down: %w", e.HostInterface(), err)
    }
    // e.ifIndex is available but never compared to link.Attrs().Index
    return netlink.LinkSetDown(link)
}

Impact: The affected pod is Running with Ready: True but has completely dead networking. Cilium reports the endpoint as state: ready, overallHealth: OK. No component detects the failure. The interface remains DOWN permanently unless manually corrected.

Expected behavior: setDown() should verify that the interface it found still belongs to this endpoint by comparing link.Attrs().Index against the stored e.ifIndex. If they differ, the interface was recycled and setDown() should be a no-op.

How can we reproduce the issue?

Prerequisites

  • EKS cluster with Cilium in aws-cni chaining mode (Cilium 1.19.0)
  • VPC CNI with the default veth prefix eni (tested on 1.16.4, but any version is affected)

Steps

Note: This reproduction uses a sidecar that ignores SIGTERM and has a PreStop hook that outlives the grace period — a deliberately misbehaving workload that widens the race window, matching the real-world conditions where this was discovered. The workload can be fixed at the application level, but Cilium should be resilient here: a SIGKILL of an otherwise-safe pod should not leave its replacement with corrupted networking.

  1. Deploy a StatefulSet with podManagementPolicy: Parallel pinned to a single node:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
  namespace: repro-ns
spec:
  serviceName: repro
  replicas: 4
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: repro
  template:
    metadata:
      labels:
        app: repro
    spec:
      nodeName: <pick-a-node>
      terminationGracePeriodSeconds: 45
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "while true; do sleep 5; done"]
        - name: slow-sidecar
          image: busybox
          command: ["sh", "-c", "trap '' TERM; while true; do sleep 1; done"]
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 60"]
  2. Wait for all pods to be Running. Record interface states:
ip -o link show | grep eni | awk '{print $2, $9}'
# All should show state UP
  3. Start ip monitor link on the node (via the Cilium agent pod):
kubectl exec -n kube-system <cilium-pod> -- ip monitor link > /tmp/link-monitor.log &
  4. Delete all pods in the test namespace (normal delete, no force required):
kubectl delete pods -n <namespace> --all

The sidecar's PreStop hook (sleep 60) exceeds terminationGracePeriodSeconds (45s).
Kubelet force-kills the container after the grace period expires, then runs CNI DEL.
Meanwhile, the Parallel StatefulSet controller has already created replacement pods
and their CNI ADD has completed — so CNI DEL for the old pods finds the new interfaces.

  5. After ~50 seconds (grace period + cleanup), check interface states:
ip -o link show | grep eni | awk '{print $2, $9}'
# Interfaces that hit the race will show state DOWN despite the new pods being Running
  6. The ip monitor link output will show:
<old-ifindex>: eniXXX state DOWN          ← VPC CNI deletes old interface
Deleted <old-ifindex>: eniXXX
<new-ifindex>: eniXXX state DOWN          ← VPC CNI creates new interface
<new-ifindex>: eniXXX state UP            ← VPC CNI brings it UP
<new-ifindex>: eniXXX state DOWN          ← Cilium setDown() kills it

Cilium Version

Client: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64
Daemon: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64

Kernel Version

Linux 6.12.55-74.119.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux

Kubernetes Version

Server Version: v1.33.7-eks-ac2d5a0

Regression

This was NOT possible before v1.16.0. The setDown() function was introduced in PR #32167 (merged April 30, 2024, commit 6d80a756db). Versions v1.15.x and earlier do not have this code path and are not affected.

In standalone mode (non-chaining), setDown() is safe because Cilium generates unique host-side interface names (lxc<hash(endpoint_id)>). The issue is specific to chaining mode where the external CNI (VPC CNI) controls interface naming and can produce deterministic, reusable names.

Sysdump

No response

Relevant log output

# Cilium agent: new endpoints created, then old endpoints deleted 1s later
time=2026-02-21T08:01:24.289Z msg="Create endpoint request" interface=eniebfb91e3bd0 k8sPodName=veth-race-test/veth-race-test-0 k8sUID=2eb2173e-1065-4b37-9117-e41e9044ec47
time=2026-02-21T08:01:24.916Z msg="Successful endpoint creation" endpointID=1540 ipv4=10.0.197.54

time=2026-02-21T08:01:25.355Z msg="Delete endpoint by containerID request" endpointID=361 containerID=d59022ad31e2 k8sPodName=veth-race-test-0
time=2026-02-21T08:01:25.378Z msg="Removed endpoint" endpointID=361 ipv4=10.0.216.0
# ^^^ setDown() runs during this delete, finds eniebfb91e3bd0 (now ifindex=60, the NEW interface), brings it DOWN

# Cilium CNI plugin: CNI DEL retries 23s later get 404 (endpoint already gone)
time=2026-02-21T08:01:48.674Z level=WARN msg="Errors encountered while deleting endpoint" containerID=d59022ad31e2 error="[DELETE /endpoint][404] deleteEndpointNotFound"

# Old interface (ifindex=56) deleted by VPC CNI, new interface (ifindex=60) created and brought UP, then killed:
56: eniebfb91e3bd0@NONE: <BROADCAST,MULTICAST> mtu 9001 state DOWN
Deleted 56: eniebfb91e3bd0@NONE: state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> state UP
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN           # ← setDown() kills the NEW interface

# Interface state after the race:
# BEFORE: eniebfb91e3bd0  ifindex=56  state=UP
# AFTER:  eniebfb91e3bd0  ifindex=60  state=DOWN  (different ifindex = different device, same name)

Anything else?

Potential fix: Add ifIndex validation to setDown() before calling LinkSetDown. The ifIndex is already stored on the endpoint at creation time — it just needs to be checked.
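A minimal, self-contained sketch of the proposed guard follows. The types here are simplified stand-ins (not Cilium's real Endpoint or netlink types), and setDownGuarded is a hypothetical name; the point is only the ifindex comparison:

```go
package main

import "fmt"

// Simplified stand-ins for the real netlink/endpoint types.
type link struct {
	name    string
	ifIndex int
}

type endpoint struct {
	hostIfName string
	ifIndex    int // recorded at endpoint creation (CNI ADD)
}

// setDownGuarded sketches the proposed check: after resolving the interface
// by name, verify that the resolved device's ifindex still matches the one
// stored on the endpoint. If it differs, the name was recycled by the
// external CNI, so we must not touch the (new) device. It returns the name
// of the interface it would have acted on, or "" for the no-op case.
func setDownGuarded(e endpoint, resolved link) string {
	if resolved.ifIndex != e.ifIndex {
		// Name reuse detected: no-op instead of downing the new pod's veth.
		return ""
	}
	return resolved.name // here the real code would call netlink.LinkSetDown(link)
}

func main() {
	old := endpoint{hostIfName: "eniebfb91e3bd0", ifIndex: 56}
	// By deletion time, the name resolves to the NEW device (ifindex 60).
	recycled := link{name: "eniebfb91e3bd0", ifIndex: 60}
	fmt.Printf("acted on: %q\n", setDownGuarded(old, recycled)) // acted on: ""
}
```

The same ifindex check would presumably also apply to Unload(), which shares the name-based lookup.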

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • kind/bug — This is a bug in the Cilium logic.
  • kind/community-report — This was reported by a user in the Cilium community, e.g. via Slack.
  • needs/triage — This issue requires triaging to establish severity and next steps.
