setDown() tears down wrong pod's veth in aws-cni chaining when deterministic pod names (e.g. StatefulSet) cause veth reuse #44463

@yr1453

Description


Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.19.0 and lower than v1.20.0

What happened?

setDown() brings down the wrong veth interface during endpoint deletion in aws-cni chaining mode

When Cilium runs in aws-cni chaining mode and a StatefulSet pod is rapidly recreated on the same node, setDown() in pkg/endpoint/endpoint.go brings down the new pod's veth interface instead of the old one.

The root cause: setDown() resolves the host-side interface by name via safenetlink.LinkByName(e.HostInterface()). It does not validate that the resolved interface's ifIndex matches the one stored on the endpoint at creation time. When VPC CNI recycles the interface name for the replacement pod (same name because SHA1(namespace + podname) is deterministic for StatefulSets), setDown() targets the new pod's interface.
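To illustrate why StatefulSet pods collide on the host-side name, here is a minimal sketch of deterministic veth naming. The exact hash input and truncation length are assumptions modeled on VPC CNI's eni<hash> scheme, not a verbatim copy of its code:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

// hostVethName mimics (as an assumption) the VPC CNI scheme: a fixed prefix
// plus a truncated SHA1 of the namespace and pod name. Because StatefulSet
// pod names are stable (repro-0, repro-1, ...), the replacement pod gets
// exactly the same host-side interface name as the pod it replaces.
func hostVethName(prefix, namespace, podName string) string {
	h := sha1.New()
	h.Write([]byte(fmt.Sprintf("%s.%s", namespace, podName)))
	return prefix + hex.EncodeToString(h.Sum(nil))[:11]
}

func main() {
	oldName := hostVethName("eni", "repro-ns", "repro-0") // original pod
	newName := hostVethName("eni", "repro-ns", "repro-0") // replacement pod
	fmt.Println(oldName == newName)                       // true: name is recycled
}
```

A name lookup alone therefore cannot distinguish the old device from its replacement; only the ifindex (which the kernel never reuses for a live device with the same name at the same time) can.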

The ifIndex is already stored on the endpoint during creation (populated in plugins/cilium-cni/chaining/generic-veth/generic-veth.go), but is never consulted during setDown() or Unload().

// pkg/endpoint/endpoint.go — current implementation
func (e *Endpoint) setDown() error {
    link, err := safenetlink.LinkByName(e.HostInterface())
    if errors.As(err, &netlink.LinkNotFoundError{}) {
        return nil
    }
    if err != nil {
        return fmt.Errorf("setting interface %s down: %w", e.HostInterface(), err)
    }
    // e.ifIndex is available but never compared to link.Attrs().Index
    return netlink.LinkSetDown(link)
}

Impact: The affected pod is Running with Ready: True but has completely dead networking. Cilium reports the endpoint as state: ready, overallHealth: OK. No component detects the failure. The interface remains DOWN permanently unless manually corrected.

Expected behavior: setDown() should verify that the interface it found still belongs to this endpoint by comparing link.Attrs().Index against the stored e.ifIndex. If they differ, the interface was recycled and setDown() should be a no-op.

How can we reproduce the issue?

Prerequisites

  • EKS cluster with Cilium in aws-cni chaining mode (Cilium 1.19.0)
  • VPC CNI with the default veth prefix eni (tested on 1.16.4, but any version is affected)

Steps

Note: This reproduction uses a sidecar that ignores SIGTERM and has a PreStop hook that outlives the grace period — a deliberately misbehaving workload that widens the race window, matching the real-world conditions where this was discovered. The workload can be fixed at the application level, but Cilium should be resilient here: a SIGKILL of an otherwise-safe pod should not leave its replacement with corrupted networking.

  1. Deploy a StatefulSet with podManagementPolicy: Parallel pinned to a single node:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: repro
  namespace: repro-ns
spec:
  serviceName: repro
  replicas: 4
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: repro
  template:
    metadata:
      labels:
        app: repro
    spec:
      nodeName: <pick-a-node>
      terminationGracePeriodSeconds: 45
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "while true; do sleep 5; done"]
        - name: slow-sidecar
          image: busybox
          command: ["sh", "-c", "trap '' TERM; while true; do sleep 1; done"]
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 60"]
  2. Wait for all pods to be Running. Record interface states:
ip -o link show | grep eni | awk '{print $2, $9}'
# All should show state UP
  3. Start ip monitor link on the node (via the Cilium agent pod):
kubectl exec -n kube-system <cilium-pod> -- ip monitor link > /tmp/link-monitor.log &
  4. Delete all pods in the test namespace (normal delete, no force required):
kubectl delete pods -n <namespace> --all

The sidecar's PreStop hook (sleep 60) exceeds terminationGracePeriodSeconds (45s).
Kubelet force-kills the container after the grace period expires, then runs CNI DEL.
Meanwhile, the Parallel StatefulSet controller has already created replacement pods
and their CNI ADD has completed — so CNI DEL for the old pods finds the new interfaces.

  5. After ~50 seconds (grace period + cleanup), check interface states:
ip -o link show | grep eni | awk '{print $2, $9}'
# Interfaces that hit the race will show state DOWN despite the new pods being Running
  6. The ip monitor link output will show:
<old-ifindex>: eniXXX state DOWN          ← VPC CNI deletes old interface
Deleted <old-ifindex>: eniXXX
<new-ifindex>: eniXXX state DOWN          ← VPC CNI creates new interface
<new-ifindex>: eniXXX state UP            ← VPC CNI brings it UP
<new-ifindex>: eniXXX state DOWN          ← Cilium setDown() kills it

Cilium Version

Client: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64
Daemon: 1.19.0 7c6667e 2026-02-03T16:36:49+01:00 go version go1.25.6 linux/amd64

Kernel Version

Linux 6.12.55-74.119.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux

Kubernetes Version

Server Version: v1.33.7-eks-ac2d5a0

Regression

This was NOT possible before v1.16.0. The setDown() function was introduced in PR #32167 (merged April 30, 2024, commit 6d80a756db). Versions v1.15.x and earlier do not have this code path and are not affected.

In standalone mode (non-chaining), setDown() is safe because Cilium generates unique host-side interface names (lxc<hash(endpoint_id)>). The issue is specific to chaining mode where the external CNI (VPC CNI) controls interface naming and can produce deterministic, reusable names.

Sysdump

No response

Relevant log output

# Cilium agent: new endpoints created, then old endpoints deleted 1s later
time=2026-02-21T08:01:24.289Z msg="Create endpoint request" interface=eniebfb91e3bd0 k8sPodName=veth-race-test/veth-race-test-0 k8sUID=2eb2173e-1065-4b37-9117-e41e9044ec47
time=2026-02-21T08:01:24.916Z msg="Successful endpoint creation" endpointID=1540 ipv4=10.0.197.54

time=2026-02-21T08:01:25.355Z msg="Delete endpoint by containerID request" endpointID=361 containerID=d59022ad31e2 k8sPodName=veth-race-test-0
time=2026-02-21T08:01:25.378Z msg="Removed endpoint" endpointID=361 ipv4=10.0.216.0
# ^^^ setDown() runs during this delete, finds eniebfb91e3bd0 (now ifindex=60, the NEW interface), brings it DOWN

# Cilium CNI plugin: CNI DEL retries 23s later get 404 (endpoint already gone)
time=2026-02-21T08:01:48.674Z level=WARN msg="Errors encountered while deleting endpoint" containerID=d59022ad31e2 error="[DELETE /endpoint][404] deleteEndpointNotFound"

# Old interface (ifindex=56) deleted by VPC CNI, new interface (ifindex=60) created and brought UP, then killed:
56: eniebfb91e3bd0@NONE: <BROADCAST,MULTICAST> mtu 9001 state DOWN
Deleted 56: eniebfb91e3bd0@NONE: state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> state UP
60: eniebfb91e3bd0@if3: <BROADCAST,MULTICAST> state DOWN           # ← setDown() kills the NEW interface

# Interface state after the race:
# BEFORE: eniebfb91e3bd0  ifindex=56  state=UP
# AFTER:  eniebfb91e3bd0  ifindex=60  state=DOWN  (different ifindex = different device, same name)

Anything else?

Potential fix: Add ifIndex validation to setDown() before calling LinkSetDown. The ifIndex is already stored on the endpoint at creation time — it just needs to be checked.
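A minimal, self-contained sketch of the proposed guard follows. The types here are simplified stand-ins (not Cilium's real Endpoint or netlink types), and setDownGuarded is a hypothetical name; the point is only the ifindex comparison:

```go
package main

import "fmt"

// Simplified stand-ins for the real netlink/endpoint types.
type link struct {
	name    string
	ifIndex int
}

type endpoint struct {
	hostIfName string
	ifIndex    int // recorded at endpoint creation (CNI ADD)
}

// setDownGuarded sketches the proposed check: after resolving the interface
// by name, verify that the resolved device's ifindex still matches the one
// stored on the endpoint. If it differs, the name was recycled by the
// external CNI, so we must not touch the (new) device. It returns the name
// of the interface it would have acted on, or "" for the no-op case.
func setDownGuarded(e endpoint, resolved link) string {
	if resolved.ifIndex != e.ifIndex {
		// Name reuse detected: no-op instead of downing the new pod's veth.
		return ""
	}
	return resolved.name // here the real code would call netlink.LinkSetDown(link)
}

func main() {
	old := endpoint{hostIfName: "eniebfb91e3bd0", ifIndex: 56}
	// By deletion time, the name resolves to the NEW device (ifindex 60).
	recycled := link{name: "eniebfb91e3bd0", ifIndex: 60}
	fmt.Printf("acted on: %q\n", setDownGuarded(old, recycled)) // acted on: ""
}
```

The same ifindex check would presumably also apply to Unload(), which shares the name-based lookup.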

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

  • kind/bug — This is a bug in the Cilium logic.
  • kind/community-report — This was reported by a user in the Cilium community, e.g. via Slack.
  • needs/triage — This issue requires triaging to establish severity and next steps.
