Bug: topologyInjector race condition #6932

*Description*:
During Envoy pod rollouts, the topology injector webhook intermittently fails with "get pod failed" for 1-2 pods.

With zone-aware routing enabled, the affected pod ends up using non-local routing, which introduces traffic imbalance for the upstream. The only resolution is to delete the problematic pod.

The pod definitely exists, so I think this is a race condition where the webhook looks the pod up before the relevant cache has synced it. Would adjusting the controller cache settings to more explicitly include/exclude resources help here?
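
To illustrate the kind of mitigation I have in mind, here is a minimal sketch, not the actual envoy-gateway code. It retries the cached lookup a few times and then falls back to a direct (uncached) read. `Pod`, `PodReader`, `mapReader`, `getPodWithFallback`, and `isNotFound` are all hypothetical names made up for this example. In a controller-runtime based controller, the "live" reader would presumably correspond to the manager's `GetAPIReader()`, but that mapping is an assumption on my part.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotFound stands in for a Kubernetes "not found" API error (hypothetical).
var ErrNotFound = errors.New("pod not found")

// Pod is a trimmed-down, hypothetical stand-in for a real Pod object.
type Pod struct {
	Namespace, Name, NodeName string
}

// PodReader is a hypothetical abstraction over "look up a pod by key".
// One implementation would be backed by the informer cache, another by
// direct API server reads.
type PodReader interface {
	GetPod(namespace, name string) (*Pod, error)
}

// mapReader is an in-memory PodReader, used here only to simulate a
// stale cache and an up-to-date API server.
type mapReader map[string]*Pod

func (m mapReader) GetPod(namespace, name string) (*Pod, error) {
	if p, ok := m[namespace+"/"+name]; ok {
		return p, nil
	}
	return nil, fmt.Errorf("get %s/%s: %w", namespace, name, ErrNotFound)
}

// isNotFound reports whether err wraps ErrNotFound.
func isNotFound(err error) bool {
	return errors.Is(err, ErrNotFound)
}

// getPodWithFallback tries the cached reader up to `attempts` times,
// sleeping `wait` between tries. If the cache keeps answering
// "not found", it asks the live reader once before giving up.
// Any other error from the cache is returned immediately.
func getPodWithFallback(cached, live PodReader, namespace, name string,
	attempts int, wait time.Duration) (*Pod, error) {
	for i := 0; i < attempts; i++ {
		pod, err := cached.GetPod(namespace, name)
		if err == nil {
			return pod, nil
		}
		if !isNotFound(err) {
			return nil, err
		}
		if i < attempts-1 {
			time.Sleep(wait) // give the cache a moment to catch up
		}
	}
	// The cache may simply be behind; read from the source of truth.
	return live.GetPod(namespace, name)
}
```

The trade-off is extra API server load for the (hopefully rare) cache misses, which is why the sketch only goes to the live reader after the cached attempts are exhausted.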

This may also only affect large clusters.

*Repro steps*:
- Cluster with >1k nodes and >10k objects
- Create EnvoyProxy resources with zone aware routing enabled
- Trigger pod rollout


*Environment*:
envoy-gateway v1.5.0 

*Logs*:

`2025-09-10T14:10:37.770Z	ERROR	proxy-topology-injector	kubernetes/topology_injector.go:53	get pod failed	{"pod": "envoy-gateway/envoy-eg-bc4d5f0c-79d789dcb9-kj7ll", "error": "Pod \"envoy-eg-bc4d5f0c-79d789dcb9-kj7ll\" not found"}`

