Bug: topologyInjector race condition #6932

@jukie

Description:
During Envoy pod rollouts, I'm intermittently experiencing webhook failures due to "get pod failed" for 1-2 pods.

When using zone-aware routing, this causes the affected pod to fall back to non-local routing, which introduces a traffic imbalance for the upstream. The only resolution is to delete the problematic pod.

The pod definitely exists and I think it's a race condition between caches being synced. Would adjusting the controller cache settings to more explicitly include/exclude resources help here?

This may also only affect large clusters.

Repro steps:

  • Cluster with >1k nodes and >10k objects
  • Create EnvoyProxy resources with zone aware routing enabled
  • Trigger pod rollout

Environment:
envoy-gateway v1.5.0

Logs:

2025-09-10T14:10:37.770Z ERROR proxy-topology-injector kubernetes/topology_injector.go:53 get pod failed {"pod": "envoy-gateway/envoy-eg-bc4d5f0c-79d789dcb9-kj7ll", "error": "Pod \"envoy-eg-bc4d5f0c-79d789dcb9-kj7ll\" not found"}

Labels: kind/bug
