Loadbalancer NLB Target Group health checks failing since upgrade to v1.16.0 #34093
Closed as not planned
Labels
area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
needs/triage: This issue requires triaging to establish severity and next steps.
stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.
Is there an existing issue for this?
Version
higher than v1.16.0 and lower than v1.17.0
What happened?
I am running a cluster in AWS EKS with three nodes, using the Cilium ingress controller to manage a network load balancer within AWS. After upgrading to 1.16.0, all health checks on the target groups began to fail. Strangely, replacing a node would allow the checks on port 80 for the new instance to return healthy, at least until the Cilium DaemonSet is restarted.
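For context, a setup like the one described above, with Cilium's ingress controller fronting an AWS NLB, is typically enabled through Helm values along these lines. This is a minimal sketch only: the reporter's actual values.yaml was not shared, and the annotations shown are the standard AWS Load Balancer Controller ones, included here as assumptions:

```yaml
# Hypothetical minimal Helm values for Cilium's ingress controller
# behind an AWS NLB; not the reporter's actual configuration.
ingressController:
  enabled: true
  loadbalancerMode: dedicated
  service:
    annotations:
      # Delegate provisioning to the AWS Load Balancer Controller (NLB).
      service.beta.kubernetes.io/aws-load-balancer-type: external
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
      service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
```

With instance-type targets, the NLB health checks hit a node port on each instance, which is why per-node health status can diverge as described.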
While troubleshooting the issue, I noticed that `cilium status` would periodically return an error. I exec'd into the pod reporting this and ran `cilium-dbg status`, and the only thing of note was a single line. Running `cilium-dbg status --verbose` showed that node-manager was degraded with the message `Failed node neighbor link update`. I have not been able to determine whether the two issues are connected. I have also stripped down our values.yaml to the bare minimum of settings to see whether our configuration is at fault, but have not been able to restore the health checks.
How can we reproduce the issue?
Cilium Version
v1.16.0
Kernel Version
5.10.219-208.866.amzn2.x86_64
Kubernetes Version
v1.29.4-eks-036c24b
Regression
v1.15.6
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Cilium Users Document
Code of Conduct