Loadbalancer NLB Target Group health checks failing since upgrade to v1.16.0 #34093

@okamosy

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

higher than v1.16.0 and lower than v1.17.0

What happened?

I am running a cluster in AWS EKS with three nodes, using the Cilium ingress controller to manage a network load balancer in AWS. After upgrading to v1.16.0, all health checks on the target groups began to fail. Strangely, replacing a node would allow the checks on port 80 for the new instance to return healthy, at least until the Cilium DaemonSet was restarted.

While troubleshooting the issue, I noticed that cilium status would periodically return the following error:

controller node-neighbor-link is failing since 4s (2x): unable to determine next hop IPv4 address for eth1 (<node_ip>): remote node IP is non-routable
unable to determine next hop IPv4 address for eth2 (<node_ip>): remote node IP is non-routable
unable to determine next hop IPv4 address for eth3 (<node_ip>): remote node IP is non-routable
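To illustrate what this error is about (a simplified sketch, not Cilium's actual code): the agent tries to resolve a next-hop neighbor for each remote node on each device, and if the remote node's IP is not covered by any prefix directly reachable (on-link) from that interface, it is reported as non-routable. The function and prefix values below are hypothetical:

```python
# Illustrative sketch only: models the idea that a next hop is
# "non-routable" from an interface when the remote IP falls outside
# every on-link prefix of that interface.
import ipaddress

def next_hop_routable(remote_ip: str, on_link_prefixes: list[str]) -> bool:
    """Return True if remote_ip is covered by one of the interface's
    directly connected (on-link) prefixes."""
    ip = ipaddress.ip_address(remote_ip)
    return any(ip in ipaddress.ip_network(p) for p in on_link_prefixes)

# Example: if eth1 only has an on-link route for 10.0.1.0/24, a peer
# in 10.0.2.0/24 would be considered non-routable from that interface.
print(next_hop_routable("10.0.1.15", ["10.0.1.0/24"]))  # True
print(next_hop_routable("10.0.2.7",  ["10.0.1.0/24"]))  # False
```

In an ENI setup with multiple interfaces per node, this is why the error can appear for eth1/eth2/eth3 but not eth0: each ENI only has on-link routes for its own subnet.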

I exec'd into the pod reporting this and ran cilium-dbg status; the only thing of note was this line:

Modules Health: Stopped(0) Degraded(3) OK(148)

Running cilium-dbg status --verbose showed that node-manager was degraded with the message Failed node neighbor link update. I have not been able to determine whether the two issues are connected. I also stripped our values.yaml down to a bare minimum of settings to rule out our configuration as the cause, but that did not restore the health checks.
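For anyone retracing the inspection steps above, something like the following should work (pod and node names will differ per cluster; the k8s-app=cilium label selector assumed here is the one the standard chart applies to agent pods):

```shell
# List the Cilium agent pods (assumes the standard k8s-app=cilium label)
kubectl -n kube-system get pods -l k8s-app=cilium -o wide

# Check agent health from inside the pod reporting the error
kubectl -n kube-system exec <cilium-pod> -- cilium-dbg status --verbose

# On the affected node, inspect the kernel neighbor table that the
# node-neighbor-link controller maintains
ip neigh show
```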

How can we reproduce the issue?

  1. Install Cilium with the following values.yaml file:
egressMasqueradeInterfaces: eth0
eni:
  awsReleaseExcessIPs: true
  enabled: true
envoy:
  enabled: true
ingressController:
  enableProxyProtocol: false
  enabled: true
  loadbalancerMode: shared
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-alpn-policy: HTTP2Preferred
      service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance
      service.beta.kubernetes.io/aws-load-balancer-ssl-cert: <ssl-cert-arn>
      service.beta.kubernetes.io/aws-load-balancer-ssl-negotiation-policy: ELBSecurityPolicy-TLS13-1-2-2001-06
      service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
      service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=false
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
ipam:
  mode: eni
ipv4NativeRoutingCIDR: <CIDR>
k8sServiceHost: <eks-endpoint>
k8sServicePort: 443
kubeProxyReplacement: true
routingMode: native
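Assuming the standard Cilium Helm chart, step 1 would look something like the following (repo URL and release name per Cilium's Helm install docs, version pinned to the release where the regression appeared):

```shell
# Sketch of the install used for reproduction; values.yaml is the
# file shown above with the placeholders filled in.
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
  --namespace kube-system \
  --version 1.16.0 \
  -f values.yaml
```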

Cilium Version

v1.16.0

Kernel Version

5.10.219-208.866.amzn2.x86_64

Kubernetes Version

v1.29.4-eks-036c24b

Regression

v1.15.6

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
    kind/bug: This is a bug in the Cilium logic.
    kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
    needs/triage: This issue requires triaging to establish severity and next steps.
    stale: The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests