Yet another issue with unexpected fqdn egress dropped traffic #44714

@riuvshyn

Description

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.7 and lower than v1.19.0

What happened?

We are experiencing intermittent issues with Cilium FQDN policies.
Sometimes, affected worker nodes simply start dropping all FQDN egress traffic from pods, while service-to-service and host firewall traffic is still handled correctly.

Important notes:

  • We have many clusters configured exactly the same way, but only a few of them suffer from this issue; we have noticed that the workload patterns on the affected clusters are not very linear.
  • Most of the affected workloads use S3 FQDN policies.
  • When the issue happens, Cilium starts dropping ALL FQDN egress traffic, including traffic that is whitelisted by the relevant CNPs.
  • The issue is fixed immediately once we restart Cilium on the affected node, but once Cilium has entered that "broken" state, it never recovers on its own.
  • We have validated that we are not hitting BPF map limits.
  • When the issue happens, the nodes are not under significant load.
  • We tried bumping proxyResponseMaxDelay to 200ms, but it didn't really help.
  • cilium_ipcache_errors_total shows a lot of cannot_overwrite_by_source errors.
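The cannot_overwrite_by_source errors are visible on the agent's Prometheus endpoint (prometheus-serve-addr is :9962 in the config below). A minimal sketch of the check we run, filtering a canned sample the same way one would filter the live scrape — the counter values and exact label set shown are illustrative, not taken from a real node:

```shell
# On a live node the agent metrics can be scraped directly, e.g.:
#   kubectl -n kube-system exec ds/cilium -- sh -c \
#     'curl -s http://localhost:9962/metrics' | grep cilium_ipcache_errors_total
# The counter lines to watch carry error="cannot_overwrite_by_source".
# The sample below is illustrative output, filtered the same way:
sample='cilium_ipcache_errors_total{error="cannot_overwrite_by_source",type="ipcache"} 1423
cilium_ipcache_errors_total{error="invalid_prefix",type="ipcache"} 0'
matches=$(printf '%s\n' "$sample" | grep 'cannot_overwrite_by_source')
echo "$matches"
```

A steadily increasing counter with that error label, on exactly the nodes that are dropping FQDN traffic, is what pointed us at the ipcache in the first place.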

How can we reproduce the issue?

We don't have a reliable way to reproduce the bug.

Cilium Version

1.18.6

Kernel Version

6.8.0-1047-aws

Kubernetes Version

1.33.x

Regression

Occasionally, FQDN egress traffic starts being dropped for all FQDN endpoints on the affected node.

Sysdump

We can't provide sysdumps because they contain sensitive information.

Relevant log output

Relevant logs we observe when the issue appears:
Network status error received, restarting client connections
Timed out waiting for datapath updates of FQDN IP information; returning response. Consider increasing --tofqdns-proxy-response-max-delay if this keeps happening.
Timed out waiting for ipcache to allocate identities for prefixes while consuming policy updates. This may cause policy drops
Detected conflicting label for prefix. This may cause connectivity issues for this address.
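Since the config exports dropped flows to /var/log/cilium/hubble/events.log (hubble-export-allowlist keeps DROPPED/ERROR/AUDIT verdicts), a quick way to confirm the policy drops on an affected node is to filter that file. A minimal sketch, run here against a canned sample — the JSON lines below are illustrative and trimmed far below the configured fieldmask:

```shell
# On an affected node:
#   grep '"verdict":"DROPPED"' /var/log/cilium/hubble/events.log | tail
# Exported lines are one JSON flow per line; here we filter a canned sample:
sample='{"flow":{"verdict":"DROPPED","drop_reason_desc":"POLICY_DENIED","traffic_direction":"EGRESS"}}
{"flow":{"verdict":"FORWARDED","traffic_direction":"EGRESS"}}'
dropped=$(printf '%s\n' "$sample" | grep '"verdict":"DROPPED"')
echo "$dropped"
```

During an incident, every FQDN egress flow from the affected node shows up this way as a POLICY_DENIED egress drop, even though the destination is whitelisted by the relevant CNP.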

Anything else?

cilium config:
apiVersion: v1
data:
  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  annotate-k8s-node: "true"
  auto-create-cilium-node-resource: "true"
  auto-direct-node-routes: "false"
  aws-enable-prefix-delegation: "true"
  aws-release-excess-ips: "true"
  bpf-distributed-lru: "false"
  bpf-events-drop-enabled: "true"
  bpf-events-policy-verdict-enabled: "true"
  bpf-events-trace-enabled: "false"
  bpf-lb-acceleration: disabled
  bpf-lb-algorithm-annotation: "false"
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-mode-annotation: "false"
  bpf-lb-sock: "false"
  bpf-lb-source-range-all-types: "false"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-policy-stats-map-max: "65536"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "1"
  cluster-name: mlp-euc1-pe-main01
  clustermesh-enable-endpoint-sync: "false"
  clustermesh-enable-mcs-api: "false"
  cni-chaining-mode: portmap
  conntrack-gc-max-interval: 5m0s
  controller-group-metrics: write-cni-file sync-host-ips sync-lb-maps-with-k8s-services
  custom-cni-conf: "true"
  datapath-mode: veth
  debug: "false"
  debug-verbose: ""
  default-lb-service-ipam: lbipam
  devices: ens+ enp+
  direct-routing-skip-unreachable: "false"
  dnsproxy-socket-linger-timeout: "10"
  ec2-api-endpoint: ""
  egress-gateway-reconciliation-trigger-interval: 1s
  enable-auto-protect-node-port-range: "true"
  enable-bpf-clock-probe: "false"
  enable-endpoint-health-checking: "true"
  enable-endpoint-lockdown-on-policy-overflow: "false"
  enable-endpoint-routes: "true"
  enable-health-check-loadbalancer-ip: "false"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-host-firewall: "true"
  enable-host-legacy-routing: "true"
  enable-hubble: "true"
  enable-hubble-open-metrics: "false"
  enable-internal-traffic-policy: "true"
  enable-ipv4: "true"
  enable-ipv4-big-tcp: "false"
  enable-ipv4-masquerade: "false"
  enable-ipv6: "false"
  enable-ipv6-big-tcp: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-networkpolicy: "false"
  enable-l2-neigh-discovery: "false"
  enable-l7-proxy: "true"
  enable-lb-ipam: "true"
  enable-masquerade-to-route-source: "false"
  enable-metrics: "true"
  enable-node-port: "false"
  enable-node-selector-labels: "false"
  enable-non-default-deny-policies: "true"
  enable-policy: default
  enable-policy-secrets-sync: "true"
  enable-sctp: "false"
  enable-source-ip-verification: "true"
  enable-svc-source-range-check: "true"
  enable-tcx: "true"
  enable-vtep: "false"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  eni-tags: ''
  envoy-access-log-buffer-size: "4096"
  envoy-base-id: "0"
  envoy-keep-cap-netbindservice: "false"
  external-envoy-proxy: "false"
  health-check-icmp-failure-threshold: "3"
  http-retry-count: "3"
  http-stream-idle-timeout: "300"
  hubble-disable-tls: "false"
  hubble-drop-events: "true"
  hubble-drop-events-interval: 5m
  hubble-drop-events-reasons: auth_required policy_denied
  hubble-event-queue-size: "32768"
  hubble-export-allowlist: '{"verdict":["DROPPED","ERROR","AUDIT"]}'
  hubble-export-denylist: '{"event_type":[{"type":1,"match_sub_type":true,"sub_type":139}]}'
  hubble-export-fieldmask: time source.namespace source.pod_name source.identity source.cluster_name
    source_service destination.namespace destination.pod_name destination.identity
    destination.cluster_name destination_service traffic_direction l4 IP l7 Type node_name
    is_reply event_type verdict Summary drop_reason_desc egress_denied_by ingress_denied_by
  hubble-export-file-compress: "false"
  hubble-export-file-max-backups: "5"
  hubble-export-file-max-size-mb: "100"
  hubble-export-file-path: /var/log/cilium/hubble/events.log
  hubble-listen-address: :4244
  hubble-metrics: dns:query;ignoreAAAA;sourceContext=workload;destinationContext=workload;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
    drop:sourceContext=workload;destinationContext=workload|dns;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
    flow:sourceContext=workload;destinationContext=workload;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
    flows-to-world:port;sourceContext=workload|dns;destinationContext=workload|dns;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
  hubble-metrics-server: :9965
  hubble-metrics-server-enable-tls: "false"
  hubble-network-policy-correlation-enabled: "true"
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  identity-gc-interval: 15m0s
  identity-heartbeat-timeout: 30m0s
  identity-management-mode: agent
  install-no-conntrack-iptables-rules: "false"
  ipam: eni
  ipam-cilium-node-update-rate: 15s
  iptables-random-fully: "false"
  k8s-require-ipv4-pod-cidr: "false"
  k8s-require-ipv6-pod-cidr: "false"
  kube-proxy-replacement: "false"
  labels: k8s:mesh.test.com/inject k8s:node-role.kubernetes.io k8s:node.kubernetes.io
    k8s:k8s-app k8s:app k8s:app.kubernetes.io k8s:io.cilium.k8s.namespace.labels.app.kubernetes.io
    k8s:io.cilium.k8s.namespace.labels.kubernetes.io k8s:io.cilium.k8s.policy k8s:app
    k8s:component k8s:job-name k8s:cluster.x-k8s.io k8s:!io.cilium.k8s.namespace.labels
  log-opt: '{"format":"json"}'
  max-connected-clusters: "255"
  mesh-auth-enabled: "true"
  mesh-auth-gc-interval: 5m0s
  mesh-auth-queue-size: "1024"
  mesh-auth-rotated-identities-queue-size: "1024"
  metrics: -cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds
  metrics-sampling-interval: 5m
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  nat-map-stats-entries: "32"
  nat-map-stats-interval: 30s
  node-port-bind-protection: "true"
  nodeport-addresses: ""
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  operator-prometheus-serve-addr: :9963
  policy-audit-mode: "false"
  policy-cidr-match-mode: ""
  policy-default-local-cluster: "false"
  policy-secrets-namespace: cilium-secrets
  policy-secrets-only-from-secrets-namespace: "true"
  preallocate-bpf-maps: "false"
  procfs: /host/proc
  prometheus-serve-addr: :9962
  proxy-connect-timeout: "2"
  proxy-idle-timeout-seconds: "60"
  proxy-initial-fetch-timeout: "30"
  proxy-max-concurrent-retries: "128"
  proxy-max-connection-duration-seconds: "0"
  proxy-max-requests-per-connection: "0"
  proxy-prometheus-port: "9964"
  proxy-xff-num-trusted-hops-egress: "0"
  proxy-xff-num-trusted-hops-ingress: "0"
  read-cni-conf: /host/etc/cni/net.d/01-rke2-cilium.conflist
  remove-cilium-node-taints: "true"
  routing-mode: native
  service-no-backend-response: reject
  set-cilium-is-up-condition: "true"
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "1000"
  tofqdns-idle-connection-grace-period: 60s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-preallocate-identities: "true"
  tofqdns-proxy-response-max-delay: 200ms
  tunnel-protocol: vxlan
  tunnel-source-port-range: 0-0
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct


    Labels

      • area/proxy — Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
      • kind/bug — This is a bug in the Cilium logic.
      • kind/community-report — This was reported by a user in the Cilium community, eg via Slack.
