Yet another issue with unexpected fqdn egress dropped traffic #44714
Closed
Labels
area/proxy: Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.
kind/bug: This is a bug in the Cilium logic.
kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack.
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.18.7 and lower than v1.19.0
What happened?
We are experiencing intermittent issues with Cilium FQDN policies.
Sometimes, affected worker nodes just start dropping all FQDN egress traffic from their pods, while service-to-service and host firewall traffic is still handled properly.
Important notes:
- we have many clusters configured exactly the same, but only a few of them suffer from these issues; we have noticed that the workloads on these affected clusters are not very linear
- most of the affected workloads use S3 FQDN policies (a sketch of such a policy is shown after these notes)
- when the issue happens, Cilium starts dropping ALL FQDN egress traffic that is whitelisted by the relevant CNPs
- the issue is fixed immediately once we restart Cilium on the affected node, but once Cilium has entered that "broken" state it never recovers on its own
- we have validated that we are not hitting BPF map limits
- when the issue happens, the nodes are not under significant load
- we have tried to bump proxyResponseMaxDelay to 200ms, but it didn't really help
- cilium_ipcache_errors_total shows a lot of cannot_overwrite_by_source errors
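For reference, a minimal sketch of the kind of toFQDNs egress policy applied to the affected workloads; the name, namespace, selector and FQDN pattern below are illustrative placeholders, not our real values:
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-s3-egress        # illustrative name
  namespace: example-app       # illustrative namespace
spec:
  endpointSelector:
    matchLabels:
      app: example-app         # illustrative selector
  egress:
  # allow DNS to kube-dns and enable DNS visibility so toFQDNs can resolve names
  - toEndpoints:
    - matchLabels:
        k8s:io.kubernetes.pod.namespace: kube-system
        k8s-app: kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # allow HTTPS egress to the S3 endpoints resolved above
  - toFQDNs:
    - matchPattern: "*.s3.eu-central-1.amazonaws.com"   # illustrative pattern
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP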
How can we reproduce the issue?
We don't have a reliable way to reproduce the bug.
Cilium Version
1.18.6
Kernel Version
6.8.0-1047-aws
Kubernetes Version
1.33.x
Regression
Occasionally, FQDN egress traffic starts being dropped for all FQDN endpoints on the affected node.
Sysdump
We can't provide sysdumps due to sensitive information.
Relevant log output
Relevant logs we observe when the issue appears:
Network status error received, restarting client connections
Timed out waiting for datapath updates of FQDN IP information; returning response. Consider increasing --tofqdns-proxy-response-max-delay if this keeps happening.
Timed out waiting for ipcache to allocate identities for prefixes while consuming policy updates. This may cause policy drops
Detected conflicting label for prefix. This may cause connectivity issues for this address.
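The second message is what prompted the delay bump mentioned above. For completeness, this is roughly how it was raised via Helm values, assuming the dnsProxy.proxyResponseMaxDelay chart value, which is what ends up as tofqdns-proxy-response-max-delay in the ConfigMap below:
dnsProxy:
  # raised to 200ms; rendered as tofqdns-proxy-response-max-delay in the cilium-config ConfigMap
  proxyResponseMaxDelay: 200ms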
Anything else?
cilium config:
apiVersion: v1
data:
agent-not-ready-taint-key: node.cilium.io/agent-not-ready
annotate-k8s-node: "true"
auto-create-cilium-node-resource: "true"
auto-direct-node-routes: "false"
aws-enable-prefix-delegation: "true"
aws-release-excess-ips: "true"
bpf-distributed-lru: "false"
bpf-events-drop-enabled: "true"
bpf-events-policy-verdict-enabled: "true"
bpf-events-trace-enabled: "false"
bpf-lb-acceleration: disabled
bpf-lb-algorithm-annotation: "false"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-lb-mode-annotation: "false"
bpf-lb-sock: "false"
bpf-lb-source-range-all-types: "false"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-policy-stats-map-max: "65536"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "1"
cluster-name: mlp-euc1-pe-main01
clustermesh-enable-endpoint-sync: "false"
clustermesh-enable-mcs-api: "false"
cni-chaining-mode: portmap
conntrack-gc-max-interval: 5m0s
controller-group-metrics: write-cni-file sync-host-ips sync-lb-maps-with-k8s-services
custom-cni-conf: "true"
datapath-mode: veth
debug: "false"
debug-verbose: ""
default-lb-service-ipam: lbipam
devices: ens+ enp+
direct-routing-skip-unreachable: "false"
dnsproxy-socket-linger-timeout: "10"
ec2-api-endpoint: ""
egress-gateway-reconciliation-trigger-interval: 1s
enable-auto-protect-node-port-range: "true"
enable-bpf-clock-probe: "false"
enable-endpoint-health-checking: "true"
enable-endpoint-lockdown-on-policy-overflow: "false"
enable-endpoint-routes: "true"
enable-health-check-loadbalancer-ip: "false"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-firewall: "true"
enable-host-legacy-routing: "true"
enable-hubble: "true"
enable-hubble-open-metrics: "false"
enable-internal-traffic-policy: "true"
enable-ipv4: "true"
enable-ipv4-big-tcp: "false"
enable-ipv4-masquerade: "false"
enable-ipv6: "false"
enable-ipv6-big-tcp: "false"
enable-ipv6-masquerade: "true"
enable-k8s-networkpolicy: "false"
enable-l2-neigh-discovery: "false"
enable-l7-proxy: "true"
enable-lb-ipam: "true"
enable-masquerade-to-route-source: "false"
enable-metrics: "true"
enable-node-port: "false"
enable-node-selector-labels: "false"
enable-non-default-deny-policies: "true"
enable-policy: default
enable-policy-secrets-sync: "true"
enable-sctp: "false"
enable-source-ip-verification: "true"
enable-svc-source-range-check: "true"
enable-tcx: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
eni-tags: ''
envoy-access-log-buffer-size: "4096"
envoy-base-id: "0"
envoy-keep-cap-netbindservice: "false"
external-envoy-proxy: "false"
health-check-icmp-failure-threshold: "3"
http-retry-count: "3"
http-stream-idle-timeout: "300"
hubble-disable-tls: "false"
hubble-drop-events: "true"
hubble-drop-events-interval: 5m
hubble-drop-events-reasons: auth_required policy_denied
hubble-event-queue-size: "32768"
hubble-export-allowlist: '{"verdict":["DROPPED","ERROR","AUDIT"]}'
hubble-export-denylist: '{"event_type":[{"type":1,"match_sub_type":true,"sub_type":139}]}'
hubble-export-fieldmask: time source.namespace source.pod_name source.identity source.cluster_name
source_service destination.namespace destination.pod_name destination.identity
destination.cluster_name destination_service traffic_direction l4 IP l7 Type node_name
is_reply event_type verdict Summary drop_reason_desc egress_denied_by ingress_denied_by
hubble-export-file-compress: "false"
hubble-export-file-max-backups: "5"
hubble-export-file-max-size-mb: "100"
hubble-export-file-path: /var/log/cilium/hubble/events.log
hubble-listen-address: :4244
hubble-metrics: dns:query;ignoreAAAA;sourceContext=workload;destinationContext=workload;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
drop:sourceContext=workload;destinationContext=workload|dns;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
flow:sourceContext=workload;destinationContext=workload;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
flows-to-world:port;sourceContext=workload|dns;destinationContext=workload|dns;labelsContext=source_namespace,source_workload,destination_namespace,destination_workload,traffic_direction
hubble-metrics-server: :9965
hubble-metrics-server-enable-tls: "false"
hubble-network-policy-correlation-enabled: "true"
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
identity-gc-interval: 15m0s
identity-heartbeat-timeout: 30m0s
identity-management-mode: agent
install-no-conntrack-iptables-rules: "false"
ipam: eni
ipam-cilium-node-update-rate: 15s
iptables-random-fully: "false"
k8s-require-ipv4-pod-cidr: "false"
k8s-require-ipv6-pod-cidr: "false"
kube-proxy-replacement: "false"
labels: k8s:mesh.test.com/inject k8s:node-role.kubernetes.io k8s:node.kubernetes.io
k8s:k8s-app k8s:app k8s:app.kubernetes.io k8s:io.cilium.k8s.namespace.labels.app.kubernetes.io
k8s:io.cilium.k8s.namespace.labels.kubernetes.io k8s:io.cilium.k8s.policy k8s:app
k8s:component k8s:job-name k8s:cluster.x-k8s.io k8s:!io.cilium.k8s.namespace.labels
log-opt: '{"format":"json"}'
max-connected-clusters: "255"
mesh-auth-enabled: "true"
mesh-auth-gc-interval: 5m0s
mesh-auth-queue-size: "1024"
mesh-auth-rotated-identities-queue-size: "1024"
metrics: -cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds
metrics-sampling-interval: 5m
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
nat-map-stats-entries: "32"
nat-map-stats-interval: 30s
node-port-bind-protection: "true"
nodeport-addresses: ""
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
operator-prometheus-serve-addr: :9963
policy-audit-mode: "false"
policy-cidr-match-mode: ""
policy-default-local-cluster: "false"
policy-secrets-namespace: cilium-secrets
policy-secrets-only-from-secrets-namespace: "true"
preallocate-bpf-maps: "false"
procfs: /host/proc
prometheus-serve-addr: :9962
proxy-connect-timeout: "2"
proxy-idle-timeout-seconds: "60"
proxy-initial-fetch-timeout: "30"
proxy-max-concurrent-retries: "128"
proxy-max-connection-duration-seconds: "0"
proxy-max-requests-per-connection: "0"
proxy-prometheus-port: "9964"
proxy-xff-num-trusted-hops-egress: "0"
proxy-xff-num-trusted-hops-ingress: "0"
read-cni-conf: /host/etc/cni/net.d/01-rke2-cilium.conflist
remove-cilium-node-taints: "true"
routing-mode: native
service-no-backend-response: reject
set-cilium-is-up-condition: "true"
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "1000"
tofqdns-idle-connection-grace-period: 60s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-preallocate-identities: "true"
tofqdns-proxy-response-max-delay: 200ms
tunnel-protocol: vxlan
tunnel-source-port-range: 0-0
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""
kind: ConfigMap
metadata:
name: cilium-config
namespace: kube-system
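To keep an eye on the cilium_ipcache_errors_total counter mentioned above, something like the following PrometheusRule sketch can be used; the name, namespace, threshold and duration are illustrative, and it aggregates over the whole counter rather than filtering on a specific error label, since label names may vary by Cilium version:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cilium-ipcache-errors   # illustrative name
  namespace: monitoring         # illustrative namespace
spec:
  groups:
  - name: cilium-ipcache
    rules:
    - alert: CiliumIpcacheErrors
      # sustained growth of the ipcache error counter on any agent
      expr: sum by (instance) (rate(cilium_ipcache_errors_total[5m])) > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Cilium agent on {{ $labels.instance }} is reporting ipcache errors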
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct