Inconsistent behavior related to VRRP packets drops #18347
Labels
- area/datapath: Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
- area/host-firewall: Impacts the host firewall or the host endpoint.
- kind/bug: This is a bug in the Cilium logic.
- kind/community-report: This was reported by a user in the Cilium community, eg via Slack.
- pinned: These issues are not marked stale by our issue bot.
Description
Is there an existing issue for this?
- I have searched the existing issues
What happened?
I'm running keepalived on two of my hosts, and both instances are reporting a VRRP timeout.
The VIP is 192.168.33.200. Running ip addr | grep 200 on the two nodes shows the VIP assigned on both at once:
node3 | CHANGED | rc=0 >>
inet 192.168.33.200/32 scope global eth1
node2 | CHANGED | rc=0 >>
inet 192.168.33.200/32 scope global eth1
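For reference, a minimal keepalived configuration matching the addresses above might look like the following. This is a hypothetical reconstruction (the actual config is not included in the report); the router ID, priority, and advertisement interval are assumptions, while the interface and VIP come from the output above:

```
vrrp_instance VI_1 {
    interface eth1              # NIC carrying 192.168.33.0/24 (from the report)
    virtual_router_id 51        # assumption; any ID agreed between the peers
    priority 100                # assumption; higher on the intended MASTER
    advert_int 1                # assumption; advertisement every second
    virtual_ipaddress {
        192.168.33.200/32       # the VIP from the report
    }
}
```

VRRP advertisements are sent as IP protocol 112 to multicast 224.0.0.18. If a host firewall drops them, each peer times out waiting for a MASTER advertisement and claims the VIP itself, which matches the split ownership shown above.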
Looking at hubble observe -t drop I see the following:
Dec 30 21:01:55.186: 192.168.33.20 <> 192.168.33.30 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:55.808: 192.168.33.30 <> 192.168.33.20 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:56.186: 192.168.33.20 <> 192.168.33.30 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:56.415: fe:54:00:7f:be:9e <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:56.472: fe:54:00:34:d6:c7 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:56.702: fe:54:00:e3:4a:21 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:56.808: 192.168.33.30 <> 192.168.33.20 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:56.829: fe:54:00:4b:40:45 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:56.861: fe:54:00:1e:82:f6 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:57.140: fe:54:00:2c:b7:f6 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:57.186: 192.168.33.20 <> 192.168.33.30 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:57.808: 192.168.33.30 <> 192.168.33.20 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:58.186: 192.168.33.20 <> 192.168.33.30 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:58.456: fe:54:00:34:d6:c7 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:58.686: fe:54:00:e3:4a:21 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
Dec 30 21:01:58.808: 192.168.33.30 <> 192.168.33.20 CT: Unknown L4 protocol DROPPED (IPv4)
Dec 30 21:01:59.124: fe:54:00:2c:b7:f6 <> 01:80:c2:00:00:00 Unsupported L2 protocol DROPPED (Ethernet)
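The two drop reasons are consistent with each other: VRRP advertisements use IP protocol 112, which is not one of the L4 protocols a connection tracker typically handles (TCP, UDP, ICMP), and 01:80:c2:00:00:00 is the IEEE 802.1D bridge-group multicast address used by STP BPDUs, which are not IPv4/IPv6/ARP frames. A simplified sketch of dispatch logic that would produce exactly these verdicts (illustrative only; this is not Cilium's actual bpf code):

```python
# Illustrative sketch of why VRRP and STP traffic hit these two drop
# reasons. It mirrors the reported verdicts, not Cilium's implementation.

TRACKED_L4 = {1: "ICMP", 6: "TCP", 17: "UDP"}  # protocols this sketch tracks

def classify_ipv4(l4_proto: int) -> str:
    """Verdict for an IPv4 packet, keyed on the L4 protocol number."""
    if l4_proto in TRACKED_L4:
        return "FORWARD"
    # VRRP is IP protocol 112, so it falls through to this branch.
    return "DROP: CT: Unknown L4 protocol"

def classify_frame(ethertype: int) -> str:
    """Verdict for an Ethernet frame, keyed on the EtherType."""
    if ethertype in (0x0800, 0x86DD, 0x0806):  # IPv4, IPv6, ARP
        return "FORWARD"
    # STP BPDUs to 01:80:c2:00:00:00 are 802.3 LLC frames, not IP/ARP.
    return "DROP: Unsupported L2 protocol"

print(classify_ipv4(112))      # the VRRP drops in the log above
print(classify_frame(0x0026))  # an 802.3 length field, not a known EtherType
```

Under this model the host firewall path has no conntrack entry type for protocol 112, so every VRRP advertisement between the two nodes is dropped.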
JSON output:
{"time":"2021-12-30T21:06:17.103685380Z","verdict":"DROPPED","drop_reason":166,"ethernet":{"source":"fe:54:00:2c:b7:f6","destination":"01:80:c2:00:00:00"},"source":{"identity":1,"labels":["reserved:host"]},"destination":{"identity":2,"labels":["reserved:world"]},"Type":"L3_L4","node_name":"node2","event_type":{"type":1,"sub_type":166},"traffic_direction":"INGRESS","drop_reason_desc":"UNSUPPORTED_L2_PROTOCOL","Summary":"Ethernet"}
{"time":"2021-12-30T21:06:17.210109754Z","verdict":"DROPPED","drop_reason":137,"ethernet":{"source":"52:54:00:2c:b7:f6","destination":"52:54:00:4b:40:45"},"IP":{"source":"192.168.33.20","destination":"192.168.33.30","ipVersion":"IPv4"},"source":{"identity":1,"labels":["reserved:host"]},"destination":{"identity":6,"labels":["reserved:remote-node"]},"Type":"L3_L4","node_name":"node2","event_type":{"type":1,"sub_type":137},"traffic_direction":"INGRESS","drop_reason_desc":"CT_UNKNOWN_L4_PROTOCOL","Summary":"IPv4"}
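In these records the numeric drop_reason codes pair with the drop_reason_desc strings (137 = CT_UNKNOWN_L4_PROTOCOL, 166 = UNSUPPORTED_L2_PROTOCOL). A small helper to tally drops from hubble observe -o json output, using the field names exactly as they appear in the records above (the sample records are trimmed to the relevant fields):

```python
import json
from collections import Counter

def tally_drops(lines):
    """Count DROPPED flows from `hubble observe -o json` by drop reason."""
    counts = Counter()
    for line in lines:
        flow = json.loads(line)
        if flow.get("verdict") == "DROPPED":
            counts[(flow["drop_reason"], flow["drop_reason_desc"])] += 1
    return counts

# Trimmed versions of the two JSON records above:
sample = [
    '{"verdict":"DROPPED","drop_reason":166,"drop_reason_desc":"UNSUPPORTED_L2_PROTOCOL"}',
    '{"verdict":"DROPPED","drop_reason":137,"drop_reason_desc":"CT_UNKNOWN_L4_PROTOCOL"}',
]
print(tally_drops(sample))
```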
The two nodes IPs are 192.168.33.20 and 192.168.33.30.
I've tested with both 1.11 and 1.10 on otherwise identical machines (only the Cilium version differs), and the issue only occurs on 1.11.
Cilium Version
Defaulted container "cilium-agent" out of: cilium-agent, clean-cilium-state (init)
Client: 1.11.0 27e0848 2021-12-05T15:34:41-08:00 go version go1.17.3 linux/amd64
Daemon: 1.11.0 27e0848 2021-12-05T15:34:41-08:00 go version go1.17.3 linux/amd64
Kernel Version
Linux node1 4.18.0-348.7.1.el8_5.x86_64 #1 SMP Tue Dec 21 19:02:23 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:09:57Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
Relevant log output
No response
Anything else?
cilium status --verbose:
KVStore: Ok Disabled
Kubernetes: Ok 1.23 (v1.23.0) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumClusterwideNetworkPolicy", "cilium/v2::CiliumEndpoint", "cilium/v2::CiliumNetworkPolicy", "cilium/v2::CiliumNode", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth1 192.168.33.20 (Direct Routing)]
Host firewall: Enabled [eth1]
Cilium: Ok 1.11.0 (v1.11.0-27e0848)
NodeMonitor: Listening for events on 2 CPUs with 64x4096 of shared memory
Cilium health daemon: Ok
IPAM: IPv4: 3/254 allocated from 172.16.1.0/24,
Allocated addresses:
172.16.1.134 (router)
172.16.1.196 (ingress-nginx/ingress-nginx-controller-njn5d)
172.16.1.26 (health)
BandwidthManager: Disabled
Host Routing: Legacy
Masquerading: IPTables [IPv4: Enabled, IPv6: Disabled]
Clock Source for BPF: ktime
Controller Status: 24/24 healthy
Name Last success Last error Count Message
cilium-health-ep 16s ago never 0 no error
dns-garbage-collector-job 27s ago never 0 no error
endpoint-2534-regeneration-recovery never never 0 no error
endpoint-3923-regeneration-recovery never never 0 no error
endpoint-4044-regeneration-recovery never never 0 no error
endpoint-gc 27s ago never 0 no error
ipcache-inject-labels 2h35m8s ago 2h35m25s ago 0 no error
k8s-heartbeat 26s ago never 0 no error
mark-k8s-node-as-available 2h35m17s ago never 0 no error
metricsmap-bpf-prom-sync 5s ago never 0 no error
neighbor-table-refresh 17s ago never 0 no error
resolve-identity-2534 16s ago never 0 no error
resolve-identity-3923 1m33s ago never 0 no error
resolve-identity-4044 17s ago never 0 no error
sync-endpoints-and-host-ips 17s ago never 0 no error
sync-lb-maps-with-k8s-services 2h35m17s ago never 0 no error
sync-policymap-2534 14s ago never 0 no error
sync-policymap-3923 14s ago never 0 no error
sync-policymap-4044 14s ago never 0 no error
sync-to-k8s-ciliumendpoint (2534) 6s ago never 0 no error
sync-to-k8s-ciliumendpoint (3923) 3s ago never 0 no error
sync-to-k8s-ciliumendpoint (4044) 7s ago never 0 no error
template-dir-watcher never never 0 no error
update-k8s-node-annotations 2h35m25s ago never 0 no error
Proxy Status: OK, ip 172.16.1.134, 0 redirects active on ports 10000-20000
Hubble: Ok Current/Max Flows: 4095/4095 (100.00%), Flows/s: 3.61 Metrics: Disabled
KubeProxyReplacement Details:
Status: Strict
Socket LB Protocols: TCP, UDP
Devices: eth1 192.168.33.20 (Direct Routing)
Mode: SNAT
Backend Selection: Random
Session Affinity: Enabled
Graceful Termination: Enabled
XDP Acceleration: Disabled
Services:
- ClusterIP: Enabled
- NodePort: Enabled (Range: 30000-32767)
- LoadBalancer: Enabled
- externalIPs: Enabled
- HostPort: Enabled
BPF Maps: dynamic sizing: on (ratio: 0.002500)
Name Size
Non-TCP connection tracking 65536
TCP connection tracking 131072
Endpoint policy 65535
Events 2
IP cache 512000
IP masquerading agent 16384
IPv4 fragmentation 8192
IPv4 service 65536
IPv6 service 65536
IPv4 service backend 65536
IPv6 service backend 65536
IPv4 service reverse NAT 65536
IPv6 service reverse NAT 65536
Metrics 1024
NAT 131072
Neighbor table 131072
Global policy 16384
Per endpoint policy 65536
Session affinity 65536
Signal 2
Sockmap 65535
Sock reverse NAT 65536
Tunnel 65536
Encryption: Disabled
Cluster health: 6/6 reachable (2021-12-30T21:15:14Z)
Name IP Node Endpoints
node2 (localhost) 192.168.33.20 reachable reachable
node1 192.168.33.10 reachable reachable
node3 192.168.33.30 reachable reachable
node4 192.168.33.40 reachable reachable
node5 192.168.33.50 reachable reachable
node6 192.168.33.60 reachable reachable
I'm running Rocky Linux 8.5.
My Helm overrides:
values:
  hostPort:
    enabled: true
  hostServices:
    enabled: true
  containerRuntime:
    integration: crio
  hostFirewall:
    enabled: true
  hubble:
    relay:
      enabled: true
    ui:
      enabled: true
      ingress:
        enabled: true
        className: "{{ ingress_class }}"
        hosts: ["{{ ingress_hosts['hubble'] }}"]
Code of Conduct
- I agree to follow this project's Code of Conduct