Cilium-operator podCIDR allocation causes traffic to be dropped in iptables FORWARD chain #11807

@jrajahalme

Description

Cilium-operator podCIDR allocation causes return traffic from the host proxy, accessed for example via a (k8s 1.17) NodePort and destined to another node, to be dropped in the iptables FORWARD chain.

This causes the CI to fail if the now-disabled test hitting this is enabled again (#11710).

Related: #11235
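
The drops can be seen directly in the FORWARD chain policy counters on the affected node (standard iptables usage, not from the original runs; the 40-packet/2400-byte count corresponds to the saved rules shown further below):

$ sudo iptables -vnL FORWARD | head -n 1
Chain FORWARD (policy DROP 40 packets, 2400 bytes)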

How to reproduce:

  1. Cache the vagrant box image off of master with the default k8s version:
$ git checkout master
$ cd test
$ ./vagrant-local-start.sh

Remove the resulting VMs k8s1-1.18 and k8s2-1.18 to save memory (one way is sketched below). The script will leave behind a box file in test/.vagrant/.
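
A minimal sketch for removing the leftover VMs, assuming k8s1-1.18 and k8s2-1.18 are the machine names vagrant reports (verify with vagrant status first):

$ cd test
$ vagrant status                          # list the machines the Vagrantfile defines
$ vagrant destroy -f k8s1-1.18 k8s2-1.18  # remove both VMs without prompting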

  2. Checkout commit 5132966 and run the CI with k8s 1.17 with --focus="K8s.*Tests.NodePort.with.L7.Policy" - this should pass:
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 203.253 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS

kubectl get pods (note that the pod IPs are within 10.10.0.0/16):

NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
default       test-k8s2-5b756fd6c5-s5qjr        2/2     Running   0          21s    10.10.1.86      k8s2   <none>           <none>
default       testclient-nl79c                  1/1     Running   0          21s    10.10.0.63      k8s1   <none>           <none>
default       testclient-zdgqg                  1/1     Running   0          21s    10.10.1.245     k8s2   <none>           <none>
default       testds-d6fxt                      2/2     Running   0          21s    10.10.1.248     k8s2   <none>           <none>
default       testds-l9p4p                      2/2     Running   0          21s    10.10.0.215     k8s1   <none>           <none>
kube-system   cilium-44qfh                      1/1     Running   0          81s    192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-operator-bb8f7cb95-9dfcb   1/1     Running   0          81s    192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-qjncl                      1/1     Running   0          81s    192.168.36.11   k8s1   <none>           <none>
kube-system   coredns-767d4c6dd7-5pvvp          1/1     Running   0          40s    10.10.1.77      k8s2   <none>           <none>
kube-system   etcd-k8s1                         1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-apiserver-k8s1               1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-controller-manager-k8s1      1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-ctpgk                  1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-cxpth                  1/1     Running   0          3m9s   192.168.36.12   k8s2   <none>           <none>
kube-system   kube-scheduler-k8s1               1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   log-gatherer-ds5bz                1/1     Running   0          91s    192.168.36.12   k8s2   <none>           <none>
kube-system   log-gatherer-twhzb                1/1     Running   0          91s    192.168.36.11   k8s1   <none>           <none>
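
In this passing run the pod IPs fall within 10.10.0.0/16. A quick way to see what pod CIDR each node was assigned by k8s (plain kubectl; a hypothetical check, not part of the original run):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'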
  3. Checkout commit 934053c and run the CI with k8s 1.17 with the same focus - this should fail (note that vagrant-local-start.sh deletes the existing VMs at the beginning, so the test starts from a clean slate):
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
[k8s1 host can not connect to service "http://192.168.36.12:32544" (failed in request 1/10)
Expected command: kubectl exec -n kube-system log-gatherer-x282s -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 8 http://192.168.36.12:32544 -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 28 
Stdout:
 	 time-> DNS: '0.000033()', Connect: '0.000000',Transfer '0.000000', total '5.002505'
Stderr:
 	 
]

kubectl get pods (note that the pod IPs are now within 10.0.0.0/16 instead):

NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
default       test-k8s2-5b756fd6c5-z6jfp        2/2     Running   0          107s    10.0.1.105      k8s2   <none>           <none>
default       testclient-b69cs                  1/1     Running   0          107s    10.0.1.56       k8s2   <none>           <none>
default       testclient-pb59z                  1/1     Running   0          107s    10.0.0.214      k8s1   <none>           <none>
default       testds-2ftnw                      2/2     Running   0          107s    10.0.1.229      k8s2   <none>           <none>
default       testds-zv6gv                      2/2     Running   0          107s    10.0.0.205      k8s1   <none>           <none>
kube-system   cilium-g45qs                      1/1     Running   0          2m54s   192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-operator-bb8f7cb95-8svzw   1/1     Running   0          2m54s   192.168.36.11   k8s1   <none>           <none>
kube-system   cilium-prntb                      1/1     Running   0          2m54s   192.168.36.11   k8s1   <none>           <none>
kube-system   coredns-767d4c6dd7-bc5bn          1/1     Running   0          2m3s    10.0.1.244      k8s2   <none>           <none>
kube-system   etcd-k8s1                         1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-apiserver-k8s1               1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-controller-manager-k8s1      1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-hm4bx                  1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-vxp5t                  1/1     Running   0          4m24s   192.168.36.12   k8s2   <none>           <none>
kube-system   kube-scheduler-k8s1               1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   log-gatherer-d4ss8                1/1     Running   0          3m4s    192.168.36.12   k8s2   <none>           <none>
kube-system   log-gatherer-x282s                1/1     Running   0          3m4s    192.168.36.11   k8s1   <none>           <none>

Comparing the saved iptables rules of the k8s2 node in the two cases reveals this:

FORWARD DROPs:

< :INPUT ACCEPT [10035:2583634]
< :FORWARD DROP [0:0]
< :OUTPUT ACCEPT [10081:2304007]
---
> :INPUT ACCEPT [27688:5897898]
> :FORWARD DROP [40:2400]
> :OUTPUT ACCEPT [27765:9353361]

No KUBE-FORWARD rule matches, as the source address is no longer within 10.10.0.0/16:

< [28:6580] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
---
> [0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Note: The packet and byte counts above are from a more limited test run, so they do not match the full run counts.
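
A convenient way to watch these counters while re-running the failing request (standard tooling, not from the original report):

vagrant@k8s2:~$ sudo watch -n1 'iptables-save -c -t filter | grep KUBE-FORWARD'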

To validate this, add new rules that match the actually allocated pod range:

vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Repeat for k8s1. After this the test traffic works:

$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 172.768 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS

Note that kube-proxy's periodic iptables sync will remove the manually added rules within a few minutes, so the validation may need to be repeated if that happens; a sketch for keeping the rules in place follows.
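
A minimal sketch to keep the source-side rule in place during a longer run, using iptables -C to test for the rule and re-inserting it when it has been flushed (a hypothetical helper, not part of the original validation; the loop manages its own comment-less rule):

vagrant@k8s2:~$ while true; do \
    sudo iptables -t filter -C KUBE-FORWARD -s 10.0.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT 2>/dev/null || \
    sudo iptables -t filter -I KUBE-FORWARD -s 10.0.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT; \
    sleep 30; done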

And the counters on the new rules are updated:

vagrant@k8s2:~$ sudo iptables-save -c | grep KUBE-FORWARD
:KUBE-FORWARD - [0:0]
[88:19442] -A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
[0:0] -A KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[80:18962] -A KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
[0:0] -A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
[0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -d 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Based on this, the problem seems to be that the cilium-operator podCIDR allocator may allocate pod CIDR ranges that are not within the k8s cluster CIDR, possibly depending on the k8s version (the affected test has been seen to succeed in CI with k8s 1.18). One way to compare the two ranges is sketched below.
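
A sketch for comparing the configured cluster CIDR against the ranges cilium-operator handed out; the spec.ipam.podCIDRs field name on the CiliumNode CRD is an assumption based on recent Cilium versions:

# Cluster CIDR as configured on the controller manager:
$ kubectl -n kube-system get pod kube-controller-manager-k8s1 -o yaml | grep cluster-cidr

# Pod CIDRs allocated per node, assuming the CiliumNode CRD exposes them
# under spec.ipam.podCIDRs:
$ kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'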

Labels: area/datapath, area/k8s, area/proxy, ci/flake, kind/bug, release-note/misc
