Cilium-operator podCIDR allocation causes return traffic from the host proxy (accessed for example via a NodePort on k8s 1.17) destined to another node to be dropped in the iptables FORWARD chain.
This causes CI to fail if the now-disabled test hitting this is enabled again (#11710).
Related: #11235
How to reproduce:
- Cache a vagrant box image off of master with the default k8s version:
$ git checkout master
$ cd test
$ ./vagrant-local-start.sh
Remove the resulting VMs k8s1-1.18 and k8s2-1.18 to save memory. The script will leave behind a box file in test/.vagrant/.
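One way to remove them (a sketch; assuming the names above match vagrant's machine names, which vagrant status will confirm):
$ cd test
$ vagrant destroy -f k8s1-1.18 k8s2-1.18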
- Check out commit 5132966 and run the CI with k8s 1.17 and
--focus="K8s.*Tests.NodePort.with.L7.Policy" - should work:
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 203.253 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS
kubectl get pods:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default test-k8s2-5b756fd6c5-s5qjr 2/2 Running 0 21s 10.10.1.86 k8s2 <none> <none>
default testclient-nl79c 1/1 Running 0 21s 10.10.0.63 k8s1 <none> <none>
default testclient-zdgqg 1/1 Running 0 21s 10.10.1.245 k8s2 <none> <none>
default testds-d6fxt 2/2 Running 0 21s 10.10.1.248 k8s2 <none> <none>
default testds-l9p4p 2/2 Running 0 21s 10.10.0.215 k8s1 <none> <none>
kube-system cilium-44qfh 1/1 Running 0 81s 192.168.36.12 k8s2 <none> <none>
kube-system cilium-operator-bb8f7cb95-9dfcb 1/1 Running 0 81s 192.168.36.12 k8s2 <none> <none>
kube-system cilium-qjncl 1/1 Running 0 81s 192.168.36.11 k8s1 <none> <none>
kube-system coredns-767d4c6dd7-5pvvp 1/1 Running 0 40s 10.10.1.77 k8s2 <none> <none>
kube-system etcd-k8s1 1/1 Running 0 13m 192.168.36.11 k8s1 <none> <none>
kube-system kube-apiserver-k8s1 1/1 Running 0 13m 192.168.36.11 k8s1 <none> <none>
kube-system kube-controller-manager-k8s1 1/1 Running 0 13m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-ctpgk 1/1 Running 0 13m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-cxpth 1/1 Running 0 3m9s 192.168.36.12 k8s2 <none> <none>
kube-system kube-scheduler-k8s1 1/1 Running 0 13m 192.168.36.11 k8s1 <none> <none>
kube-system log-gatherer-ds5bz 1/1 Running 0 91s 192.168.36.12 k8s2 <none> <none>
kube-system log-gatherer-twhzb 1/1 Running 0 91s 192.168.36.11 k8s1 <none> <none>
- Check out commit 934053c and run the CI with k8s 1.17 and the same focus - should fail (note that
vagrant-local-start.sh deletes the existing VMs at the beginning, so the test starts from a clean slate):
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
[k8s1 host can not connect to service "http://192.168.36.12:32544" (failed in request 1/10)
Expected command: kubectl exec -n kube-system log-gatherer-x282s -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 8 http://192.168.36.12:32544 -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'"
To succeed, but it failed:
Exitcode: 28
Stdout:
time-> DNS: '0.000033()', Connect: '0.000000',Transfer '0.000000', total '5.002505'
Stderr:
]
kubectl get pods:
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
default test-k8s2-5b756fd6c5-z6jfp 2/2 Running 0 107s 10.0.1.105 k8s2 <none> <none>
default testclient-b69cs 1/1 Running 0 107s 10.0.1.56 k8s2 <none> <none>
default testclient-pb59z 1/1 Running 0 107s 10.0.0.214 k8s1 <none> <none>
default testds-2ftnw 2/2 Running 0 107s 10.0.1.229 k8s2 <none> <none>
default testds-zv6gv 2/2 Running 0 107s 10.0.0.205 k8s1 <none> <none>
kube-system cilium-g45qs 1/1 Running 0 2m54s 192.168.36.12 k8s2 <none> <none>
kube-system cilium-operator-bb8f7cb95-8svzw 1/1 Running 0 2m54s 192.168.36.11 k8s1 <none> <none>
kube-system cilium-prntb 1/1 Running 0 2m54s 192.168.36.11 k8s1 <none> <none>
kube-system coredns-767d4c6dd7-bc5bn 1/1 Running 0 2m3s 10.0.1.244 k8s2 <none> <none>
kube-system etcd-k8s1 1/1 Running 0 17m 192.168.36.11 k8s1 <none> <none>
kube-system kube-apiserver-k8s1 1/1 Running 0 17m 192.168.36.11 k8s1 <none> <none>
kube-system kube-controller-manager-k8s1 1/1 Running 0 17m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-hm4bx 1/1 Running 0 17m 192.168.36.11 k8s1 <none> <none>
kube-system kube-proxy-vxp5t 1/1 Running 0 4m24s 192.168.36.12 k8s2 <none> <none>
kube-system kube-scheduler-k8s1 1/1 Running 0 17m 192.168.36.11 k8s1 <none> <none>
kube-system log-gatherer-d4ss8 1/1 Running 0 3m4s 192.168.36.12 k8s2 <none> <none>
kube-system log-gatherer-x282s 1/1 Running 0 3m4s 192.168.36.11 k8s1 <none> <none>
Comparing the saved iptables rules of the k8s2 node in the two cases reveals this:
FORWARD DROPs:
< :INPUT ACCEPT [10035:2583634]
< :FORWARD DROP [0:0]
< :OUTPUT ACCEPT [10081:2304007]
---
> :INPUT ACCEPT [27688:5897898]
> :FORWARD DROP [40:2400]
> :OUTPUT ACCEPT [27765:9353361]
No KUBE-FORWARD rule matches, as the source address is no longer within 10.10.0.0/16:
< [28:6580] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
---
> [0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Note: The packet and byte counts above are from a more limited test run, so they do not match the full run counts.
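For reference, the diffs above can be produced by saving the filter table on k8s2 after each run and diffing the two dumps; the growing FORWARD drop counter can also be watched live while the curl is timing out (a minimal sketch, file names are arbitrary):
vagrant@k8s2:~$ sudo iptables-save -c -t filter > /tmp/iptables-k8s2.txt   # copy this off the VM after each run
$ diff iptables-k8s2-5132966.txt iptables-k8s2-934053c.txt
vagrant@k8s2:~$ sudo iptables-save -c -t filter | grep -E '^:FORWARD|KUBE-FORWARD'   # drop counter keeps increasing in the failing case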
To validate this, add new rules that match the actual cluster range:
vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Repeat for k8s1. After this the test traffic works:
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 172.768 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS
Note that kube-proxy removes the manually added rules within a few minutes, so if that happens the validation may need to be repeated.
And the counters on the new rules are updated:
vagrant@k8s2:~$ sudo iptables-save -c | grep KUBE-FORWARD
:KUBE-FORWARD - [0:0]
[88:19442] -A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
[0:0] -A KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[80:18962] -A KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
[0:0] -A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
[0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -d 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
Based on this, the problem seems to be that the cilium-operator podCIDR allocator may allocate pod CIDR ranges that are not within the k8s cluster CIDR, possibly depending on the k8s version (the affected test has been seen succeeding in CI with k8s 1.18).
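A quick way to check for such a mismatch on the failing cluster (a sketch; whether cilium-operator publishes its allocations in CiliumNode, and under which field, depends on the Cilium version and IPAM mode):
# Cluster CIDR configured for the k8s control plane (expected to be 10.10.0.0/16 here, matching the KUBE-FORWARD rules above):
$ kubectl -n kube-system get pod kube-controller-manager-k8s1 -o yaml | grep cluster-cidr
# Pod CIDR assigned to each node by k8s:
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
# Pod CIDRs cilium-operator actually handed out (field path is an assumption):
$ kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'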