Cilium-operator podCIDR allocation causes traffic to be dropped in iptables FORWARD chain #11807

@jrajahalme

Description

Cilium-operator podCIDR allocation causes return traffic from the host proxy, accessed for example via a (k8s 1.17) NodePort and destined to another node, to be dropped in the iptables FORWARD chain.

This causes the CI to fail if the now-disabled test hitting this is enabled again (#11710).

Related: #11235
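
The drops can be seen directly in the FORWARD chain policy counters on the affected node (standard iptables usage, not from the original runs; the 40-packet/2400-byte count corresponds to the saved rules shown further below):

$ sudo iptables -vnL FORWARD | head -n 1
Chain FORWARD (policy DROP 40 packets, 2400 bytes)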

How to reproduce:

  1. Cache the vagrant box image off of master with the default k8s version:
$ git checkout master
$ cd test
$ ./vagrant-local-start.sh

Remove the resulting VMs k8s1-1.18 and k8s2-1.18 to save memory (one way is sketched below). The script will leave behind a box file in test/.vagrant/.
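
A minimal sketch for removing the leftover VMs, assuming k8s1-1.18 and k8s2-1.18 are the machine names vagrant reports (verify with vagrant status first):

$ cd test
$ vagrant status                          # list the machines the Vagrantfile defines
$ vagrant destroy -f k8s1-1.18 k8s2-1.18  # remove both VMs without prompting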

  2. Checkout commit 5132966 and run the CI with k8s 1.17 with --focus="K8s.*Tests.NodePort.with.L7.Policy" - this should pass:
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 203.253 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS

kubectl get pods (note that the pod IPs are within 10.10.0.0/16):

NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE    IP              NODE   NOMINATED NODE   READINESS GATES
default       test-k8s2-5b756fd6c5-s5qjr        2/2     Running   0          21s    10.10.1.86      k8s2   <none>           <none>
default       testclient-nl79c                  1/1     Running   0          21s    10.10.0.63      k8s1   <none>           <none>
default       testclient-zdgqg                  1/1     Running   0          21s    10.10.1.245     k8s2   <none>           <none>
default       testds-d6fxt                      2/2     Running   0          21s    10.10.1.248     k8s2   <none>           <none>
default       testds-l9p4p                      2/2     Running   0          21s    10.10.0.215     k8s1   <none>           <none>
kube-system   cilium-44qfh                      1/1     Running   0          81s    192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-operator-bb8f7cb95-9dfcb   1/1     Running   0          81s    192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-qjncl                      1/1     Running   0          81s    192.168.36.11   k8s1   <none>           <none>
kube-system   coredns-767d4c6dd7-5pvvp          1/1     Running   0          40s    10.10.1.77      k8s2   <none>           <none>
kube-system   etcd-k8s1                         1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-apiserver-k8s1               1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-controller-manager-k8s1      1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-ctpgk                  1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-cxpth                  1/1     Running   0          3m9s   192.168.36.12   k8s2   <none>           <none>
kube-system   kube-scheduler-k8s1               1/1     Running   0          13m    192.168.36.11   k8s1   <none>           <none>
kube-system   log-gatherer-ds5bz                1/1     Running   0          91s    192.168.36.12   k8s2   <none>           <none>
kube-system   log-gatherer-twhzb                1/1     Running   0          91s    192.168.36.11   k8s1   <none>           <none>
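
In this passing run the pod IPs fall within 10.10.0.0/16. A quick way to see what pod CIDR each node was assigned by k8s (plain kubectl; a hypothetical check, not part of the original run):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'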
  3. Checkout commit 934053c and run the CI with k8s 1.17 with the same focus - this should fail (note that vagrant-local-start.sh deletes the existing VMs at the beginning, so the test starts from a clean slate):
$ cd test
$ export K8S_VERSION=1.17
$ ./vagrant-local-start.sh
$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
[k8s1 host can not connect to service "http://192.168.36.12:32544" (failed in request 1/10)
Expected command: kubectl exec -n kube-system log-gatherer-x282s -- curl --path-as-is -s -D /dev/stderr --fail --connect-timeout 5 --max-time 8 http://192.168.36.12:32544 -w "time-> DNS: '%{time_namelookup}(%{remote_ip})', Connect: '%{time_connect}',Transfer '%{time_starttransfer}', total '%{time_total}'" 
To succeed, but it failed:
Exitcode: 28 
Stdout:
 	 time-> DNS: '0.000033()', Connect: '0.000000',Transfer '0.000000', total '5.002505'
Stderr:
 	 
]

kubectl get pods (note that the pod IPs are now within 10.0.0.0/16 instead):

NAMESPACE     NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE   NOMINATED NODE   READINESS GATES
default       test-k8s2-5b756fd6c5-z6jfp        2/2     Running   0          107s    10.0.1.105      k8s2   <none>           <none>
default       testclient-b69cs                  1/1     Running   0          107s    10.0.1.56       k8s2   <none>           <none>
default       testclient-pb59z                  1/1     Running   0          107s    10.0.0.214      k8s1   <none>           <none>
default       testds-2ftnw                      2/2     Running   0          107s    10.0.1.229      k8s2   <none>           <none>
default       testds-zv6gv                      2/2     Running   0          107s    10.0.0.205      k8s1   <none>           <none>
kube-system   cilium-g45qs                      1/1     Running   0          2m54s   192.168.36.12   k8s2   <none>           <none>
kube-system   cilium-operator-bb8f7cb95-8svzw   1/1     Running   0          2m54s   192.168.36.11   k8s1   <none>           <none>
kube-system   cilium-prntb                      1/1     Running   0          2m54s   192.168.36.11   k8s1   <none>           <none>
kube-system   coredns-767d4c6dd7-bc5bn          1/1     Running   0          2m3s    10.0.1.244      k8s2   <none>           <none>
kube-system   etcd-k8s1                         1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-apiserver-k8s1               1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-controller-manager-k8s1      1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-hm4bx                  1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   kube-proxy-vxp5t                  1/1     Running   0          4m24s   192.168.36.12   k8s2   <none>           <none>
kube-system   kube-scheduler-k8s1               1/1     Running   0          17m     192.168.36.11   k8s1   <none>           <none>
kube-system   log-gatherer-d4ss8                1/1     Running   0          3m4s    192.168.36.12   k8s2   <none>           <none>
kube-system   log-gatherer-x282s                1/1     Running   0          3m4s    192.168.36.11   k8s1   <none>           <none>

Comparing the saved iptables rules of the k8s2 node in the two cases reveals this:

FORWARD DROPs:

< :INPUT ACCEPT [10035:2583634]
< :FORWARD DROP [0:0]
< :OUTPUT ACCEPT [10081:2304007]
---
> :INPUT ACCEPT [27688:5897898]
> :FORWARD DROP [40:2400]
> :OUTPUT ACCEPT [27765:9353361]

No KUBE-FORWARD rule matches, as the source address is no longer within 10.10.0.0/16:

< [28:6580] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
---
> [0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Note: The packet and byte counts above are from a more limited test run, so they do not match the full run counts.
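
A convenient way to watch these counters while re-running the failing request (standard tooling, not from the original report):

vagrant@k8s2:~$ sudo watch -n1 'iptables-save -c -t filter | grep KUBE-FORWARD'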

To validate this, add new rules that match the actually allocated pod range:

vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
vagrant@k8s2:~$ sudo iptables -t filter -I KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Repeat for k8s1. After this the test traffic works:

$ ginkgo --focus="K8s.*Tests.NodePort.with.L7.Policy" -v -- --cilium.provision=false --cilium.showCommands --cilium.holdEnvironment=true
...
Ran 1 of 395 Specs in 172.768 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 394 Skipped
PASS

Note that kube-proxy's periodic iptables sync will remove the manually added rules within a few minutes, so the validation may need to be repeated if that happens; a sketch for keeping the rules in place follows.
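
A minimal sketch to keep the source-side rule in place during a longer run, using iptables -C to test for the rule and re-inserting it when it has been flushed (a hypothetical helper, not part of the original validation; the loop manages its own comment-less rule):

vagrant@k8s2:~$ while true; do \
    sudo iptables -t filter -C KUBE-FORWARD -s 10.0.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT 2>/dev/null || \
    sudo iptables -t filter -I KUBE-FORWARD -s 10.0.0.0/16 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT; \
    sleep 30; done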

And the counters on the new rules are updated:

vagrant@k8s2:~$ sudo iptables-save -c | grep KUBE-FORWARD
:KUBE-FORWARD - [0:0]
[88:19442] -A FORWARD -m comment --comment "kubernetes forwarding rules" -j KUBE-FORWARD
[0:0] -A KUBE-FORWARD -d 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[80:18962] -A KUBE-FORWARD -s 10.0.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP
[0:0] -A KUBE-FORWARD -m comment --comment "kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
[0:0] -A KUBE-FORWARD -s 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod source rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
[0:0] -A KUBE-FORWARD -d 10.10.0.0/16 -m comment --comment "kubernetes forwarding conntrack pod destination rule" -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

Based on this, the problem seems to be that the cilium-operator podCIDR allocator may allocate pod CIDR ranges that are not within the k8s cluster CIDR, possibly depending on the k8s version (the affected test has been seen to succeed in CI with k8s 1.18). One way to compare the two ranges is sketched below.
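
A sketch for comparing the configured cluster CIDR against the ranges cilium-operator handed out; the spec.ipam.podCIDRs field name on the CiliumNode CRD is an assumption based on recent Cilium versions:

# Cluster CIDR as configured on the controller manager:
$ kubectl -n kube-system get pod kube-controller-manager-k8s1 -o yaml | grep cluster-cidr

# Pod CIDRs allocated per node, assuming the CiliumNode CRD exposes them
# under spec.ipam.podCIDRs:
$ kubectl get ciliumnodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ipam.podCIDRs}{"\n"}{end}'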

Labels: area/datapath, area/k8s, area/proxy, ci/flake, kind/bug, release-note/misc
