
[Tracking] Calico Pod-to-pod connectivity interrupted on kernel 5.4 with Mellanox ConnectX4 Lx (MT27710 Family) #183

@iaguis

Flatcar Status: A workaround is in place for releases 2605.4.0 and above. This issue is kept for tracking the bug with newer kernel releases.

Description

On Kubernetes, connections between pods running on different nodes don't work on the current Flatcar Beta (2605.3.0) and Alpha (2605.1.0) releases. Stable (2512.4.0) works fine for me. My clusters run Calico.

Impact

My Kubernetes clusters don't work on Flatcar Beta and Alpha.

Environment and steps to reproduce

  1. Set-up: Flatcar running on Packet.
  2. Task: N/A
  3. Action(s): (see the "Additional information" section for more details)
    a. Started pod A with a web server
    b. Started pod B on a different node of the cluster
    c. Tried to connect from pod B to pod A
  4. Error: Connection timeout. However, ping works between the containers.
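The steps above can be sketched with kubectl; the pod names, node names, and images here are illustrative, not the exact commands used:

```shell
# Pod A: a web server, pinned to one worker node (names are examples)
kubectl run pod-a --image=nginx \
  --overrides='{"spec":{"nodeName":"worker-1"}}'

# Pod B: a client pod on a different node
kubectl run pod-b --image=busybox \
  --overrides='{"spec":{"nodeName":"worker-2"}}' -- sleep 3600

# Try to connect from pod B to pod A's pod IP
POD_A_IP=$(kubectl get pod pod-a -o jsonpath='{.status.podIP}')
kubectl exec pod-b -- wget -O- --timeout=5 "http://$POD_A_IP"  # times out on affected releases
kubectl exec pod-b -- ping -c 3 "$POD_A_IP"                    # ping still works
```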

Expected behavior

An established connection.

Additional information

  • I used the following lokocfg and Lokomotive v0.4.0:

    cluster "packet" {
      asset_dir = "./assets"
    
      cluster_name = "broken"
    
      controller_count = 1
      controller_type = "t1.small.x86"
    
      enable_tls_bootstrap = false
    
      os_channel = "alpha"
    
      dns {
        provider = "route53"
        zone = "example.net"
      }
    
      facility = "sjc1"
      project_id = "..."
    
      ssh_pubkeys = [
        "ssh-rsa AAAAB3...",
      ]
    
      management_cidrs = ["0.0.0.0/0"]
      node_private_cidr = "10.xxx.xxx.xxx/25"
    
      enable_aggregation = true
    
      oidc {}
    
      worker_pool "pool-1" {
        count = 2
        node_type = "c2.medium.x86"
    
        os_channel = "alpha"
      }
    }
    
  • I followed the instructions in the Kubernetes "Debug Services" guide to reproduce the issue.

  • I ran tcpdump and saw the packets arriving at the receiving host through the Calico tunl0 interface, but they never reach the receiving container's veth.
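A minimal sketch of that capture, assuming the pod serves on port 80 and a Calico veth name looked up beforehand (the veth name below is a placeholder):

```shell
# On the receiving node: packets arrive on the Calico IPIP tunnel interface
sudo tcpdump -ni tunl0 'tcp port 80'

# ...but nothing ever shows up on the destination pod's veth
# (find the real veth name via `ip route get <pod-ip>` or `ip link`)
sudo tcpdump -ni caliXXXXXXXXXXX 'tcp port 80'
```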

  • I checked the iptables counters on the receiving host and saw that this rule is triggered when I try to connect from pod B to pod A:

    core@test-broken-pool-1-worker-1 ~ $ sudo iptables-save -c | grep DROP | grep -v "\[0:0\]"
    ...
    [9:540] -A cali-fh-tunl0 -m comment --comment "cali:Su0l1tIx53hedKuv" -m conntrack --ctstate INVALID -j DROP
    ...
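Packets dropped as conntrack INVALID on a tunnel interface often point at checksum or offload problems. These are standard conntrack/ethtool/sysctl diagnostics, not a confirmed fix for this bug; interface names are assumptions:

```shell
# Watch the conntrack "invalid" counter climb while reproducing
sudo conntrack -S | grep -o 'invalid=[0-9]*'

# Inspect checksum offload settings on the NIC and the tunnel
sudo ethtool -k eth0  | grep checksum
sudo ethtool -k tunl0 | grep checksum

# If conntrack is rejecting packets over bad checksums, this sysctl
# makes it skip checksum validation (for diagnosis only)
sudo sysctl -w net.netfilter.nf_conntrack_checksum=0
```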
    
  • I tried fixes similar to the one mentioned in #181 ("systemd 245 (channel 2605) breaks cilium pod to out-of-node traffic"), but even setting rp_filter=0 on all interfaces doesn't fix the issue.
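For reference, "setting rp_filter=0 on all interfaces" amounts to something like the following (it did not help here):

```shell
# Disable reverse-path filtering globally and for every existing interface
sudo sysctl -w net.ipv4.conf.all.rp_filter=0
sudo sysctl -w net.ipv4.conf.default.rp_filter=0
for f in /proc/sys/net/ipv4/conf/*/rp_filter; do
  echo 0 | sudo tee "$f" >/dev/null
done
```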

Labels

  • channel/alpha: Issue concerns the Alpha channel.
  • channel/beta: Issue concerns the Beta channel.
  • kind/bug: Something isn't working.