Skip to content

bpf: nodeport: handle DSR at from-netdev / to-netdev#22978

Merged
YutaroHayakawa merged 7 commits intocilium:masterfrom
julianwiedmann:nodeport-hostns-dsr
Mar 7, 2023
Merged

bpf: nodeport: handle DSR at from-netdev / to-netdev#22978
YutaroHayakawa merged 7 commits intocilium:masterfrom
julianwiedmann:nodeport-hostns-dsr

Conversation

@julianwiedmann
Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann commented Jan 9, 2023

(see partner PR #22756)

Replies by DSR service backends currently get their revDNAT processing in
from-container. There's two problems with that:

  1. we have cases where the packet doesn't reach this part of the
    from-container program (ie. redirect to host proxy), and
  2. we're not handling replies by hostns backends at all.

This PR tackles the problem as follows:

  1. Add support for RevDNAT of DSR connections in the to-netdev path.
  2. extract the DSR information (from IPv4 Options / IPv6 Extension Header) in the nodeport entry path, and not in to-container. Build a nodeport-level CT entry for connections that have DSR info (similar as for the local-backend case). Use the DSR info to build a SNAT entry for RevDNAT of the backend replies.
  3. reduce the DSR support in bpf_lxc to only cover old connections. The nodeport path can't easily create a CT entry for these connections (as for TCP we won't see another SYN with DSR info). There's possibilities how to move this into to-netdev as well, but let's try that in a follow-on PR.
When using KPR Nodeport with DSR, support backends in hostNetwork or with L7 policies.

Updated nodeport ingress diagram:

flowchart TD
    A[from-netdev] -->|CILIUM_CALL_IPV4_FROM_NETDEV| B{"bpf_skip_nodeport()?"}
    B --> |No| C["nodeport_lb4()"]
    B --> |Yes| B1["// further packet process"]
    C --> D["lb4_lookup_service()"]
    D --> E{"dst is svc ?"}
    E --> |No| F{"has DSR info / CT entry with .dsr set?"}
    E --> |Yes| F1["lb4_local()"]
    F --> |Yes| F2[" "]
    F2 --> |CILIUM_CALL_IPV4_NODEPORT_DSR_INGRESS| X["tail_nodeport_dsr_ingress_ipv4()"]
    X --> X1["create / update CT entry and SNAT entry"]
    X1 --> X2["bpf_skip_nodeport_set()"]
    X2 --> |CILIUM_CALL_IPV4_FROM_NETDEV| B
    F1 --> G1{"backend local?"}
    G1 --> |Yes| H1["CTX_ACT_OK"]
    G1 --> |No| H2[" "]
    H2 --> |CILIUM_CALL_IPV4_NODEPORT_NAT_EGRESS| G2["tail_nodeport_nat_egress_ipv4()"]
    G2 --> J2["snat_v4_nat()"]
    J2 --> K2{"found new SNAT mapping?"}
    K2 --> |No| L2["CTX_ACT_DROP"]
    K2 --> |Yes| M2["// Do SNAT"]
    M2 --> N2["ctx_redirect()"]
    F --> |No| F3[" "]
    F3 --> |CILIUM_CALL_IPV4_NODEPORT_NAT_INGRESS| G["tail_nodeport_nat_ingress_ipv4()"]
    G --> H["snat_v4_rev_nat()"]
    H --> J{"Reverse mapping exist?"}
    J --> |No| K["bpf_skip_nodeport_set()"]
    K --> |CILIUM_CALL_IPV4_FROM_NETDEV| B
    J --> |"Yes\n(This can be rev-SNAT for outside2pod)"| L["tail_rev_nodeport_lb4()"]
    L --> M["rev_nodeport_lb4()"]
    M --> N["ct_lb_lookup4()"]
    N --> O{"NodePort CT reply?"}
    O --> |Yes| P["// do rev-DNAT"]
    P --> R["ctx_redirect()"]
    O --> |No| Q["bpf_skip_nodeport_set()"]
    Q --> |CILIUM_CALL_IPV4_FROM_NETDEV| B
Loading

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 9, 2023
@julianwiedmann julianwiedmann added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jan 9, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 9, 2023
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann julianwiedmann force-pushed the nodeport-hostns-dsr branch 2 times, most recently from b356961 to c02dfcc Compare January 10, 2023 12:53
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann julianwiedmann force-pushed the nodeport-hostns-dsr branch 2 times, most recently from 3a3d73d to e77c6f8 Compare January 23, 2023 14:50
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann julianwiedmann changed the title WIP DSR reply from hostns bpf: nodeport: handle revDNAT for DSR backends at to-netdev Jan 23, 2023
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann julianwiedmann added the area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. label Jan 27, 2023
@julianwiedmann
Copy link
Copy Markdown
Member Author

/test

@julianwiedmann
Copy link
Copy Markdown
Member Author

julianwiedmann commented Jan 27, 2023

Some ideas for reviewers to double-check:

  • is there any change in behaviour for old connections (where bpf_lxc has already created the SNAT entry and flagged its CT entry as .dsr = true)
  • in from-container, is there anything in the code path after the call to xlate_dsr_v4() that might be surprised by now seeing replies that haven't been RevDNATed yet
  • does all the CT logic in tail_nodeport_dsr_ingress_ipv4() check out?

@julianwiedmann julianwiedmann added the backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. label Apr 24, 2023
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Oct 30, 2023
With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
dylandreimerink pushed a commit that referenced this pull request Oct 30, 2023
DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in #22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
dylandreimerink pushed a commit that referenced this pull request Oct 30, 2023
With #22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 30, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Oct 31, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
aditighag pushed a commit that referenced this pull request Nov 2, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in #22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
aditighag pushed a commit that referenced this pull request Nov 2, 2023
[ upstream commit 21072cd ]

With #22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Nov 3, 2023
[ upstream commit 5b22423 ]

DSR uses a OUT NAT entry for RevDNAT of backend replies. Prior to the
changes in cilium#22978, this NAT entry was
protected by the CT_INGRESS entry which bpf_lxc creates for the backend
connection.

Test that GC of the NAT entry works when the CT entry is removed.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
pippolo84 pushed a commit to pippolo84/cilium that referenced this pull request Nov 3, 2023
[ upstream commit 21072cd ]

With cilium#22978 we changed how DSR NAT
entries are managed. Instead of associating the NAT entry's lifetime with
bpf_lxc's CT_INGRESS entry, the nodeport code on the backend now creates
its own CT_EGRESS entry. When such a CT_EGRESS entry is GC'ed, we should
therefore also purge the related DSR NAT entry.

Also add a test for this case.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

DSR does not work for service endpoints in host netns

8 participants