
High-scale IPCache: Nodeport LB support Part 2 #25854

Merged
dylandreimerink merged 7 commits into cilium:main from julianwiedmann:1.14-hs-ipcache
Jun 6, 2023

Conversation

@julianwiedmann
Member

@julianwiedmann julianwiedmann commented Jun 2, 2023

Continuing from #25745, this PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.

It enables the nodeport LB to handle a GENEVE-encapsulated service request (from inside the clustermesh), and forward the request with the client's original security identity to the backend using GENEVE-DSR. It also adds full XDP passthrough support for this feature.

As we want to support in-clustermesh traffic to dedicated LB nodes (where the service translation happens in the LB, and not at the source), the MTU is reduced sufficiently to tolerate the GENEVE DSR option that gets added by the LB.

In detail without XDP:

  • encapsulated packet enters from-netdev. Gets manually decapsulated, and redirected to cilium_geneve.
  • in from-overlay, nodeport_lb4() load-balances the inner packet. Tail-calls to the DSR egress path, and propagates the client's security identity along (that we previously obtained from the tunnel encapsulation).
  • The DSR-GENEVE egress code adds fresh encapsulation, and inserts the security identity here (instead of WORLD_ID).
  • the load-balanced request is forwarded to the backend node.

And with XDP:

  • as the packet enters the XDP program, we detect that it's GENEVE-encapsulated and locate the inner IPv4 header.
  • nodeport_lb4() now operates on the inner packet (service lookup, DNAT etc).
  • The DSR-GENEVE egress code learns how to update the existing encapsulation headers, and punches the DSR GENEVE option into the right location.
  • the load-balanced request is forwarded to the backend node from inside XDP.
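The offset bookkeeping in the XDP flow above can be sketched in plain C. This is an illustrative model, not Cilium's code; it assumes the fixed-size outer headers this path restricts itself to (no outer IP options, no pre-existing GENEVE options) and an inner Ethernet frame as the GENEVE payload:

```c
#include <assert.h>

/* Illustrative model of locating the inner IPv4 header (l3_off) in a
 * GENEVE-in-IPv4/UDP packet. Assumes fixed-size outer headers with no
 * IP options, and an Ethernet frame as the GENEVE payload.
 */
#define ETH_HLEN    14  /* Ethernet header */
#define IPV4_HLEN   20  /* IPv4 header without options */
#define UDP_HLEN     8  /* UDP header */
#define GENEVE_HLEN  8  /* fixed GENEVE base header (RFC 8926) */

/* geneve_opt_words is the GENEVE "Opt Len" field, in 4-byte words. */
static int inner_l3_off(int geneve_opt_words)
{
    /* outer Eth + outer IPv4 + UDP + GENEVE(+options) + inner Eth */
    return ETH_HLEN + IPV4_HLEN + UDP_HLEN +
           GENEVE_HLEN + geneve_opt_words * 4 + ETH_HLEN;
}
```

With no GENEVE options this yields an inner l3_off of 64 bytes; each option word shifts it by another 4.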
Release note:
Add support for load-balancing encapsulated requests in a configuration with high-scale ipcache.

@julianwiedmann julianwiedmann added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/feature This introduces new functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. feature/high-scale-ipcache Relates to the high-scale ipcache feature. labels Jun 2, 2023
@julianwiedmann julianwiedmann added feature/lb-only Impacts cilium running in lb-only datapath mode labels Jun 2, 2023
@julianwiedmann julianwiedmann added area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels Jun 2, 2023
@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann force-pushed the 1.14-hs-ipcache branch 2 times, most recently from cd1b120 to 73019fc on June 2, 2023 08:53
…MODE

The high-scale IP cache mode requires native routing.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When the TC path load-balances a request in the from-overlay program, the
original SrcSecID (as extracted from the encap header) gets passed to
nodeport_lb4() via the `src_identity` parameter.

Pass this through to the DSR path, so that we can re-insert it into the
fresh encapsulation. This preserves the client's SrcSecID all the way to
the backend.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
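A toy model of this commit's effect, with hypothetical names rather than Cilium's actual datapath code — only WORLD_ID = 2 reflects Cilium's reserved "world" identity, and the fallback for an unknown (zero) identity is an assumption for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: the DSR encap stage stamps the identity that from-overlay
 * extracted from the original encap header into the fresh tunnel
 * header, rather than unconditionally using WORLD_ID. The fallback for
 * an unknown (zero) identity is an assumption for illustration.
 */
#define WORLD_ID 2  /* Cilium's reserved "world" identity */

static uint32_t dsr_src_identity(uint32_t src_identity)
{
    return src_identity ? src_identity : WORLD_ID;
}
```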
For in-cluster access to services, we require Geneve encapsulation to
transport the source's security identity to the backend.

We haven't looked at enabling this in the Nodeport NAT path, so require
full-DSR mode.

Also reflect this in the relevant BPF compile tests:
- add ENABLE_DSR and DSR_ENCAP_GENEVE
- remove ENABLE_DSR_HYBRID

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Pods connecting to a remote service backend must consider that the LB needs
to insert a GENEVE-DSR option.

In practice this is only relevant for UDP packets, as for TCP we have the
optimization to only insert the option into the SYN packet.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
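As a rough sketch of the MTU arithmetic this commit implies — the constants here are illustrative, and only the 12-byte option size matches the DsrTunnelOverhead constant introduced by this PR:

```c
#include <assert.h>

/* Illustrative MTU accounting. GENEVE_OVERHEAD is an assumed value for
 * outer IPv4 + UDP + GENEVE + inner Ethernet; only the 12-byte DSR
 * option size matches the PR's DsrTunnelOverhead constant.
 */
#define BASE_MTU            1500
#define GENEVE_OVERHEAD       50
#define DSR_TUNNEL_OVERHEAD   12  /* GENEVE DSR option added by the LB */

static int route_mtu(int base, int hs_ipcache_dsr)
{
    int mtu = base - GENEVE_OVERHEAD;

    /* leave room for the DSR option the LB punches in later */
    if (hs_ipcache_dsr)
        mtu -= DSR_TUNNEL_OVERHEAD;
    return mtu;
}
```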
Add minimal support for inserting the GENEVE DSR IPv4 option into an
existing GENEVE header. Note that this currently only supports packets that
have fixed-length encap headers (i.e. no IP or GENEVE options).

Callers are expected to adjust the outer headers as needed (e.g. update the
length field in the UDP header).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
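The outer-header bookkeeping described above can be sketched as follows; the `>> 2` mirrors the `geneve.opt_len += (__u8)(opt_len >> 2)` line quoted in the review below, while the helper names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the outer-header bookkeeping after growing the packet by
 * opt_len bytes for a new GENEVE option. Helper names are illustrative.
 */

/* GENEVE's Opt Len field counts options in 4-byte words (RFC 8926),
 * hence the shift; opt_len must be a multiple of 4. */
static uint8_t geneve_opt_len_after(uint8_t opt_words, int opt_len)
{
    return (uint8_t)(opt_words + (opt_len >> 2));
}

/* The outer UDP and IPv4 length fields grow by the raw byte count. */
static uint16_t outer_len_after(uint16_t len, int opt_len)
{
    return (uint16_t)(len + opt_len);
}
```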
nodeport_lb*() currently assumes that the L3 header is located inside the
packet at offset ETH_HLEN. To enable load-balancing for the inner packet
of an encapsulated connection, parameterize this offset and the actual
ip4 pointer. For some programs it's an actual win to just pass through the
ip4 pointer, as they have already validated it.

For now only touch the IPv4 path to avoid churn. IPv6 will follow when
needed.

Also add variants of the L2 / L3 validation helpers that are able to deal
with variable offsets.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Enable the XDP Loadbalancer to handle GENEVE-encapsulated service requests.
Such traffic would usually get passed up to TC, have its encapsulation
stripped off and then get load-balanced in the from-overlay program.

But in combination with DSR_ENCAP_GENEVE, we can preserve the encap headers
(and manually adjust them as needed), insert the DSR option (if needed) and
redirect the packet to its backend right away.

In detail:
- the XDP program detects GENEVE encapsulation, and determines the l3_off
  for the inner packet.
- nodeport_lb4() then transparently load-balances the inner packet, and
  passes the l3_off to the DSR stage.
- the DSR stage for XDP has all the smarts to adjust the encapsulation
  headers, and insert the DSR option if necessary.

The GENEVE detection in bpf_xdp is fairly minimalistic now. But this
mostly matches the restrictions we have on the ctx_set_tunnel_opt() side
for XDP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann requested a review from ldelossa June 2, 2023 13:03
@julianwiedmann julianwiedmann marked this pull request as ready for review June 2, 2023 13:04
@julianwiedmann julianwiedmann requested review from a team as code owners June 2, 2023 13:04
@julianwiedmann
Member Author

(the coccicheck warning is a false-positive. We merely raise the DROP_MISSED_TAIL_CALL drop notification a bit later in the code).

Contributor

@lmb lmb left a comment

Total noob as to sig-datapath changes, I'll try and find someone else to review. Other owned changes LGTM.


out:
if (IS_ERR(ret))
return send_drop_notify_error_ext(ctx, 0, ret, ext_err,
Contributor

Nit: this will now send ext_err if the tail call failed. Does that matter?

Member Author

That's fine. In the end this is for fine-grained debugging, so we would know whether ext_err is even relevant for the reported DROP reason.

static __always_inline __maybe_unused bool
__revalidate_data_pull(struct __ctx_buff *ctx, void **data, void **data_end,
void **l3, const __u32 l3_len, const bool pull)
void **l3, const __u32 l3_off, const __u32 l3_len,
Contributor

Nit: feels more conventional if l3 and l3_len aren't split in args.

Member Author

🤷 either way feels fine to me

struct genevehdr geneve;

/* add free space after GENEVE header: */
if (ctx_adjust_hroom(ctx, opt_len, BPF_ADJ_ROOM_MAC, ctx_adjust_hroom_flags()) < 0)
Contributor

Noob question: this doesn't take geneve_off so how does the function know where to add space?

Member Author

See the corresponding change in xdp.h's ctx_adjust_hroom(). It's completely custom and fragile - we just "know" how to behave for specific hroom values.

if (ctx_load_bytes(ctx, geneve_off, &geneve, sizeof(geneve)) < 0)
return DROP_INVALID;

geneve.opt_len += (__u8)(opt_len >> 2);
Contributor

I guess opt_len must be a multiple of 4 or something. Can we build_bug_on or similar? Or document that somewhere.

Member Author

To a certain extent I would hope that callers know what data format they need to follow for such an option (as it also needs to work for the TC variant of set_tunnel_opt()), and get to keep the pieces if they don't.

But yep, a build_bug() should be doable - will queue that up as follow-on, thanks!
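A minimal sketch of such a compile-time guard, using C11 _Static_assert; the constant name is hypothetical:

```c
#include <assert.h>

/* build_bug()-style guard: reject a DSR option size that is not a
 * multiple of 4 at compile time. DSR_GENEVE_OPT_LEN is a hypothetical
 * example constant, not Cilium's.
 */
#define DSR_GENEVE_OPT_LEN 12

_Static_assert((DSR_GENEVE_OPT_LEN & 3) == 0,
               "GENEVE options are counted in 4-byte words");

/* runtime mirror of the same invariant, for illustration */
static int opt_len_is_valid(int opt_len)
{
    return (opt_len & 3) == 0;
}
```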

// true, the MTU is adjusted to account for encapsulation overhead for all
// routes involved in node to node communication.
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration {
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration {
Contributor

Drive by comment: this really ought to take a struct. All of the bools makes this unreadable at callsites.

Nit: capitalization of hsIpcacheDSRenabled seems wonky.

Member Author

No disagreement from my side, this has grown way out of proportion 😬 .

@julianwiedmann
Member Author

Total noob as to sig-datapath changes, I'll try and find someone else to review. Other owned changes LGTM.

Thank you! Hopefully @ldelossa and/or @borkmann can have a look :).

Member

@gandro gandro left a comment

My codeowners looks good to me. Seconding the proposal to refactor the MTU constructor at some point.

Contributor

@ldelossa ldelossa left a comment

All nits, LGTM.


// DsrTunnelOverhead is about the GENEVE DSR option that gets inserted
// by the LB, when addressing a Service in hs-ipcache mode
DsrTunnelOverhead = 12
Contributor

Why this name?

If the comments immediately point out that this value is used for "GENEVE DSR" then why not make the value also contain this detail? Seems pretty specific to Geneve.

Member Author

👍 probably just leftovers from the initial prototype, and missed while polishing.

// true, the MTU is adjusted to account for encapsulation overhead for all
// routes involved in node to node communication.
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration {
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration {
Contributor

Just a nit but:

s/hsIpcacheDSRenabled/hsIPCacheDSREnabled

Member Author

🐫


static __always_inline bool validate_ethertype(struct __ctx_buff *ctx,
__u16 *proto)
static __always_inline bool validate_ethertype_l2_off(struct __ctx_buff *ctx,
Contributor

What's your thoughts on taking and optionally filling a pointer to an ethhdr here, akin to how we fill in layer 3 headers in revalidate_data.

Could be handy if we quickly need to inspect l2 attributes?

Member Author

Do we have any paths that need more than just the ethertype today?
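For context, a userspace sketch of what a variable-offset ethertype read looks like — purely illustrative, not the BPF helper itself:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace sketch of a validate_ethertype_l2_off()-style helper:
 * read the 16-bit ethertype at a caller-supplied L2 offset instead of
 * assuming the frame starts at offset 0.
 */
#define ETH_ALEN 6
#define ETH_P_IP 0x0800

static int read_ethertype(const uint8_t *pkt, int pkt_len, int l2_off,
                          uint16_t *proto)
{
    int type_off = l2_off + 2 * ETH_ALEN;  /* skip dst + src MAC */

    if (type_off + 2 > pkt_len)
        return -1;  /* would read past the end of the packet */
    /* the ethertype is big-endian on the wire */
    *proto = (uint16_t)((pkt[type_off] << 8) | pkt[type_off + 1]);
    return 0;
}

/* sample frame starting at offset 4, ethertype set to IPv4 */
static const uint8_t sample[20] = { [16] = 0x08, [17] = 0x00 };

static uint16_t sample_proto(void)
{
    uint16_t proto = 0;

    read_ethertype(sample, sizeof(sample), 4, &proto);
    return proto;
}
```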

@julianwiedmann julianwiedmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 6, 2023
@julianwiedmann
Member Author

Reviews are in, tests are green (with one exception for #25854 (comment)) -> ready to merge.

@dylandreimerink could you please do the honors? 🙏

@dylandreimerink
Member

dylandreimerink commented Jun 6, 2023

What is the deal with the "./bpf_xdp.c: DROP_MISSED_TAIL_CALL missing after tail call on line 189" error thrown by coccicheck? It seems you are breaking a custom rule: https://github.com/cilium/cilium/blob/main/contrib/coccinelle/tail_calls.cocci#L102

@dylandreimerink dylandreimerink merged commit ba78c61 into cilium:main Jun 6, 2023
@julianwiedmann julianwiedmann deleted the 1.14-hs-ipcache branch June 6, 2023 07:58
@pchaigno
Member

pchaigno commented Jun 6, 2023

@dylandreimerink @julianwiedmann Why was this merged when it is known to break Coccicheck? Coccicheck is a Required CI job.

@julianwiedmann
Member Author

@dylandreimerink @julianwiedmann Why was this merged when it is known to break Coccicheck? Coccicheck is a Required CI job.

It was a false-positive for this change, and we already had plenty of existing code paths that follow the same pattern. Why is coccicheck now complaining about this specific part, but the 20 others are fine? 😠

Either way, #25924.

@pchaigno
Member

pchaigno commented Jun 6, 2023

I understand all that, but it's still a CI breakage.

@julianwiedmann
Member Author

I understand all that, but it's still a CI breakage.

ack, sorry. The expectation was that it would only check the actual patch diff 😞

@pchaigno
Member

pchaigno commented Jun 6, 2023

Ah, I see 👍 Only checkpatch works like that today, which is part of the reason why checkpatch is not marked Required.

julianwiedmann added a commit that referenced this pull request Mar 26, 2026
This code was initially introduced as part of the hs-ipcache feature
(#25745,
 #25854). It allowed for the XDP path
to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE
mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might*
have seen some unintentional usage for E/W NodePort access when SocketLB
is disabled - but with #41963, we now
also LB at the source in such configurations. Either way, it's perfectly
fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
github-merge-queue bot pushed a commit that referenced this pull request Mar 31, 2026
This code was initially introduced as part of the hs-ipcache feature
(#25745,
 #25854). It allowed for the XDP path
to peek into GENEVE-encapsulated traffic, LB it in-place (using DSR-GENEVE
mode), and redirect the packet straight out to the backend.

But with HS-ipcache gone, this code path is pretty much unused. It *might*
have seen some unintentional usage for E/W NodePort access when SocketLB
is disabled - but with #41963, we now
also LB at the source in such configurations. Either way, it's perfectly
fine to remove this optimization.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

Labels

area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. area/loadbalancing Impacts load-balancing and Kubernetes service implementations feature/high-scale-ipcache Relates to the high-scale ipcache feature. feature/lb-only Impacts cilium running in lb-only datapath mode kind/feature This introduces new functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.

6 participants