
bpf: nat: skip snat when using iptables masquerade #36631

Closed
gyutaeb wants to merge 1 commit into cilium:main from gyutaeb:fix-wrong-snat-36572

Conversation


@gyutaeb gyutaeb commented Dec 16, 2024

Please ensure your pull request adheres to the following guidelines:

  • For first time contributors, read Submitting a pull request
  • All code is covered by unit and/or runtime tests where feasible.
  • All commits contain a well written commit description including a title,
    description and a Fixes: #XXX line if the commit addresses a particular
    GitHub issue.
  • If your commit description contains a Fixes: <commit-id> tag, then
    please add the commit author[s] as reviewer[s] to this issue.
  • All commits are signed off. See the section Developer’s Certificate of Origin
  • Provide a title or release-note blurb suitable for the release notes.
  • Are you a user of Cilium? Please add yourself to the Users doc
  • Thanks for contributing!

When using iptables masquerade, the source address is already NATed by iptables, so SNAT should be skipped in that case. If SNAT is not skipped, the source address is NATed twice. It also unnecessarily occupies an entry in the SNAT_MAPPING map, and if SNAT_COLLISION_RETRIES is exceeded, it leads to connection failures.

Fixes: #36572

@gyutaeb gyutaeb requested a review from a team as a code owner December 16, 2024 13:24
@gyutaeb gyutaeb requested a review from gentoo-root December 16, 2024 13:24
@maintainer-s-little-helper maintainer-s-little-helper Bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) Dec 16, 2024
@github-actions github-actions Bot added the kind/community-contribution label (This was a contribution made by a community member.) Dec 16, 2024
@gyutaeb gyutaeb force-pushed the fix-wrong-snat-36572 branch from 3d57c2b to dde7574 on December 18, 2024 03:03

@julianwiedmann julianwiedmann left a comment


It is no longer correct to check the NAT_MIN_EGRESS condition. This is because target->from_local_endpoint is now being checked. In the case of !target->from_local_endpoint, source NAT is not required,

I don't follow. Can you please elaborate much more on the suggested change? Why is source NAT not required in the case you're describing? Also, please add this in-depth explanation to the patch description, not just the PR.

@julianwiedmann julianwiedmann added the area/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.) and feature/snat (Relates to SNAT or Masquerading of traffic) labels Dec 18, 2024
@gyutaeb gyutaeb force-pushed the fix-wrong-snat-36572 branch from dde7574 to f75e97c on December 20, 2024 09:42

gyutaeb commented Dec 23, 2024

@julianwiedmann @gentoo-root Thank you for the review! When iptables masquerade is used instead of BPF masquerade, SNAT is not required. I plan to extend the cDefinesMap and the explanation of the suggested change, and will re-request review after that.

@gyutaeb gyutaeb force-pushed the fix-wrong-snat-36572 branch from f75e97c to 2649ebb on January 2, 2025 13:37
@gyutaeb gyutaeb requested a review from a team as a code owner January 2, 2025 13:37
@gyutaeb gyutaeb requested a review from ti-mo January 2, 2025 13:37
@gyutaeb gyutaeb changed the title from "bpf: nat: remove NAT_MIN_EGRESS" to "bpf: nat: skip snat when using iptables masquerade" Jan 2, 2025
@gyutaeb gyutaeb force-pushed the fix-wrong-snat-36572 branch from 2649ebb to 3ce2466 on January 3, 2025 03:35

gyutaeb commented Jan 3, 2025

@julianwiedmann @gentoo-root @ti-mo Thank you for your patience. The previous changes had some ambiguities. To clarify the bug and the fix, I added cDefinesMap["ENABLE_IPTABLES_MASQUERADE_IPV4"] and cDefinesMap["ENABLE_IPTABLES_MASQUERADE_IPV6"] to handle SNAT explicitly. Additionally, I provided a clear and detailed explanation in issue #36572 and in the commit description.
The issue is that when using iptables masquerade, the source address is already NATed by iptables, so SNAT should be skipped. If SNAT is not skipped, the source address is NATed twice. It also unnecessarily occupies an entry in the SNAT_MAPPING map, and if SNAT_COLLISION_RETRIES is exceeded, it leads to connection failures due to DROP_NAT_NO_MAPPING. (In the beginning, since there are no entries in SNAT_MAPPING, the source port is preserved; however, as entries accumulate, port NAT occurs.)
This patch ensures that SNAT is skipped when using iptables masquerade. It prevents duplicate SNAT, avoids unnecessary use of the SNAT_MAPPING map, and mitigates connection failures caused by packet drops. Additionally, by eliminating the complexity caused by duplicate SNAT, it reduces troubleshooting time and clarifies the masquerade process.
The new define was named based on the code reference below.

cilium/pkg/option/config.go

Lines 2416 to 2426 in 5a4fdc9

```go
// IptablesMasqueradingIPv4Enabled returns true if iptables-based
// masquerading is enabled for IPv4.
func (c *DaemonConfig) IptablesMasqueradingIPv4Enabled() bool {
	return !c.EnableBPFMasquerade && c.EnableIPv4Masquerade
}

// IptablesMasqueradingIPv6Enabled returns true if iptables-based
// masquerading is enabled for IPv6.
func (c *DaemonConfig) IptablesMasqueradingIPv6Enabled() bool {
	return !c.EnableBPFMasquerade && c.EnableIPv6Masquerade
}
```
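
For illustration only, here is a minimal, self-contained Go sketch of how the proposed defines could be derived from the helpers quoted above when populating cDefinesMap. The config type and the wiring below are assumptions made for this sketch, not the code from this patch:

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for DaemonConfig, mirroring the helpers
// quoted above. This only illustrates how the proposed defines could be
// derived when populating cDefinesMap; it is not the actual patch.
type config struct {
	EnableBPFMasquerade  bool
	EnableIPv4Masquerade bool
	EnableIPv6Masquerade bool
}

func (c config) iptablesMasqueradingIPv4Enabled() bool {
	return !c.EnableBPFMasquerade && c.EnableIPv4Masquerade
}

func (c config) iptablesMasqueradingIPv6Enabled() bool {
	return !c.EnableBPFMasquerade && c.EnableIPv6Masquerade
}

func main() {
	c := config{EnableBPFMasquerade: false, EnableIPv4Masquerade: true}

	cDefinesMap := map[string]string{}
	if c.iptablesMasqueradingIPv4Enabled() {
		cDefinesMap["ENABLE_IPTABLES_MASQUERADE_IPV4"] = "1"
	}
	if c.iptablesMasqueradingIPv6Enabled() {
		cDefinesMap["ENABLE_IPTABLES_MASQUERADE_IPV6"] = "1"
	}
	fmt.Println(cDefinesMap) // map[ENABLE_IPTABLES_MASQUERADE_IPV4:1]
}
```

With wiring along these lines, the BPF datapath could then gate its SNAT path on ENABLE_IPTABLES_MASQUERADE_IPV4 / ENABLE_IPTABLES_MASQUERADE_IPV6 at compile time.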

@gyutaeb gyutaeb requested a review from julianwiedmann January 3, 2025 04:05
When using iptables masquerade, the source address is already NATed by
iptables, so SNAT should be skipped in that case. If SNAT is not skipped,
the source address is NATed twice. It also unnecessarily occupies an entry
in the SNAT_MAPPING map, and if SNAT_COLLISION_RETRIES is exceeded, it
leads to connection failures.

Fixes: cilium#36572

Signed-off-by: Gyutae Bae <gyu.8ae@gmail.com>
@gyutaeb gyutaeb force-pushed the fix-wrong-snat-36572 branch from 3ce2466 to 2a6d867 on January 6, 2025 07:10

gyutaeb commented Jan 6, 2025

Update branch (rebase).


@julianwiedmann julianwiedmann left a comment


I don't see how this can be correct for the case of using BPF Nodeport without BPF masquerade. Please elaborate a lot more.

We're intentionally tracking connections in the SNAT engine for this case, and applying SNAT as needed to avoid conflicts between LB-to-remote-backend connections and host-originating / iptables-masqueraded connections.


gyutaeb commented Jan 9, 2025

I don't see how this can be correct for the case of using BPF Nodeport without BPF masquerade. Please elaborate a lot more.

We're intentionally tracking connections in the SNAT engine for this case, and applying SNAT as needed to avoid conflicts between LB-to-remote-backend connections and host-originating / iptables-masqueraded connections.

@julianwiedmann Thank you for your detailed explanation. I agree with your opinion. After reviewing the case you explained, I realized that skipping SNAT for iptables-masqueraded traffic cannot prevent the conflict with LB-to-remote-backend connections.

flow1: client -> node:nodeport(32000) >>>> SNAT engine >>>> node:natedPort(34000) -> remote-backend
flow2: host -> external >>>> iptables-masquerade >>>> host:natedPort(34000) -> external

I hadn't considered that scenario, so thank you for pointing it out. This PR should not be merged. To elaborate, during my attempts to reproduce this issue, I observed a rare case where the SNAT engine performed SNAT on an egress packet, but the reverse NAT for the response packet failed. This occurred when the --bpf-nat-global-max limit was reached. I tried skipping SNAT to resolve this issue and the resulting packet drops, but it's clear now that skipping SNAT is not the right approach. Instead, I will focus on analyzing why reverse NAT occasionally fails. If you have encountered or received reports about a similar issue, I'd greatly appreciate it if you could share them.
In summary, this PR should not be merged. I will focus on the reverse NAT issue and submit a new PR with an appropriate fix.

@julianwiedmann

In summary, this PR should not be merged. I will focus on the reverse NAT issue and submit a new PR with an appropriate fix.

Thank you for the feedback! Looking forward to what you come up with :).

I'll flip this PR to draft then for now.

@julianwiedmann julianwiedmann marked this pull request as draft January 9, 2025 16:52

gyutaeb commented Jan 16, 2025

I observed a rare case where the SNAT engine performed SNAT on an egress packet, but the reverse NAT for the response packet failed.

@julianwiedmann Hello, I'd like to share the root cause of the issue that caused REV NAT to fail, as mentioned earlier. This issue was mitigated by #35304, which introduced the idea of recreating evicted reverse entries during NAT, using the otuple. I think this is a fantastic approach!
While #35304 has significantly reduced the issue, there are still rare cases where the reverse entry has been evicted by the time snat_v4_rev_nat_handle_mapping() attempts to find it.
To address this, I'd like to propose an alternative: if snat_v4_rev_nat_handle_mapping() cannot find the reverse entry, it could attempt to recreate it using the original entry (assuming the original entry has not been deleted). What are your thoughts on this approach?

@julianwiedmann

I observed a rare case where the SNAT engine performed SNAT on an egress packet, but the reverse NAT for the response packet failed.

@julianwiedmann Hello, I'd like to share the root cause of the issue that caused REV NAT to fail, as mentioned earlier. This issue was mitigated by #35304, which introduced the idea of recreating evicted reverse entries during NAT, using the otuple. I think this is a fantastic approach! While #35304 has significantly reduced the issue, there are still rare cases where the reverse entry has been evicted by the time snat_v4_rev_nat_handle_mapping() attempts to find it. To address this, I'd like to propose an alternative: if snat_v4_rev_nat_handle_mapping() cannot find the reverse entry, it could attempt to recreate it using the original entry (assuming the original entry has not been deleted). What are your thoughts on this approach?

Sorry, I don't think I'll have the cycles to co-design a solution for the problem you encountered. But maybe you can connect with @sugangli and discuss this in more detail - hack up a minimal reproducer that demonstrates the problem etc?

@sugangli

Hi @gyutaeb,

IIUC, retrieving the original entry from the rev nat path could be a bit tricky. Could you give an example on how it could be done? Feel free to tag me in the new PR once you are ready.


gyutaeb commented Jan 19, 2025

Hi, @sugangli
I will reach out to you once the code is ready. I'm excited to collaborate with you. Currently, I am working on a PR related to #31213, and it will take a few more weeks to complete. Thank you for your support!


gyutaeb commented Feb 11, 2025

Hi @julianwiedmann @sugangli.
I have examined the idea mentioned above and would like to share the results.

First of all, recreating the reverse entry in snat_v4_rev_nat_handle_mapping() seems too difficult. This is because recreating it requires the otuple, which is only stored in the reverse entry itself, so once that entry is evicted the otuple is gone. To recreate reverse entries deleted by LRU, we would have to store the otuple in a separate data structure. However, this approach would unnecessarily increase software complexity, making it less desirable.
Instead, I have a new idea to mitigate network failures caused by LRU eviction. The idea is to recreate the deleted original entry in snat_v4_rev_nat_handle_mapping().

The issue occurs as follows:

  • snat_v4_rev_nat_handle_mapping() correctly finds the reverse entry.
  • When the egress packet reaches snat_v4_nat_handle_mapping(), the original entry has already been evicted by LRU.
  • Since the original entry is missing, snat_v4_new_mapping() assigns a new source port for NAT.
  • If this happens, the endpoint socket detects a port change and sends an RST packet.

This issue can still occur, and recreating the original entry in snat_v4_rev_nat_handle_mapping() might be a good way to address it. I'd love to hear your thoughts on this. I plan to prepare PoC code in the next few weeks.
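
To make the sequence above concrete, here is a small, self-contained Go simulation of the idea. It is only a sketch under assumed, simplified data structures (plain maps standing in for the LRU BPF NAT map), not the BPF datapath code; egress() and reply() are analogues of snat_v4_nat_handle_mapping() and snat_v4_rev_nat_handle_mapping():

```go
package main

import "fmt"

// Toy model of the SNAT maps, for illustration only. The types and names here
// are assumptions for this sketch and do not match bpf/lib/nat.h.
type tuple struct {
	addr string
	port int
}

type snatTable struct {
	orig     map[tuple]int // original tuple -> NATed source port (the "original entry")
	rev      map[int]tuple // NATed source port -> original tuple (the "reverse entry")
	nextPort int
}

// egress is the analogue of snat_v4_nat_handle_mapping(): reuse an existing
// mapping, or fall back to allocating a new port (the snat_v4_new_mapping()
// analogue), which is what changes the source port after LRU eviction.
func (t *snatTable) egress(o tuple) int {
	if p, ok := t.orig[o]; ok {
		return p
	}
	p := t.nextPort
	t.nextPort++
	t.orig[o] = p
	t.rev[p] = o
	return p
}

// reply is the analogue of snat_v4_rev_nat_handle_mapping(): find the reverse
// entry and, as proposed above, recreate the original entry if LRU eviction
// removed it, so the next egress packet keeps its NATed source port.
func (t *snatTable) reply(natPort int) (tuple, bool) {
	o, ok := t.rev[natPort]
	if !ok {
		return tuple{}, false
	}
	if _, exists := t.orig[o]; !exists {
		t.orig[o] = natPort // recreate the evicted original entry
	}
	return o, true
}

func main() {
	t := &snatTable{orig: map[tuple]int{}, rev: map[int]tuple{}, nextPort: 34000}
	client := tuple{addr: "10.0.0.1", port: 45678}

	p := t.egress(client)  // first egress packet is NATed to port 34000
	delete(t.orig, client) // simulate LRU evicting the original entry

	t.reply(p)                         // reply path recreates the original entry
	fmt.Println(t.egress(client) == p) // true: the source port is preserved, no RST
}
```

Without the recreation step in reply(), the second egress() call would allocate a fresh port, the endpoint socket would observe the port change, and the connection would be reset, matching the RST scenario described above.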

Additionally, it would be helpful to provide a tool that allows users to determine whether their bpf-nat-global-max value is appropriate. Normally, the maximum is not reached, but in cases where keepalive is not used, it can hit the limit. Moreover, when LRU eviction occurs, it can take users some time to determine that LRU eviction was the cause.

Therefore, it would be beneficial to develop a tool like the one proposed in kubectl-cilium. There are two possible approaches: adding this functionality to the cilium CLI within cilium-agent, or integrating it into kubectl-cilium.

@github-actions

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions Bot added the stale label (The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.) Mar 14, 2025
@github-actions github-actions Bot removed the stale label Mar 24, 2025
@github-actions

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Apr 23, 2025

github-actions Bot commented May 7, 2025

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions Bot closed this May 7, 2025

Development

Successfully merging this pull request may close these issues.

If pods establish more external connections than the value specified by --bpf-nat-global-max, it causes a TCP Reset in SNAT behavior for cil-to-netdev
