Skip to content

bpf: ICMP checksum handling in SNAT RevNAT#43196

Merged
julianwiedmann merged 2 commits intocilium:mainfrom
yushoyamaguchi:revnat_snat_csum
Jan 13, 2026
Merged

bpf: ICMP checksum handling in SNAT RevNAT#43196
julianwiedmann merged 2 commits intocilium:mainfrom
yushoyamaguchi:revnat_snat_csum

Conversation

@yushoyamaguchi
Copy link
Copy Markdown
Member

@yushoyamaguchi yushoyamaguchi commented Dec 7, 2025

Fixes: #40827

When a LB node forwards packets to a backend-pod on another node,
the packets are SNATed.
The corresponding reverse NAT (RevNAT) is then applied
to the reply packets.

Checksum recaluculation is required when RevNAT.
When the reply packet is an ICMP error packet, the checksum
in the ICMP header must be recalculated to reflect the changes
of the inner packet header.
https://datatracker.ietf.org/doc/html/rfc5508#section-4.1

Since that logic was missing, it has been added in this commit.

The below is an error message of nat4_icmp_error_tcp_snat_revnat (new test) without the checksum recalculate patch.

An error message of RevNAT BPF-test
--- FAIL: TestBPF (39.66s)
--- FAIL: TestBPF/icmp_error_revnat.o (0.55s)
--- FAIL: TestBPF/icmp_error_revnat.o/nat4_icmp_error_tcp_snat_revnat (0.42s)

    bpf_test.go:577: icmp_error_revnat.c:137: CTX and buffer 'ICMP4_ERR_FRAG_NEEDED_AFTER_REVNAT' content mismatch 
    bpf_test.go:421:   Scapy asserts failed: 1
    bpf_test.go:428: 
        === START ./scapy.h:171 'icmp4_revnat_ok' ===
        >>> Expected (len: 164 bytes) <<<
        ###[ Ethernet ]### 
          dst       = 13:37:13:37:13:37
          src       = de:ad:be:ef:de:ef
          type      = IPv4
        ###[ IP ]### 
             version   = 4
             ihl       = 5
             tos       = 0x0
             len       = 68
             id        = 1
             flags     = 
             frag      = 0
             ttl       = 64
             proto     = icmp
             chksum    = 0xa60d
             src       = 192.168.0.2
             dst       = 10.0.10.1
             \options   \
        ###[ ICMP ]### 
                type      = dest-unreach
                code      = fragmentation-needed
                chksum    = 0xcbe5
                reserved  = 0
                length    = 0
                nexthopmtu= 1500
                unused    = ''
        ###[ IP in ICMP ]### 
                   version   = 4
                   ihl       = 5
                   tos       = 0x0
                   len       = 40
                   id        = 1
                   flags     = DF
                   frag      = 0
                   ttl       = 64
                   proto     = tcp
                   chksum    = 0x6624
                   src       = 10.0.10.1
                   dst       = 192.168.0.2
                   \options   \
        ###[ TCP in ICMP ]### 
                      sport     = 3030
                      dport     = http
                      seq       = 0
                      ack       = 0
                      dataofs   = 5
                      reserved  = 0
                      flags     = S
                      window    = 8192
                      chksum    = 0xaf11
                      urgptr    = 0
                      options   = ''
        
        >>> Got (len: 164 bytes) <<<
        ###[ Ethernet ]### 
          dst       = 13:37:13:37:13:37
          src       = de:ad:be:ef:de:ef
          type      = IPv4
        ###[ IP ]### 
             version   = 4
             ihl       = 5
             tos       = 0x0
             len       = 68
             id        = 1
             flags     = 
             frag      = 0
             ttl       = 64
             proto     = icmp
             chksum    = 0xa60d
             src       = 192.168.0.2
             dst       = 10.0.10.1
             \options   \
        ###[ ICMP ]### 
                type      = dest-unreach
                code      = fragmentation-needed
                chksum    = 0x788e
                reserved  = 0
                length    = 0
                nexthopmtu= 1500
                unused    = ''
        ###[ IP in ICMP ]### 
                   version   = 4
                   ihl       = 5
                   tos       = 0x0
                   len       = 40
                   id        = 1
                   flags     = DF
                   frag      = 0
                   ttl       = 64
                   proto     = tcp
                   chksum    = 0x6624
                   src       = 10.0.10.1
                   dst       = 192.168.0.2
                   \options   \
        ###[ TCP in ICMP ]### 
                      sport     = 3030
                      dport     = http
                      seq       = 0
                      ack       = 0
                      dataofs   = 5
                      reserved  = 0
                      flags     = S
                      window    = 8192
                      chksum    = 0xaf11
                      urgptr    = 0
                      options   = ''
        
        >>> Diff <<<
          --- a/pkt (Expected)
          +++ b/pkt (Got)
        
          - [ Ether ][ IP ][ ICMP ].chksum: 52197
          + [ Ether ][ IP ][ ICMP ].chksum: 30862
        
        === END ./scapy.h:171 'icmp4_revnat_ok' ===

┌──────────────────────────────────────────────────────────────────────────────────────────────┐
│  STATUS │ ELAPSED │                  PACKAGE                   │ COVER │ PASS │ FAIL │ SKIP  │
│─────────┼─────────┼────────────────────────────────────────────┼───────┼──────┼──────┼───────│
│  FAIL   │ 39.70s  │ github.com/cilium/cilium/bpf/tests/bpftest │  --   │ 594  │  3   │  0    │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
Fix ICMP error packet handling by adding the missing checksum recalculation performed during RevNAT for SNATed load-balanced traffic.

@yushoyamaguchi yushoyamaguchi requested a review from a team as a code owner December 7, 2025 14:59
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Dec 7, 2025
@yushoyamaguchi yushoyamaguchi force-pushed the revnat_snat_csum branch 2 times, most recently from c3069d6 to 89f6154 Compare December 7, 2025 15:49
@yushoyamaguchi
Copy link
Copy Markdown
Member Author

yushoyamaguchi commented Dec 7, 2025

This fix is required before we support #32029.
In current implementation, even if we support #32029, reply ICMP error packet is not received because of the incorrect checksum.

@yushoyamaguchi
Copy link
Copy Markdown
Member Author

yushoyamaguchi commented Dec 7, 2025

@pchaigno @julianwiedmann

Sorry for the delay.
Please review.

@yushoyamaguchi yushoyamaguchi changed the title ICMP csum handling in SNAT RevNAT ICMP checksum handling in SNAT RevNAT Dec 7, 2025
@yushoyamaguchi
Copy link
Copy Markdown
Member Author

/test

@yushoyamaguchi yushoyamaguchi changed the title ICMP checksum handling in SNAT RevNAT bpf: ICMP checksum handling in SNAT RevNAT Dec 10, 2025
@pchaigno
Copy link
Copy Markdown
Member

It looks like the CI may be failing because of #15474. You may need to rebase to pass.

@yushoyamaguchi yushoyamaguchi force-pushed the revnat_snat_csum branch 4 times, most recently from 7ab85c9 to 77eb7ca Compare December 22, 2025 12:15
@pchaigno pchaigno added area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Jan 6, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 6, 2026
@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Jan 6, 2026

You'll need to address the warning from Check Patch. The two other failures look like flakes.

Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Looks good to me overall - I added a few comments, although I think we shouldn't attempt to optimize the logic further for now. Let's do that once we have addressed the bug.

One aspect I'd still want to confirm is what the effect on skb->csum is. Not sure whether we can easily validate that from within the BPF test suite today.

Comment on lines +539 to +566
static __always_inline void
snat_v4_calc_icmp_error_csum_diff(__be32 old_addr, __be32 new_addr,
__be16 old_port, __be16 new_port,
bool inner_has_l4_csum, __wsum *diff_for_csum)
{
__be32 old_port32 = (__be32)old_port;
__be32 new_port32 = (__be32)new_port;

*diff_for_csum = 0;

if (inner_has_l4_csum) {
/* Calculate diff value for checksum.
* Reflect the change in the inner L4 checksum caused by the pseudo-address update
* into diff_for_csum.
* All the other changes in inner packet cancel each other out.
*/
if (old_addr != new_addr)
*diff_for_csum = csum_diff(&new_addr, 4, &old_addr, 4, 0);
} else {
/* Calculate diff value for checksum.
* If the inner L4 header does not include the L4 checksum,
* only the port is modified within the inner L4 header.
*/
if (old_port != new_port)
*diff_for_csum = csum_diff(&old_port32, 4, &new_port32, 4, 0);
}
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eventually we'll want to integrate this logic into snat_v4_rewrite_headers() (so it literally tells the caller what csum-relevant changes it made, and the caller can then feed that information into the next stage of header rewrites). But not today.

🤯 for the "truncated L4" scenario, this is such an annoying corner case to handle ...

Copy link
Copy Markdown
Member

@msune msune left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only reviewed the Scapy part. Overall LGTM. Minor changes to style + naming required.

Thanks!

@yushoyamaguchi yushoyamaguchi force-pushed the revnat_snat_csum branch 2 times, most recently from e7ad1ad to 67a8a5c Compare January 9, 2026 14:21
@yushoyamaguchi yushoyamaguchi requested a review from msune January 10, 2026 02:50
@yushoyamaguchi
Copy link
Copy Markdown
Member Author

/test

When a LB node forwards packets to a backend-pod on another node,
the packets are SNATed.
The corresponding reverse NAT (RevNAT) is then applied
to the reply packets.

Checksum recaluculation is required when RevNAT.
When the reply packet is an ICMP error packet, the checksum
in the ICMP header must be recalculated to reflect the changes
of the inner packet header.
https://datatracker.ietf.org/doc/html/rfc5508#section-4.1

Since that logic was missing, it has been added in this commit.

Fixes: cilium#40827

Signed-off-by: Yusho Yamaguchi <ysh.824@outlook.jp>
This commit adds a new test case of SNAT RevNAT using scapy.
By this test case, we could also validate checksum of outputed packet.

Signed-off-by: Yusho Yamaguchi <ysh.824@outlook.jp>
@julianwiedmann julianwiedmann added the feature/snat Relates to SNAT or Masquerading of traffic label Jan 13, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

/test

Copy link
Copy Markdown
Member

@julianwiedmann julianwiedmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks good to me!

So as follow-up actions, we'd want to

  1. add more scapy tests (IPv4 and IPv6),
  2. fix up the forward path (snat_v4_nat_handle_icmp_error()),
  3. look at the suggested optimizations

@julianwiedmann
Copy link
Copy Markdown
Member

julianwiedmann commented Jan 13, 2026

One aspect I'd still want to confirm is what the effect on skb->csum is. Not sure whether we can easily validate that from within the BPF test suite today.

for the record - the additional l4_csum_replace() call also affects skb->csum. But that's exactly what we need - it matches our update to the outer L4 header's csum field.

@julianwiedmann julianwiedmann added this pull request to the merge queue Jan 13, 2026
Merged via the queue into cilium:main with commit e27cc7d Jan 13, 2026
75 of 76 checks passed
@msune
Copy link
Copy Markdown
Member

msune commented Jan 13, 2026

@julianwiedmann should this be backported?

@julianwiedmann julianwiedmann added backport/author The backport will be carried out by the author of the PR. affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jan 13, 2026
@julianwiedmann
Copy link
Copy Markdown
Member

@yushoyamaguchi would you want to give the v1.18 backport a try? We don't have scapy there, so it would be landing without a test ...

@yushoyamaguchi
Copy link
Copy Markdown
Member Author

@julianwiedmann

Thank you for your suggestion!
If it is okay to backport this to v1.18 without test, I want to try.

To confirm the process:

  • create a branch off v1.18.
  • cherry-pick the checksum fix commit.
  • open a PR back to v1.18.

Does that sound good?

@julianwiedmann
Copy link
Copy Markdown
Member

@julianwiedmann

Thank you for your suggestion! If it is okay to backport this to v1.18 without test, I want to try.

To confirm the process:

* create a branch off `v1.18`.

* cherry-pick the checksum fix commit.

* open a PR back to `v1.18`.

Does that sound good?

yep sounds good! There's full documentation here as well.

@yushoyamaguchi
Copy link
Copy Markdown
Member Author

yushoyamaguchi commented Jan 18, 2026

@julianwiedmann

I've made #43861.

There is no bpf-test to check the effect of this patch in v1.18, but I confirmed that cilium built from this version works correctly about RevSNAT checksum in my KinD cluster

@pchaigno pchaigno added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Jan 18, 2026
@github-actions github-actions bot added backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Jan 18, 2026
@cilium-release-bot cilium-release-bot bot moved this to Released in cilium v1.19.0 Feb 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects/v1.16 This issue affects v1.16 branch affects/v1.17 This issue affects v1.17 branch area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. backport/author The backport will be carried out by the author of the PR. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. feature/snat Relates to SNAT or Masquerading of traffic release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

No open projects
Status: Released

Development

Successfully merging this pull request may close these issues.

Bug: ICMP error packets have incorrect checksum after SNAT RevNAT in LB-node

4 participants