Skip to content

Fix up dplane handling of some edge cases#18919

Merged
riw777 merged 3 commits intoFRRouting:masterfrom
donaldsharp:large_ecmp_down_the_pipe
Jun 5, 2025
Merged

Fix up dplane handling of some edge cases#18919
riw777 merged 3 commits intoFRRouting:masterfrom
donaldsharp:large_ecmp_down_the_pipe

Conversation

@donaldsharp
Copy link
Copy Markdown
Member

When using no fpm use-nexthop-group and we have 512 way ecmp and ipv6 routes 8k buffer is not enough to actually encode the route to the dplane. Add code to stop this problem from happening. Add code to let the ctx carry the failure back up to zebra. Add code to let zebra set the flags appropriately.

@mjstapp @pbrisset you two should put your heads together and make sure you are happy with this change. If so I would recommend that someone port the dplane_fpm_nl.c code changes over to the sonic dplane module.

@mjstapp mjstapp self-requested a review June 2, 2025 11:55
Currently the buffer size used for encoding of routes is
8192 bytes.  This is not a great choice for when you are using
512 way ECMP with v6 nexthops.  You quickly run out of space.
Let's just make the size big enough to just work for the forseeable
future.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
When a dplane install fails in the dplane_fpm_nl
code.  Let the upper level protocol know that something
has gone amiss.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@donaldsharp donaldsharp force-pushed the large_ecmp_down_the_pipe branch from 9637f09 to 32c5c45 Compare June 4, 2025 18:19
Copy link
Copy Markdown
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

Currently when using the dplane_fpm_nl module and a route installation
fails in that code, zebra is being notified that the route failed
to install but was doing nothing about it.  Leaving the routes
in a `queued` state.  Which is not a good.  Modify the code
to remove the queued state and properly notice that the route
installation has failed for the operator.

New behavior(Note that I've intentionally introduced the problem):
2025-05-30 19:30:29.796 [ERR!] zebra: [X4VXS-XMMJ2] fpm_nl_enqueue: netlink_route_multipath_msg_encode failed
2025-05-30 19:30:29.796 [DEBG] zebra: [JGWSB-SMNVE] dplane: incoming new work counter: 0
2025-05-30 19:30:29.796 [DEBG] zebra: [QP8G8-08P0Q] dplane enqueues 0 new work to provider 'Kernel' curr is 0
2025-05-30 19:30:29.796 [DEBG] zebra: [JVY1P-93VFY] dplane provider 'Kernel': processing
2025-05-30 19:30:29.796 [DEBG] zebra: [VCDW6-A7ZF1] dplane dequeues 0 completed work from provider Kernel
2025-05-30 19:30:29.796 [DEBG] zebra: [QP8G8-08P0Q] dplane enqueues 0 new work to provider 'dplane_fpm_nl' curr is 0
2025-05-30 19:30:29.796 [DEBG] zebra: [VCDW6-A7ZF1] dplane dequeues 3 completed work from provider dplane_fpm_nl
2025-05-30 19:30:29.796 [DEBG] zebra: [JTWAB-1MH4Y] dplane has 3 completed, 0 errors, for zebra main
2025-05-30 19:30:29.796 [DEBG] zebra: [J7K9Z-9M7DT] Nexthop dplane ctx 0x55c0f6258700, op NH_INSTALL, nexthop ID (17), result SUCCESS
2025-05-30 19:30:29.796 [DEBG] zebra: [J7K9Z-9M7DT] Nexthop dplane ctx 0x55c0f624add0, op NH_INSTALL, nexthop ID (16), result SUCCESS
2025-05-30 19:30:29.796 [DEBG] zebra: [S17A6-6B1X2] default(0:254):1::1/128 Processing dplane result ctx 0x55c0f624c000, op ROUTE_UPDATE result FAILURE
2025-05-30 19:30:29.796 [DEBG] zebra: [TFH34-6R9W1] default(0:254):1::1/128 Stale dplane result for old_re 0x55c0f62569d0
2025-05-30 19:30:29.796 [WARN] zebra: [VYKYC-709DP] default(0:254):1::1/128: Route install failed
r1(config)# do show ipv6 route
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv6 unicast VRF default:
S>r 1::1/128 [1/0] via fe80::1, r1-eth0, weight 1, 00:00:03
  r                via fe80::2, r1-eth0, weight 1, 00:00:03
  r                via fe80::3, r1-eth0, weight 1, 00:00:03
C>* fe80::/64 is directly connected, r1-eth0, weight 1, 00:00:58

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@donaldsharp donaldsharp force-pushed the large_ecmp_down_the_pipe branch from 32c5c45 to e4acb14 Compare June 4, 2025 18:28
Copy link
Copy Markdown
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@riw777 riw777 merged commit 76ef539 into FRRouting:master Jun 5, 2025
13 checks passed
r12f pushed a commit to Azure/sonic-buildimage-msft that referenced this pull request Jun 13, 2025
<!--
Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

** Make sure all your commits include a signature generated with `git
commit -s` **

If this is a bug fix, make sure your description includes "fixes #xxxx",
or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it
When a route with 512 nexthops is programmed, show ipv6 route shows the
route as queued, even though the route gets programmed in the hardware.
Porting the fix FRRouting/frr#18919 from FRR as
well as aligning dplane_fpm_sponic

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Modifying dplane_fpm_sonic and added patches.

#### How to verify it
Install routes with 512 nexthops and confirm if they are not queued.

<!--
If PR needs to be backported, then the PR must be tested against the
base branch and the earliest backport release branch and provide tested
image version on these two branches. For example, if the PR is requested
for master, 202211 and 202012, then the requester needs to provide test
results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
Ensure to add label/tag for the feature raised. example - PR#2174 under
sonic-utilities repo. where, Generic Config and Update feature has been
labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on
https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
StormLiangMS pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 23, 2025
Why I did it
When a route with 512 nexthops is programmed, show ipv6 route shows the route as queued, even though the route gets programmed in the hardware. Porting the fix FRRouting/frr#18919 from FRR as well as aligning dplane_fpm_sponic

Work item tracking
Microsoft ADO (number only):
How I did it
Modifying dplane_fpm_sonic and added patches.

How to verify it
Install routes with 512 nexthops and confirm if they are not queued.
mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Jun 23, 2025
<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it
When a route with 512 nexthops is programmed, show ipv6 route shows the route as queued, even though the route gets programmed in the hardware. Porting the fix FRRouting/frr#18919 from FRR as well as aligning dplane_fpm_sponic

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Modifying dplane_fpm_sonic and added patches.

#### How to verify it
Install routes with 512 nexthops and confirm if they are not queued.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 202205
- [ ] 202211
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
mssonicbld added a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 23, 2025
<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it
When a route with 512 nexthops is programmed, show ipv6 route shows the route as queued, even though the route gets programmed in the hardware. Porting the fix FRRouting/frr#18919 from FRR as well as aligning dplane_fpm_sponic

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Modifying dplane_fpm_sonic and added patches.

#### How to verify it
Install routes with 512 nexthops and confirm if they are not queued.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 202205
- [ ] 202211
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
@donaldsharp donaldsharp deleted the large_ecmp_down_the_pipe branch July 30, 2025 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants