Skip to content

Fix issue: wrong teamd link watch state after warm reboot#135

Closed
stephenxs wants to merge 4 commits intomasterfrom
fix-teamd-lw-warm-reboot
Closed

Fix issue: wrong teamd link watch state after warm reboot#135
stephenxs wants to merge 4 commits intomasterfrom
fix-teamd-lw-warm-reboot

Conversation

@stephenxs
Copy link
Copy Markdown
Owner

@stephenxs stephenxs commented Mar 1, 2023

Why I did it

Fix issue: wrong teamd link watch state after warm reboot due to TEAM_ATTR_PORT_CHANGED lost

The flag TEAM_ATTR_PORT_CHANGED is maintained by kernel team driver:

  • a flag "changed" is maintained in struct team_port struct
  • the flag is set by __team_port_change_send once relevant information is updated, including port linkup (together with speed, duplex), adding or removing
  • the flag is cleared by team_nl_fill_one_port_get once the updated information has been notified to user space via RTNL

In the userspace, the change flag is maintained by libteam in struct team_port.
The team daemon calls port_priv_change_handler_func on receiving port change event.
The logic in port_priv_change_handler_func

  1. creates the port if it did not exist, which triggers port add event and eventually calls lacp_port_added callback.
  2. triggers port change event if team_port->changed is true, which eventually calls lw_ethtool_event_watch_port_changed to update port state for link watch ethtool.
  3. removes the port if team_port->removed is removed

In lacp_port_added, it calls team_refresh to refresh ifinfo, port info, and option info from the kernel via RTNL.
In this step, port_priv_change_handler_func is called recursively.

  • In the inner call, it won't get TEAM_ATTR_PORT_CHANGED flag because kernel has already notified that.
  • As a result, team_port->changed flag is cleared in the libteam.
  • The port change event won't be triggered from either inner or outer call of port_priv_change_handler_func.

If the port has been up when the port is being added to the team device, the "port up" information is carried in the outer call but will be lost.

In case the flag TEAM_ATTR_PORT_CHANGED is set only in the inner call, function port_priv_change_handler_func can be called in the inner call.
However, it will fail to fetch "enable" options because option_list_init has not be called.

Signed-off-by: Stephen Sun stephens@nvidia.com

How I did it

Fix:
Do not call check_call_change_handlers when parsing RTNL function is called from another check_call_change_handlers recursively.

How to verify it

  • Manually test
  • Regression test
    • warm reboot
    • warm reboot sad lag
    • warm reboot sad lag member
    • warm reboot sad (partial)

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs force-pushed the fix-teamd-lw-warm-reboot branch from f20efeb to 5e4ca30 Compare March 3, 2023 01:23
@stephenxs
Copy link
Copy Markdown
Owner Author

ci 2310 + rerun 275 passed

Copy link
Copy Markdown

@stepanblyschak stepanblyschak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this is a fix for an issue that hapeens due to warm reboot patch? I suggest changing warm reboot patch instead of adding new one in that case.

I understand the recursion happens because of team_refresh in lacp_port_added? Why do we need team_refresh there?

@stephenxs
Copy link
Copy Markdown
Owner Author

Is this is a fix for an issue that hapeens due to warm reboot patch? I suggest changing warm reboot patch instead of adding new one in that case.

I understand the recursion happens because of team_refresh in lacp_port_added? Why do we need team_refresh there?

Regarding merging it into the warm reboot patch, I'm OK only if it's a common practice to merge multiple patches into one. Regarding whether team_refresh is required in lacp_port_added, we need to further check it, but it was there from the beginning.

@stepanblyschak
Copy link
Copy Markdown

@stephenxs I believe so, since it is a fix for a patch that we introduced, it is common to change the original patch (or we can lose track of which patches are backports, which were developed by us) - e.g we already did it in the past -

- https://github.com/sonic-net/sonic-buildimage/pull/3544
- https://github.com/sonic-net/sonic-buildimage/pull/8227

I am Ok with these changes but I would like to better understand why team_refresh is being called from lacp_port_added. I am Ok to approve to discuss it with MSFT.

@stephenxs
Copy link
Copy Markdown
Owner Author

@stephenxs I believe so, since it is a fix for a patch that we introduced, it is common to change the original patch (or we can lose track of which patches are backports, which were developed by us) - e.g we already did it in the past -

- https://github.com/sonic-net/sonic-buildimage/pull/3544
- https://github.com/sonic-net/sonic-buildimage/pull/8227

I am Ok with these changes but I would like to better understand why team_refresh is being called from lacp_port_added. I am Ok to approve to discuss it with MSFT.

Thanks. I'm still checking whether there is a negative effect if team_refresh is removed. So far I didn't see any but still need to check more.
Regarding merging patches, I will do.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs
Copy link
Copy Markdown
Owner Author

@stephenxs I believe so, since it is a fix for a patch that we introduced, it is common to change the original patch (or we can lose track of which patches are backports, which were developed by us) - e.g we already did it in the past -

- https://github.com/sonic-net/sonic-buildimage/pull/3544
- https://github.com/sonic-net/sonic-buildimage/pull/8227

I am Ok with these changes but I would like to better understand why team_refresh is being called from lacp_port_added. I am Ok to approve to discuss it with MSFT.

Hi @stepanblyschak
I do not see any solid reason that team_refresh is required to be called in lacp_port_added.
One possible reason is that lacp_port_added calls team_set_port_enabled which requires options and if the option is not available at the time (eg. hasn't received kernel notification yet), it can accelerate the process by calling team_refresh. But I'm not sure whether it is necessary because in very short time it should receive the notification from the kernel.
So, I'm also open to other options, eg. to remove team_refresh from lacp_port_added.
I think we can discuss the current solution with MSFT to test the water.

btw, I merged the patch to the original one.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs force-pushed the fix-teamd-lw-warm-reboot branch from 6267876 to 9fd6ba5 Compare March 14, 2023 10:48
@stephenxs stephenxs closed this Mar 24, 2023
@stephenxs stephenxs reopened this Mar 24, 2023
@stephenxs
Copy link
Copy Markdown
Owner Author

ci 2388 + rerun 313 passed

@stephenxs stephenxs closed this Mar 29, 2023
@stephenxs stephenxs deleted the fix-teamd-lw-warm-reboot branch April 8, 2023 09:48
stephenxs pushed a commit that referenced this pull request Jul 14, 2023
Why I did it
[Submodule][202211] Advance sonic-restapi pointer

The branch 202012 has already updated to commit 47e4b53.

4f6f979 Fix the redis security issue CVE-2023-28858 and CVE-2023-28859 (#139)
47e4b53 Fix adv_pfx len for ipv6 (#135)
44121be Support ipv6 prefix lenght greater than 64 and check for adv_prefix (#134)
99c467d Add API support for adv prefix and custom monitoring (#133)
347684a Use github code scanning instead of LGTM (#132)
86543d0 Updates to route PATCH API (#129)
a1af82c Install libyang to azure pipeline (#128)
2007c4c Increase coverage threshold (#126)

Work item tracking
Microsoft ADO (number only): 17705422
How I did it
How to verify it
stephenxs pushed a commit that referenced this pull request Jul 27, 2023
…lly (sonic-net#15929)

#### Why I did it
src/sonic-gnmi
```
* fb338d5 - (HEAD -> master, origin/master, origin/HEAD) Merge pull request #135 from liuh-80/dev/liuh/cherry-pick-zmq (3 hours ago) [Hua Liu]
* f8d9c7e - Merge branch 'master' into dev/liuh/cherry-pick-zmq (8 hours ago) [Qi Luo]
* cbd5185 - Fix PR comments (26 hours ago) [liuh-80]
* 226fc31 - Fix PR comments (2 days ago) [liuh-80]
* 6579847 - Fix UT (3 days ago) [liuh-80]
* 53713c3 - Improve code coverage (3 days ago) [liuh-80]
* d8ff562 - Fix UT (3 days ago) [liuh-80]
* c3a66bc - Cherry-pick ZMQ change from nvidia repo (3 days ago) [liuh-80]
```
#### How I did it
#### How to verify it
#### Description for the changelog
stephenxs pushed a commit that referenced this pull request Jul 12, 2024
…utomatically (sonic-net#19415)

#### Why I did it
src/sonic-host-services
```
* 02d9b55 - (HEAD -> master, origin/master, origin/HEAD) Added support to render template format of `delayed` flag on Feature Table. (#135) (28 hours ago) [abdosi]
* 60fdfea - Fixed determine/process reboot-cause service dependency (sonic-net#17406) (#132) (13 days ago) [anamehra]
```
#### How I did it
#### How to verify it
#### Description for the changelog
stephenxs pushed a commit that referenced this pull request Feb 2, 2026
…ically (sonic-net#25237)

#### Why I did it
src/sonic-dash-ha
```
* 9b3c0bf - (HEAD -> master, origin/master, origin/HEAD) Add bfd rewrite on pmon change. (#136) (33 hours ago) [dypet]
* af44396 - [build] Disable debian helper auto install for cargo project. (#135) (2 days ago) [Liu Shilong]
* 17e2e0b - Implement bfd pinned state (#134) (8 days ago) [yue-fred-gao]
* c04969e - switch to using libboost1.83 (#133) (9 days ago) [yijingyan2]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants