Skip to content

Fix checkReplyType failed issue via recreating xcvr_table_helper on forking subprocess#255

Merged
prgeor merged 3 commits intosonic-net:masterfrom
stephenxs:fix-xcvrd-racing-condition
Apr 21, 2022
Merged

Fix checkReplyType failed issue via recreating xcvr_table_helper on forking subprocess#255
prgeor merged 3 commits intosonic-net:masterfrom
stephenxs:fix-xcvrd-racing-condition

Conversation

@stephenxs
Copy link
Copy Markdown
Collaborator

@stephenxs stephenxs commented Apr 15, 2022

Description

Fix sonic-net/sonic-buildimage#10530
Fix checkReplyType failed issue via recreating xcvr_table_helper on forking subprocess

In a rare scenario, xcvrd can fail due to the following exception:

xcvrd Traceback (most recent call last):
xcvrd   File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
xcvrd     self.run()
xcvrd   File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
xcvrd     self._target(*self._args, **self._kwargs)
xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1140, in task_worker
xcvrd   File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 548, in del_port_sfp_dom_info_from_db
xcvrd     dom_tbl._del(port_name)
xcvrd   File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2474, in _del
xcvrd     return self.delete(*args, **kwargs)
xcvrd   File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2468, in delete
xcvrd     return _swsscommon.Table_delete(self, *args, **kwargs)
xcvrd RuntimeError: Expected to get redis type 3 got type 5, err: NON-STRING-REPLY: Input/output error

Signed-off-by: Stephen Sun stephens@nvidia.com

Motivation and Context

xcvr_table_helper is an integrated object containing all table and database connector objects for redis database accessing which are socket-based. It is created at the beginning of the xcvrd daemon. The sockets will be shared across parent and children processes after children forked, which is the chief culprit of the issue.

The application communicates with redis-db server in a synchronous way, which means the message flow is like this:

a process sends the request to redis-db server and then blocks
the redis-db server handles and replies the request
the process receives the reply and then continue to run

All the messages will transport via sockets.

For xcvrd, it works in a sparse mode which means the frequency at which it operates redis-db tables is relatively low and a child and the parent rarely send/receive messages at the same time, which means it works well even if two application processes, like a child and the parent, shares the socket via which they communicate with the redis-server in most cases.
However, it is still possible that both children and parent operates redis-db tables at the same time and messages interleave, which can cause a process receives a message which was sent to other processes sharing the socket with it. In this case, checkReplyType fails because the reply type is not expected.

To resolve the issue, the XcvrTableHelper should be a class member instead of a global variable. By doing so, the sockets created by it will not be duplicated among subprocess forking.

How Has This Been Tested?

Regression and manually test.

Additional Information (Optional)

…rking subprocess

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs added the Bug label Apr 15, 2022
@stephenxs stephenxs requested a review from prgeor April 15, 2022 01:53
@stephenxs
Copy link
Copy Markdown
Collaborator Author

We also need this PR to go to 202111 but it can not be cherry-picked. I will open new PR to back port it once this has been approved.

@stephenxs stephenxs marked this pull request as ready for review April 15, 2022 06:13
@stephenxs stephenxs requested a review from keboliu April 15, 2022 06:16
@prgeor prgeor self-assigned this Apr 19, 2022
Copy link
Copy Markdown
Collaborator

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also remove the global reference to 'xcvr_table_helper' in Main task 'init' function?

@prgeor
Copy link
Copy Markdown
Collaborator

prgeor commented Apr 19, 2022

Thanks @stephenxs for finding the bug and fixing it!

Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs requested a review from prgeor April 20, 2022 17:48
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs
Copy link
Copy Markdown
Collaborator Author

Can we also remove the global reference to 'xcvr_table_helper' in Main task 'init' function?

Done. All global xcvr_table_helper has been moved to class scope.

@prgeor prgeor merged commit e0f8a35 into sonic-net:master Apr 21, 2022
@stephenxs stephenxs deleted the fix-xcvrd-racing-condition branch April 22, 2022 00:04
stephenxs added a commit to stephenxs/sonic-platform-daemons that referenced this pull request Apr 22, 2022
…orking subprocess (sonic-net#255)

* Fix message interleaving issue via recreating xcvr_table_helper on forking subprocess

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Address comments: change xcvr_table_helper to class member

Signed-off-by: Stephen Sun <stephens@nvidia.com>

* Fix a typo

Signed-off-by: Stephen Sun <stephens@nvidia.com>
liat-grozovik pushed a commit that referenced this pull request Apr 25, 2022
…orking subprocess (#255) (#256)

This is to backport #255 to 202111 branch
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

2 participants