Fix checkReplyType failed issue via recreating xcvr_table_helper on forking subprocess#255
Merged
prgeor merged 3 commits intosonic-net:masterfrom Apr 21, 2022
Merged
Conversation
…rking subprocess Signed-off-by: Stephen Sun <stephens@nvidia.com>
Collaborator
Author
|
We also need this PR to go to 202111 but it can not be cherry-picked. I will open new PR to back port it once this has been approved. |
prgeor
reviewed
Apr 19, 2022
Collaborator
prgeor
left a comment
There was a problem hiding this comment.
Can we also remove the global reference to 'xcvr_table_helper' in Main task 'init' function?
Collaborator
|
Thanks @stephenxs for finding the bug and fixing it! |
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
Collaborator
Author
Done. All global |
prgeor
approved these changes
Apr 21, 2022
stephenxs
added a commit
to stephenxs/sonic-platform-daemons
that referenced
this pull request
Apr 22, 2022
…orking subprocess (sonic-net#255) * Fix message interleaving issue via recreating xcvr_table_helper on forking subprocess Signed-off-by: Stephen Sun <stephens@nvidia.com> * Address comments: change xcvr_table_helper to class member Signed-off-by: Stephen Sun <stephens@nvidia.com> * Fix a typo Signed-off-by: Stephen Sun <stephens@nvidia.com>
6 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fix sonic-net/sonic-buildimage#10530
Fix checkReplyType failed issue via recreating xcvr_table_helper on forking subprocess
In a rare scenario,
xcvrdcan fail due to the following exception:Signed-off-by: Stephen Sun stephens@nvidia.com
Motivation and Context
xcvr_table_helperis an integrated object containing all table and database connector objects for redis database accessing which are socket-based. It is created at the beginning of thexcvrddaemon. The sockets will be shared across parent and children processes after children forked, which is the chief culprit of the issue.The application communicates with redis-db server in a synchronous way, which means the message flow is like this:
All the messages will transport via sockets.
For
xcvrd, it works in a sparse mode which means the frequency at which it operates redis-db tables is relatively low and a child and the parent rarely send/receive messages at the same time, which means it works well even if two application processes, like a child and the parent, shares the socket via which they communicate with the redis-server in most cases.However, it is still possible that both children and parent operates redis-db tables at the same time and messages interleave, which can cause a process receives a message which was sent to other processes sharing the socket with it. In this case,
checkReplyTypefails because the reply type is not expected.To resolve the issue, the
XcvrTableHelpershould be a class member instead of a global variable. By doing so, the sockets created by it will not be duplicated among subprocess forking.How Has This Been Tested?
Regression and manually test.
Additional Information (Optional)