Description
Error NON-STRING-REPLY: Input/output error occurred when xcvrd tries removing TRANSCEIVER_DOM_SENSOR table in test_check_sfputil_reset.
Mar 23 20:08:34.437008 r-leopard-41 INFO pmon#/supervisord: xcvrd Traceback (most recent call last):
Mar 23 20:08:34.437075 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
Mar 23 20:08:34.437075 r-leopard-41 INFO pmon#/supervisord: xcvrd self.run()
Mar 23 20:08:34.437102 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
Mar 23 20:08:34.437156 r-leopard-41 INFO pmon#/supervisord: xcvrd self._target(*self._args, **self._kwargs)
Mar 23 20:08:34.437156 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 1140, in task_worker
Mar 23 20:08:34.437184 r-leopard-41 INFO pmon#/supervisord: xcvrd del_port_sfp_dom_info_from_db(logical_port, self.port_mapping, xcvr_table_helper.get_intf_tbl(asic_index), xcvr_table_helper.get_dom_tbl(asic_index))
Mar 23 20:08:34.437242 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/local/lib/python3.7/dist-packages/xcvrd/xcvrd.py", line 548, in del_port_sfp_dom_info_from_db
Mar 23 20:08:34.437242 r-leopard-41 INFO pmon#/supervisord: xcvrd dom_tbl._del(port_name)
Mar 23 20:08:34.437242 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2474, in _del
Mar 23 20:08:34.437275 r-leopard-41 INFO pmon#/supervisord: xcvrd return self.delete(*args, **kwargs)
Mar 23 20:08:34.437324 r-leopard-41 INFO pmon#/supervisord: xcvrd File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 2468, in delete
Mar 23 20:08:34.437324 r-leopard-41 INFO pmon#/supervisord: xcvrd return _swsscommon.Table_delete(self, *args, **kwargs)
Mar 23 20:08:34.437324 r-leopard-41 INFO pmon#/supervisord: xcvrd RuntimeError: Expected to get redis type 3 got type 5, err: NON-STRING-REPLY: Input/output error
Steps to reproduce the issue:
- Run
test_check_sfputil_reset test.
The issue is reproduced at a very low probability.
Describe the results you received:
See above.
Describe the results you expected:
There should not be an error.
Output of show version:
SONiC Software Version: SONiC.202111.36-d5354df3d_Internal
Distribution: Debian 11.2
Kernel: 5.10.0-8-2-amd64
Build commit: d5354df3d
Build date: Wed Mar 23 15:13:58 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-241
Platform: x86_64-mlnx_msn4700-r0
HwSKU: ACS-MSN4700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2047X05586
Model Number: MSN4700-WS2FO
Hardware Revision: A1
Uptime: 11:19:26 up 8:06, 1 user, load average: 0.28, 0.30, 0.43
Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-dhcp-relay latest b730b1609fe1 440MB
docker-teamd 202111.36-d5354df3d_Internal 7504fb9dfede 440MB
docker-teamd latest 7504fb9dfede 440MB
docker-syncd-mlnx 202111.36-d5354df3d_Internal 319ca0799459 919MB
docker-syncd-mlnx latest 319ca0799459 919MB
docker-sonic-telemetry 202111.36-d5354df3d_Internal fd34833c0e65 515MB
docker-sonic-telemetry latest fd34833c0e65 515MB
docker-sonic-mgmt-framework 202111.36-d5354df3d_Internal c6b6976eb7a6 582MB
docker-sonic-mgmt-framework latest c6b6976eb7a6 582MB
docker-snmp 202111.36-d5354df3d_Internal d10b4d6808a2 470MB
docker-snmp latest d10b4d6808a2 470MB
docker-sflow 202111.36-d5354df3d_Internal a2d91e19d650 441MB
docker-sflow latest a2d91e19d650 441MB
docker-router-advertiser 202111.36-d5354df3d_Internal c6405b50d408 426MB
docker-router-advertiser latest c6405b50d408 426MB
docker-platform-monitor 202111.36-d5354df3d_Internal 5559e18327e9 811MB
docker-platform-monitor latest 5559e18327e9 811MB
docker-orchagent 202111.36-d5354df3d_Internal 0846a835d5bb 459MB
docker-orchagent latest 0846a835d5bb 459MB
docker-nat 202111.36-d5354df3d_Internal f91e91c53826 443MB
docker-nat latest f91e91c53826 443MB
docker-mux 202111.36-d5354df3d_Internal 8181bb494c0c 479MB
docker-mux latest 8181bb494c0c 479MB
docker-macsec 202111.36-d5354df3d_Internal d2037e3dfc19 444MB
docker-macsec latest d2037e3dfc19 444MB
docker-lldp 202111.36-d5354df3d_Internal a25eecb07e2e 466MB
docker-lldp latest a25eecb07e2e 466MB
docker-fpm-frr 202111.36-d5354df3d_Internal 706fc3e4f24e 459MB
docker-fpm-frr latest 706fc3e4f24e 459MB
docker-database 202111.36-d5354df3d_Internal 1f1cc72c4651 426MB
docker-database latest 1f1cc72c4651 426MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh 1.2.0-202111-internal-20 d9e7b3422f7d 472MB
Output of show techsupport:
(paste your output here or download and attach the file here )
sonic_dump_r-leopard-41_20220323_201136.tar.gz
sonic_dump_r-tigon-17_20220323_202911.tar.gz
Additional information you deem important (e.g. issue happens only occasionally):
It is likely that the issue is caused by socket being shared between parent/child processes of xcvrd.
There are two processes in xcvrd
- A main (parent) process which is responsible for DOM information update. Implemented in class DomInfoUpdateTask
- A child process which is responsible for SFP change event handling. Implemented in class SfpStateUpdateTask
Both child and parent share the global xcvr_table_helper which is an instance of XcvrTableHelper. In XcvrTableHelper it will creates a db_connect object which is on top of a socket. It is created in DaemonXcvrd.init which is called by DaemonXcvrd.run. Later sfp_state_update is created and SfpStateUpdateTask.task_run is called to create the child process which will inherite all the resources, including the socket, with the parent.
When the system runs, both child and parent interact with redis-db server via the socket. The message can interleave between child and parent.
Normal flow should be like
P - parent
C - child
R - redis-db server
P sends request to R
R responds P with rp
P receives and handles rp and continue to run
C sends request to R
R responds P with rc
C receives and handles rc and continue to run
...
But in some rare case, the interleaved flow can be like this:
P send request p to R
C send request c to R
R responds P with rp
R responds C with rc
C receives and handles the response rp which was sent to parent. At this point it complains for receiving a mismatched redis-reply
Description
Error
NON-STRING-REPLY: Input/output erroroccurred whenxcvrdtries removing TRANSCEIVER_DOM_SENSOR table intest_check_sfputil_reset.Steps to reproduce the issue:
test_check_sfputil_resettest.The issue is reproduced at a very low probability.
Describe the results you received:
See above.
Describe the results you expected:
There should not be an error.
Output of
show version:Output of
show techsupport:sonic_dump_r-leopard-41_20220323_201136.tar.gz
sonic_dump_r-tigon-17_20220323_202911.tar.gz
Additional information you deem important (e.g. issue happens only occasionally):
It is likely that the issue is caused by socket being shared between parent/child processes of
xcvrd.There are two processes in
xcvrd- A main (parent) process which is responsible for DOM information update. Implemented in class
DomInfoUpdateTask- A child process which is responsible for SFP change event handling. Implemented in class
SfpStateUpdateTaskBoth child and parent share the global
xcvr_table_helperwhich is an instance ofXcvrTableHelper. InXcvrTableHelperit will creates adb_connectobject which is on top of a socket. It is created inDaemonXcvrd.initwhich is called byDaemonXcvrd.run. Latersfp_state_updateis created andSfpStateUpdateTask.task_runis called to create the child process which will inherite all the resources, including the socket, with the parent.When the system runs, both child and parent interact with redis-db server via the socket. The message can interleave between child and parent.
Normal flow should be like
But in some rare case, the interleaved flow can be like this: