So far, we have seen the link-down alert fire when we upgrade the Arista T2 from the SONiC 202205 image to the 202405 image.
The alert arises from two issues:
- When bumping the image from 202205 to 202405, swss0/1 and syncd0/1 went down. This happened in staging only; we did not see it in the lab when bumping the image. It does not appear to be platform specific. It could be related to the image, but we are not sure why it happens only in staging.
- On the Arista T2 running the 202405 image, after we restart swss0/1 we see syncd interrupts on all LCs, which may be causing the CRC link errors on its peer T1s. The interrupt syslog entries below are the suspected culprit, since their timestamps match the alert creation time:
<30>2025-01-30T21:14:59.666022+00:00 STG01-0101-0400-01T2-sup00 INFO systemd[1]: Started swss@0.service - switch state service.
<13>2025-01-30T21:15:01.666281+00:00 STG01-0101-0400-01T2-sup00 NOTICE root: Started swss0 service...
<14>2025-01-30T21:15:22.349038+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<10>2025-01-30T21:15:22.349099+00:00 STG01-0101-0400-01T2-lc05 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0
<14>2025-01-30T21:15:22.749727+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<14>2025-01-30T21:15:22.840429+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<14>2025-01-30T21:15:23.122929+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<10>2025-01-30T21:15:23.123224+00:00 STG01-0101-0400-01T2-lc04 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0
<14>2025-01-30T21:15:23.173596+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<14>2025-01-30T21:15:23.305868+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<14>2025-01-30T21:15:23.418543+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<14>2025-01-30T21:15:23.565039+00:00 STG01-0101-0400-01T2-lc04 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015
<10>2025-01-30T21:15:23.626048+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd0#syncd: [06:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0
<10>2025-01-30T21:15:24.126213+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd1#syncd: [07:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0
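
To make the timing correlation above easier to check, here is a minimal Python sketch that measures how long after the swss start each line card logged its first interrupt, and how many interrupt lines each one logged. It is not an existing SONiC tool. The function names (`parse`, `summarize`) and the matching substrings (`Started swss@`, `dnxc_interrupt_print_info`, `Device Interrupt`) are assumptions taken from the excerpt above, and `SAMPLE` is just a subset of those lines.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, taken from the excerpt: <PRI>TIMESTAMP HOST SEVERITY MESSAGE
LINE_RE = re.compile(r"^<\d+>(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<sev>\S+)\s+(?P<msg>.*)$")

# Substrings assumed to identify the interrupt lines of interest.
INTERRUPT_MARKERS = ("dnxc_interrupt_print_info", "Device Interrupt")


def parse(line):
    """Return (timestamp, host, severity, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return datetime.fromisoformat(m["ts"]), m["host"], m["sev"], m["msg"]


def summarize(lines):
    """Count interrupt lines per host after the first swss start.

    Returns (swss_start, counts, first_seen), where first_seen maps each host
    to the delay in seconds between the swss start and its first interrupt line.
    """
    swss_start = None
    counts = Counter()
    first_seen = {}
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        ts, host, _sev, msg = rec
        if swss_start is None:
            if "Started swss@" in msg:
                swss_start = ts
            continue
        if any(marker in msg for marker in INTERRUPT_MARKERS):
            counts[host] += 1
            first_seen.setdefault(host, (ts - swss_start).total_seconds())
    return swss_start, counts, first_seen


# A subset of the syslog lines quoted above, copied verbatim.
SAMPLE = [
    "<30>2025-01-30T21:14:59.666022+00:00 STG01-0101-0400-01T2-sup00 INFO systemd[1]: Started swss@0.service - switch state service.",
    "<14>2025-01-30T21:15:22.349038+00:00 STG01-0101-0400-01T2-lc05 INFO syncd#supervisord: syncd 0:dnxc_interrupt_print_info: name=RTP_LinkMaskChange, id=666, index=0, block=0, unit=0, recurring_action=0 | Check RMGR / RTPWP settings in both device and device partner. If configuration is OK - look for physical link error indication and retrain link if needed. | RTP Link Mask Change#015",
    "<10>2025-01-30T21:15:22.349099+00:00 STG01-0101-0400-01T2-lc05 CRIT syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x29a 0x0 0x0",
    "<10>2025-01-30T21:15:23.626048+00:00 STG01-0101-0400-01T2-lc03 CRIT syncd0#syncd: [06:00.0] SAI_API_SWITCH:_brcm_sai_switch_event_cb:950 0x5a1100 Received unhandled switch event - Device Interrupt(13) on unit 0: 0x8b8 0x0 0x0",
]

if __name__ == "__main__":
    start, counts, first_seen = summarize(SAMPLE)
    print("swss start:", start)
    for host in sorted(counts):
        print(f"{host}: {counts[host]} interrupt line(s), first after {first_seen[host]:.3f}s")
```

Running the same function over the full syslog from each restart, and comparing the per-LC delays with the alert creation time on the peer T1s, would show whether the interrupts consistently precede the CRC errors or only happen to coincide in this one capture.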