Improve LC reboot cause for supervisor heartbeat loss #271

Merged
judyjoseph merged 1 commit into sonic-net:master from ymd-arista:master
Jul 15, 2025


@ymd-arista
Contributor

@ymd-arista ymd-arista commented Jun 18, 2025

When the supervisor reboots ungracefully (e.g., kernel panic), LCs lose connection and are subsequently rebooted. This change adds 'heartbeat loss' as a software reboot cause for LCs.

It also modifies the reboot cause logic to prioritize this software heartbeat loss cause over any hardware triggers that occur during the supervisor-initiated LC reboot. This ensures accurate reporting of the LC's reboot reason.
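
The prioritization described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `pick_reboot_cause`, the `UNKNOWN` fallback, and the ordering of the non-heartbeat branches are assumptions. Only the heartbeat-loss cause string is taken from the output shown in this description.

```python
from typing import Optional

# Cause string as it appears in the `show reboot-cause` output in this PR.
HEARTBEAT_LOSS = "Heartbeat with the Supervisor card lost"
# Hypothetical fallback value, used only for this illustration.
UNKNOWN = "Unknown"


def pick_reboot_cause(software_cause: Optional[str],
                      hardware_cause: Optional[str]) -> str:
    """Choose which reboot cause to report (hypothetical helper)."""
    # Heartbeat loss is a software cause, but it takes priority over any
    # hardware trigger recorded while the supervisor reboots the LC.
    if software_cause == HEARTBEAT_LOSS:
        return software_cause
    # Assumption: for every other case a known hardware cause is reported
    # first, then any software cause, then the fallback.
    if hardware_cause:
        return hardware_cause
    if software_cause:
        return software_cause
    return UNKNOWN
```

For example, `pick_reboot_cause(HEARTBEAT_LOSS, "Power Loss")` returns the heartbeat-loss string, even though a hardware cause was also recorded.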

With the new changes, the output is:

  1. When there's a graceful restart on Supervisor
admin@nfc405-3:~$ show reboot-cause
User issued 'Reboot from Supervisor' command [User: Supervisor, Time: Tue Jul 15 07:03:09 PM UTC 2025]
admin@nfc405-3:~$
admin@nfc405-3:~$ show reboot-cause history
Name                 Cause                                                                                                     Time                             User        Comment
-------------------  --------------------------------------------------------------------------------------------------------  -------------------------------  ----------  -----------------------------------------------------------------------------------------------------------------------------------
2025_07_15_19_07_16  Reboot from Supervisor                                                                                    Tue Jul 15 07:03:09 PM UTC 2025  Supervisor  N/A
  2. When there's an ungraceful restart on Supervisor
admin@nfc405-3:~$ show reboot-cause
Heartbeat with the Supervisor card lost
admin@nfc405-3:~$
admin@nfc405-3:~$ show reboot-cause history
Name                 Cause                                                                                                     Time                             User        Comment
-------------------  --------------------------------------------------------------------------------------------------------  -------------------------------  ----------  -----------------------------------------------------------------------------------------------------------------------------------
2025_07_15_19_30_31  Heartbeat with the Supervisor card lost                                                                   N/A                              N/A         N/A
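
To check these two cases in a script, the first line printed by `show reboot-cause` could be parsed along these lines. This is only a sketch: the helper name `parse_reboot_cause` and the regular expression are assumptions derived from the two sample outputs above, not an existing SONiC API.

```python
import re
from typing import Dict

# Pattern inferred from the graceful-restart sample above (an assumption,
# not a documented format).
_USER_ISSUED = re.compile(
    r"^User issued '(?P<cause>[^']+)' command "
    r"\[User: (?P<user>[^,]+), Time: (?P<time>[^\]]+)\]$"
)


def parse_reboot_cause(line: str) -> Dict[str, str]:
    """Split one `show reboot-cause` line into cause, user, and time."""
    line = line.strip()
    match = _USER_ISSUED.match(line)
    if match:
        return match.groupdict()
    # Lines without user/time details (such as the heartbeat-loss case)
    # are treated as a bare cause, mirroring the N/A columns in the history.
    return {"cause": line, "user": "N/A", "time": "N/A"}
```

With the samples above, the graceful case yields the cause `Reboot from Supervisor` with user `Supervisor`, and the ungraceful case yields the heartbeat-loss string with `N/A` for user and time.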

@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ymd-arista ymd-arista changed the title from "Improve LC reboot cause for supervisor heartbeat" to "Improve LC reboot cause for supervisor heartbeat loss" Jun 18, 2025
@arlakshm arlakshm requested a review from mlok-nokia June 18, 2025 17:58
@arlakshm
Contributor

Target release 202511

When the supervisor reboots ungracefully (e.g., kernel panic),
LCs lose connection and are subsequently rebooted. This change
adds 'heartbeat loss' as a software reboot cause for LCs.

It also modifies the reboot cause logic to prioritize this
software heartbeat loss cause over any hardware triggers
that occur during the supervisor-initiated LC reboot.
This ensures accurate reporting of the LC's reboot reason.

Signed-off-by: Mohan Yelugoti <ymd@arista.com>
@mssonicbld

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@rlhui rlhui requested review from abdosi and arlakshm June 25, 2025 17:08
@rlhui

rlhui commented Jul 2, 2025

@arlakshm reminder on this, thx

@judyjoseph
Contributor

@ymd-arista can you add the test results of "show reboot-cause history" on the LC to the description of this PR?

  1. When there is a graceful restart of the SUP.
  2. When there is an ungraceful restart of the SUP.

@ymd-arista
Contributor Author

@judyjoseph : Updated description to show output of both cases.

@mssonicbld

Cherry-pick PR to 202505: #291

liamkearney-msft pushed a commit to liamkearney-msft/sonic-host-services that referenced this pull request Aug 6, 2025
When the supervisor reboots ungracefully (e.g., kernel panic),
LCs lose connection and are subsequently rebooted. This change
adds 'heartbeat loss' as a software reboot cause for LCs.

It also modifies the reboot cause logic to prioritize this
software heartbeat loss cause over any hardware triggers
that occur during the supervisor-initiated LC reboot.
This ensures accurate reporting of the LC's reboot reason.

Signed-off-by: Mohan Yelugoti <ymd@arista.com>
(Cherry-picked from master)
Signed-off-by: Liam Kearney <liamkearney@microsoft.com>