[action] [PR:271] Improve LC reboot cause for supervisor heartbeat loss#291

Merged
mssonicbld merged 1 commit into sonic-net:202505 from mssonicbld:cherry/202505/271
Jul 16, 2025

@mssonicbld

When the supervisor reboots ungracefully (e.g., after a kernel panic), the line cards (LCs) lose their connection to it and are subsequently rebooted. This change adds 'heartbeat loss' as a software reboot cause for LCs.

It also modifies the reboot cause logic to prioritize this software heartbeat loss cause over any hardware triggers that occur during the supervisor-initiated LC reboot. This ensures accurate reporting of the LC's reboot reason.

With the new changes, the output is:

1. When there is a graceful restart on the Supervisor

```
admin@nfc405-3:~$ show reboot-cause
User issued 'Reboot from Supervisor' command [User: Supervisor, Time: Tue Jul 15 07:03:09 PM UTC 2025]
admin@nfc405-3:~$
admin@nfc405-3:~$ show reboot-cause history
Name                 Cause                                                                                                     Time                             User        Comment
-------------------  --------------------------------------------------------------------------------------------------------  -------------------------------  ----------  -----------------------------------------------------------------------------------------------------------------------------------
2025_07_15_19_07_16  Reboot from Supervisor                                                                                    Tue Jul 15 07:03:09 PM UTC 2025  Supervisor  N/A
```

2. When there is an ungraceful restart on the Supervisor

```
admin@nfc405-3:~$ show reboot-cause
Heartbeat with the Supervisor card lost
admin@nfc405-3:~$
admin@nfc405-3:~$ show reboot-cause history
Name                 Cause                                                                                                     Time                             User        Comment
-------------------  --------------------------------------------------------------------------------------------------------  -------------------------------  ----------  -----------------------------------------------------------------------------------------------------------------------------------
2025_07_15_19_30_31  Heartbeat with the Supervisor card lost                                                                   N/A                              N/A         N/A
```
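
The prioritization described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name `pick_reboot_cause`, the constants, and the fallback order for the non-heartbeat cases are assumptions made for illustration. Only the cause string is taken from the CLI output shown above.

```python
from typing import Optional

# The cause string mirrors the CLI output above; the constant names are hypothetical.
HEARTBEAT_LOSS = "Heartbeat with the Supervisor card lost"
UNKNOWN = "Unknown"


def pick_reboot_cause(software_cause: Optional[str],
                      hardware_cause: Optional[str]) -> str:
    """Return the reboot cause to report, under the assumed priority rules."""
    # A software-recorded heartbeat loss wins over any hardware trigger
    # seen while the supervisor-initiated LC reboot was in progress.
    if software_cause == HEARTBEAT_LOSS:
        return HEARTBEAT_LOSS
    # Assumption: in all other cases a known hardware cause is preferred.
    if hardware_cause and hardware_cause != UNKNOWN:
        return hardware_cause
    # Otherwise fall back to whatever software cause was recorded.
    if software_cause and software_cause != UNKNOWN:
        return software_cause
    return UNKNOWN


if __name__ == "__main__":
    # "Power Loss" is a placeholder hardware cause for the example.
    print(pick_reboot_cause(HEARTBEAT_LOSS, "Power Loss"))
```

In this sketch, a heartbeat loss recorded in software is reported even when a hardware trigger was also observed, which is the behavior the description asks for.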
@mssonicbld
Author

Original PR: #271

@mssonicbld
Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld merged commit 5f79c8f into sonic-net:202505 Jul 16, 2025
5 checks passed
gpunathilell pushed a commit to gpunathilell/sonic-host-services that referenced this pull request Sep 24, 2025
```
* 5f79c8f - (HEAD -> 202506, origin/202505) Improve LC reboot cause for supervisor heartbeat loss (sonic-net#291) (2025-07-16) [mssonicbld]
* d86b612 - Fix ProcessStatsST column name issue and add test case to cover check (sonic-net#286) (2025-07-10) [mssonicbld]
```