[SmartSwitch] Add graceful shutdown and startup handling in platform daemons #703

Merged
yxieca merged 14 commits into sonic-net:master from vvolam:graceful-shutdown on Nov 21, 2025

@vvolam
Contributor

@vvolam vvolam commented Nov 1, 2025

Description

HLD: https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md
These changes build upon enhancements in sonic-platform-daemons#667

This PR introduces graceful shutdown and startup orchestration across SONiC platform daemons to ensure safe DPU and peripheral module transitions during reboot or administrative state changes.

Key updates include:

  • Integration of ModuleBase lifecycle methods (module_pre_shutdown, module_post_startup, and set_admin_state_gracefully) into platform daemons.
  • Graceful handling of PCIe detach/reattach and sensor reload sequences, moved into set_admin_state_gracefully.
  • State tracking in CHASSIS_MODULE_TABLE via STATE_DB to synchronize transition state across processes.
  • File-based operation locks to prevent concurrent access to shared hardware resources.
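
The hook ordering described in the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the real ModuleBase: `SketchModule` and its `calls` list are hypothetical, the hook bodies are stubs, and aborting the transition when the pre-shutdown hook fails is an assumption rather than confirmed behavior.

```python
class SketchModule:
    """Hypothetical stand-in for a platform module; not the real ModuleBase."""

    def __init__(self, name):
        self.name = name
        self.admin_up = True
        self.calls = []  # records the order in which steps ran

    def module_pre_shutdown(self):
        # Stub: a real platform might detach PCIe and unload sensors here.
        self.calls.append("module_pre_shutdown")
        return True

    def set_admin_state(self, up):
        self.calls.append("set_admin_state(up)" if up else "set_admin_state(down)")
        self.admin_up = up
        return True

    def module_post_startup(self):
        # Stub: a real platform might reattach PCIe and reload sensors here.
        self.calls.append("module_post_startup")
        return True

    def set_admin_state_gracefully(self, up):
        """Wrap the admin state change with the matching lifecycle hook."""
        if not up:
            # Going down: quiesce first, then change the admin state.
            if not self.module_pre_shutdown():
                return False  # assumption: abort if the pre-shutdown hook fails
            return self.set_admin_state(False)
        # Coming up: change the admin state first, then run the post-startup hook.
        if not self.set_admin_state(True):
            return False
        return self.module_post_startup()
```

With this shape, a caller issues one call per transition and does not need to know which hook belongs to which direction.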

Motivation and Context

Platform daemons currently perform shutdown and startup independently, leading to:

  • Race conditions during DPU detachment.
  • Inconsistent Redis state across PMON daemons.
  • Uncoordinated sensor and PCIe transitions during reboot.

This change introduces a unified graceful shutdown framework for SmartSwitch modules.
It ensures predictable module transitions, preserves hardware health, and supports orchestrated restarts without transient hardware errors.

How Has This Been Tested?

Testing was performed on DPU-enabled (SmartSwitch) platforms.

Functional validation

  • Verified end-to-end reboot flow with DPU detach/reattach sequence.
  • PCIe state (detaching/attaching) reflected in STATE_DB.
  • pcied daemon logs confirm ordered detach before reboot and reattach after startup.
  • Confirmed no stale Redis entries or orphaned locks post-reboot.
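
One way a file-based operation lock could avoid orphaned locks is to rely on advisory `flock` locks, which the kernel drops when the holding descriptor is closed or the process exits. The sketch below assumes a POSIX system; `operation_lock` and the lock file path are hypothetical names, and the PR's actual locking code may differ.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def operation_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is available
        yield
    finally:
        # Closing the descriptor releases the lock, even if the body raised.
        os.close(fd)
```

A caller would wrap the hardware operation in `with operation_lock("/tmp/example_dpu0.lock"):` (path chosen only for illustration). Because the lock is tied to the open descriptor rather than to the file's existence, a leftover lock file after a reboot does not block the next acquirer.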

Unit tests executed

  • tests/test_DaemonPcied.py
  • tests/test_chassisd_graceful.py

Coverage includes:

  • Transition flag handling
  • Timeout behavior
  • DB write/read operations
  • Graceful admin state flow
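
To make the transition flag and timeout items concrete, here is a small sketch that uses a plain dictionary in place of the STATE_DB table. The function names, the `kind` and `start` fields, and the policy of clearing a flag once it exceeds the timeout are all assumptions for illustration, not the schema or behavior used by the PR.

```python
import time


def mark_transition(table, module, kind, now=None):
    """Record that `module` started a transition of type `kind`."""
    start = time.time() if now is None else now
    table[module] = {"kind": kind, "start": start}


def transition_in_progress(table, module, timeout_s, now=None):
    """Return True while a transition is recorded and has not timed out."""
    now = time.time() if now is None else now
    entry = table.get(module)
    if entry is None:
        return False
    if now - entry["start"] > timeout_s:
        # Assumption: a flag older than the timeout is stale and gets cleared.
        del table[module]
        return False
    return True
```

Passing `now` explicitly keeps the timeout path testable without sleeping, which is one common way unit tests exercise this kind of logic.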

Manual validation

Additional Information (Optional)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

This PR refactors module admin state management by introducing a new set_admin_state_gracefully method that encapsulates the pre-shutdown and post-startup hooks alongside the admin state change. The refactor simplifies the code by removing the ModuleTransitionFlagHelper class and duplicate logic for managing module state transitions.

  • Replaces explicit module_pre_shutdown, set_admin_state, and module_post_startup calls with a single set_admin_state_gracefully method
  • Removes the ModuleTransitionFlagHelper class and all transition flag tracking logic
  • Updates tests to reflect the new simplified API

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| sonic-chassisd/scripts/chassisd | Removes ModuleTransitionFlagHelper class, simplifies submit_callback and submit_dpu_callback methods to use set_admin_state_gracefully, removes duplicate initialization code |
| sonic-chassisd/tests/mock_platform.py | Adds mock implementation of set_admin_state_gracefully method |
| sonic-chassisd/tests/test_chassisd.py | Updates tests to mock set_admin_state_gracefully instead of individual pre/post hooks, adjusts assertions accordingly |
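
The test change summarized above, mocking one method instead of the individual hooks, might look roughly like this. `apply_admin_state` is a hypothetical caller written for this sketch; it is not a function from chassisd, and the real tests may assert different things.

```python
from unittest.mock import MagicMock


def apply_admin_state(module, admin_up):
    """Hypothetical caller: delegate the whole transition to one method."""
    return module.set_admin_state_gracefully(admin_up)


# The test only needs to stub the single graceful entry point.
module = MagicMock()
module.set_admin_state_gracefully.return_value = True

result = apply_admin_state(module, False)

# Assert on the one call, and confirm the caller no longer drives the hooks itself.
module.set_admin_state_gracefully.assert_called_once_with(False)
module.module_pre_shutdown.assert_not_called()
module.module_post_startup.assert_not_called()
```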

@vvolam vvolam changed the title from "Add graceful shutdown and startup handling in platform daemons" to "[SmartSwitch] Add graceful shutdown and startup handling in platform daemons" Nov 1, 2025
@vvolam
Contributor Author

vvolam commented Nov 4, 2025

@rameshraghupathy @gpunathilell could you please review this latest PR?

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

sonic-chassisd/tests/test_chassisd.py:1738

  • Lines 1652-1738 contain orphaned code that is not inside any function definition. This code appears to be leftover from an old test that was removed or refactored. Since this code is at module level, it will execute during test file import rather than as part of a test function, which could cause unintended side effects or test failures. This code block should be removed entirely.
    # Test the chassisd run
    chassis = MockSmartSwitchChassis()

    # DPU0 details
    index = 0
    name = "DPU0"
    desc = "DPU Module 0"
    slot = 0
    sup_slot = 0
    serial = "DPU0-0000"
    module_type = ModuleBase.MODULE_TYPE_DPU
    module = MockModule(index, name, desc, module_type, slot, serial)
    module.set_midplane_ip()

    # Set initial state for DPU0
    status = ModuleBase.MODULE_STATUS_PRESENT
    module.set_oper_status(status)
    chassis.module_list.append(module)

    # Supervisor ModuleUpdater
    module_updater = SmartSwitchModuleUpdater(SYSLOG_IDENTIFIER, chassis)
    module_updater.module_db_update()
    module_updater.modules_num_update()

    # ChassisdDaemon setup
    daemon_chassisd = ChassisdDaemon(SYSLOG_IDENTIFIER, chassis)
    daemon_chassisd.module_updater = module_updater
    daemon_chassisd.stop = MagicMock()
    daemon_chassisd.stop.wait.return_value = True
    daemon_chassisd.smartswitch = True

    # Import platform and use chassis as platform_chassis
    import sonic_platform.platform
    platform_chassis = chassis

    # Mock objects
    mock_chassis = MagicMock()
    mock_module_updater = MagicMock()

    # Mock the module (DPU0)
    mock_module = MagicMock()
    mock_module.get_name.return_value = "DPU0"

    # Mock chassis.get_module to return the mock_module for DPU0
    def mock_get_module(index):
        if index == 0:  # For DPU0
            return mock_module
        return None  # No other modules available in this test case

    # Apply the side effect for chassis.get_module
    mock_chassis.get_module.side_effect = mock_get_module

    # Mock state_db
    mock_state_db = MagicMock()
    # fvs_mock = [True, {CHASSIS_MIDPLANE_INFO_ACCESS_FIELD: 'True'}]
    # mock_state_db.get.return_value = fvs_mock

    # Mock db_connect
    mock_db_connect = MagicMock()
    mock_db_connect.return_value = mock_state_db

    # Mock admin_status
    # mock_module_updater.get_module_admin_status.return_value = 'up'

    # Set access of DPU0 up
    midplane_table = module_updater.midplane_table
    module.set_midplane_reachable(False)
    module_updater.check_midplane_reachability()
    fvs = midplane_table.get(name)
    assert fvs != None
    if isinstance(fvs, list):
        fvs = dict(fvs[-1])
    assert module.get_midplane_ip() == fvs[CHASSIS_MIDPLANE_INFO_IP_FIELD]
    assert str(module.is_midplane_reachable()) == fvs[CHASSIS_MIDPLANE_INFO_ACCESS_FIELD]

    # Patching platform's Chassis object to return the mocked module
    with patch.object(sonic_platform.platform.Chassis, 'is_smartswitch') as mock_is_smartswitch, \
         patch.object(sonic_platform.platform.Chassis, 'get_module', side_effect=mock_get_module):

        # Simulate that the system is a SmartSwitch
        mock_is_smartswitch.return_value = True

        # Patch num_modules for the updater
        with patch.object(daemon_chassisd.module_updater, 'num_modules', 1), \
             patch.object(daemon_chassisd.module_updater, 'get_module_admin_status', return_value='up'):
            # Now run the function that sets the initial admin state
            daemon_chassisd.set_initial_dpu_admin_state()
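
The import-time concern raised in this comment can be demonstrated with a small self-contained sketch. The module source, the `EVENTS` list, and the `import_from_source` helper are hypothetical and exist only for this illustration; they are unrelated to test_chassisd.py.

```python
import importlib.util
import os
import tempfile
import textwrap

# A tiny module with one orphaned statement and one properly wrapped test.
SOURCE = textwrap.dedent("""
    EVENTS = []
    EVENTS.append("module level")  # orphaned statement: runs at import

    def test_wrapped():
        EVENTS.append("inside test")  # runs only when the test is called
""")


def import_from_source(source, name="orphan_demo"):
    """Write `source` to a temporary file and import it as a module."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name + ".py")
        with open(path, "w") as f:
            f.write(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # executes all module-level statements
        return module


demo = import_from_source(SOURCE)
events_after_import = list(demo.EVENTS)  # snapshot before any test is called
demo.test_wrapped()
events_after_test = list(demo.EVENTS)
```

The snapshot taken right after import already contains the module-level entry, while the wrapped entry appears only once the function is called. The same applies when a test runner imports a test file during collection.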

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam
Contributor Author

vvolam commented Nov 20, 2025

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from gpunathilell November 20, 2025 01:11
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca yxieca merged commit 10b787c into sonic-net:master Nov 21, 2025
5 checks passed
@vvolam vvolam deleted the graceful-shutdown branch November 21, 2025 19:00
@mssonicbld
Collaborator

Cherry-pick PR to 202511: #726
