Skip to content

Warm/Fast reboot stability degradation #15076

@nazariig

Description

@nazariig

Description

The issue is caused by recent changes aimed to improve fast boot performance: the dependency on interfaces-config has been removed from swss/syncd.
This end up in a race between networking service restarting and accessing redis db from within swss/syncd.

sonic-buildimage/files/scripts/syncd_common.sh

function check_warm_boot()
{
    SYSTEM_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|system" enable`
    SERVICE_WARM_START=`$SONIC_DB_CLI STATE_DB hget "WARM_RESTART_ENABLE_TABLE|${SERVICE}" enable`
    # SYSTEM_WARM_START could be empty, always make WARM_BOOT meaningful.
    if [[ x"$SYSTEM_WARM_START" == x"true" ]] || [[ x"$SERVICE_WARM_START" == x"true" ]]; then
        WARM_BOOT="true"
    else
        WARM_BOOT="false"
    fi
}

#
# system boot: reboot type - fastfast
#

May  6 11:59:25.207554 r-leopard-01 INFO kernel: [    0.000000] Command line: BOOT_IMAGE=/image-202211_RC12.1-4ee027200_Internal/boot/vmlinuz-5.10.0-18-2-amd64 root=UU
ID=f25e4512-e4b1-4534-90fb-fc3ac1dba0e7 rw console=tty0 console=ttyS0,9600n8 quiet intel_idle.max_cstate=0 net.ifnames=0 biosdevname=0 loop=image-202211_RC12.1-4ee0272
00_Internal/fs.squashfs loopfstype=squashfs systemd.unified_cgroup_hierarchy=0 apparmor=1 security=apparmor varlog_size=4096 usbcore.autosuspend=-1 SONIC_BOOT_TYPE=fas
tfast

#
# swss/syncd service start: failed to get warm-boot flag - Unable to connect to redis: Cannot assign requested address
#

May  6 11:59:36.699943 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra_defrtr = 0
May  6 11:59:36.700028 r-leopard-01 INFO interfaces-config.sh[5085]: net.ipv6.conf.eth0.accept_ra = 0
May  6 11:59:36.710982 r-leopard-01 INFO systemd[1]: Stopping Network initialization...
May  6 11:59:36.734004 r-leopard-01 INFO networking[5090]: networking: Deconfiguring network interfaces
May  6 11:59:36.828247 r-leopard-01 INFO caclmgrd[4657]: Issuing the following iptables commands:
May  6 11:59:36.849028 r-leopard-01 INFO pcie-check.sh[5099]: 1
May  6 11:59:36.854216 r-leopard-01 NOTICE root: Started swss service...
May  6 11:59:36.858349 r-leopard-01 NOTICE root: /usr/bin/pcie-check.sh : PCIe check passed
May  6 11:59:36.858756 r-leopard-01 INFO systemd[1]: pcie-check.service: Succeeded.
May  6 11:59:36.861054 r-leopard-01 NOTICE root: Unlocking /tmp/swss-syncd-lock (10) from swss service
May  6 11:59:36.868151 r-leopard-01 INFO systemd[1]: Started switch state service.
May  6 11:59:36.871365 r-leopard-01 INFO systemd[1]: Starting Enable backend acl on storage backend ToRs...
May  6 11:59:36.875347 r-leopard-01 INFO systemd[1]: Starting syncd service...
May  6 11:59:36.880671 r-leopard-01 INFO systemd[1]: Starting TEAMD container...
May  6 11:59:36.880853 r-leopard-01 INFO systemd[1]: Started watchdog control service.
May  6 11:59:36.887150 r-leopard-01 NOTICE root: Starting syncd service...
May  6 11:59:36.890145 r-leopard-01 NOTICE root: /usr/local/bin/watchdog-control.sh : Disabling Watchdog during bootup after fastfast
May  6 11:59:36.891808 r-leopard-01 NOTICE admin: Starting teamd service...
May  6 11:59:36.898266 r-leopard-01 NOTICE root: Locking /tmp/swss-syncd-lock from syncd service
May  6 11:59:36.910646 r-leopard-01 NOTICE root: Locked /tmp/swss-syncd-lock (10) from syncd service
May  6 11:59:36.952308 r-leopard-01 INFO backend_acl.py: executing cmd =  ['/usr/local/bin/sonic-cfggen', '-m', '-v', 'DEVICE_METADATA.localhost.type']
May  6 11:59:36.957352 r-leopard-01 NOTICE admin: Warm boot flag: teamd false.
May  6 11:59:36.965040 r-leopard-01 NOTICE admin: Fast boot flag: teamd .
May  6 11:59:36.966048 r-leopard-01 INFO swss.sh[5238]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.966142 r-leopard-01 INFO swss.sh[5238]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.972625 r-leopard-01 INFO syncd.sh[5241]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.972722 r-leopard-01 INFO syncd.sh[5241]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.979634 r-leopard-01 INFO swss.sh[5244]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.979993 r-leopard-01 INFO swss.sh[5244]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.985177 r-leopard-01 INFO syncd.sh[5248]: Invalid database name input : 'STATE_DB'
May  6 11:59:36.985266 r-leopard-01 INFO syncd.sh[5248]: Unable to connect to redis: Cannot assign requested address
May  6 11:59:36.989408 r-leopard-01 NOTICE root: Warm boot flag: syncd false.
May  6 11:59:37.010108 r-leopard-01 NOTICE root: Starting Firmware update procedure

#
# syncd is started in cold mode
#

May  6 12:00:25.801003 r-leopard-01 NOTICE syncd#syncd: :- Syncd: command line:  EnableDiagShell=NO EnableTempView=YES DisableExitSleep=NO EnableUnittests=NO EnableConsistencyCheck=NO EnableSyncMode=YES RedisCommunicationMode=redis_async EnableSaiBulkSuport=YES StartType=cold ProfileMapFile=/tmp/sai.profile GlobalContext=0 ContextConfig= BreakConfig=
May  6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- loadFromFile: no context config specified, will load default context config
May  6 12:00:25.801047 r-leopard-01 NOTICE syncd#syncd: :- insert: added switch: idx 0, hwinfo ''
May  6 12:00:25.801047 r-leopard-01 WARNING syncd#syncd: :- Syncd: enable sync mode is deprecated, please use communication mode, FORCING redis sync mode
May  6 12:00:25.802340 r-leopard-01 NOTICE syncd#syncd: :- RedisSelectableChannel: opened redis channel
May  6 12:00:25.804310 r-leopard-01 NOTICE syncd#syncd: :- isVeryFirstRun: First Run: True
May  6 12:00:25.804310 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: override command line startType=cold via SAI_START_TYPE_WARM_BOOT
May  6 12:00:25.804334 r-leopard-01 NOTICE syncd#syncd: :- profileGetValue: SAI_WARM_BOOT_READ_FILE: /var/warmboot/sai-warmboot.bin
May  6 12:00:25.804352 r-leopard-01 NOTICE syncd#syncd: :- performStartupLogic: using warmBootReadFile: '/var/warmboot/sai-warmboot.bin'
May  6 12:00:25.804376 r-leopard-01 WARNING syncd#syncd: :- performStartupLogic: user requested warmStart but warmBootReadFile is not specified or not accessible, forcing cold start
May  6 12:00:25.804400 r-leopard-01 NOTICE syncd#syncd: :- initialize: SAI API version: 11100

Since, this is a timing issue - it is very hard to reproduce.

The possible solutions are:

  1. Add a dependency on interfaces-config for all sonic services which will guarantee redis availability over TCP/IP for all use cases
  2. Switch to UNIX socket IPC communication for all the processes

In case TCP/IP availability is a must, the entire SONiC architecture must be revisited in order to guarantee loopback (localhost) reachability for all the situations.

Steps to reproduce the issue:

  1. Run warm-reboot sad test in a loop

Describe the results you received:

syncd is started in a cold mode after 'warm-reboot'

Describe the results you expected:

syncd is started in a warm mode after 'warm-reboot'

Output of show version:

  • N/A

Output of show techsupport:

  • N/A

Additional information you deem important (e.g. issue happens only occasionally):

More on this:

  1. SONiC DB CLI doesn't work when UNIX socket option is provided: verified by stopping networking service
  2. Verified fast-reboot on MSN4700 with dependency on interfaces-config for swss/syncd: degradation is 2 sec

No dependency "After":

Summary:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Longest downtime period was 0:00:10.806707
2023-05-12 15:24:21 : Reboot time was 0:00:38.100113
2023-05-12 15:24:21 : Expected downtime is less then 0:00:30
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : Additional info:
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:24:21 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:24:21 : --------------------------------------------------
2023-05-12 15:24:21 : =========

With dependency "After":

Summary:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Longest downtime period was 0:00:10.492857
2023-05-12 15:55:51 : Reboot time was 0:00:40.035968
2023-05-12 15:55:51 : Expected downtime is less then 0:00:30
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : Additional info:
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : INFO:10.213.103.9:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.9:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.8:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.8:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.7:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.7:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : INFO:10.213.103.6:PortChannel interface state changed 1 times
2023-05-12 15:55:51 : INFO:10.213.103.6:BGP v6 routes must be down just for once
2023-05-12 15:55:51 : --------------------------------------------------
2023-05-12 15:55:51 : =========

The culprit:

  1. [swss/syncd] remove dependency on interfaces-config.service #13084

Attempt to fix the bug:

  1. Delay mux/sflow/snmp timer after interface-config service #14506

Related to:

  1. [boot performance] long swss/syncd/teamd start due to interfaces-config.service #12948
  2. Log messages missed at boot for containers not depending on interfaces-config.service #14444

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions