Fast-reboot Flow Improvements HLD by shlomibitton · Pull Request #980 · sonic-net/SONiC

shlomibitton · 2022-04-13T14:28:26Z

Related PRs:

master

| PR title | Dependencies | Unit test | Owner |
| --- | --- | --- | --- |
| [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |
| [sonic-sairedis] [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |
| [sonic-utilities] [fastboot] fastboot enhancement: Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |
| [sonic-mgmt] Added a test case for fast reboot from other vendor NOS to SONiC | None | Done | @arfeigin |

202205

| PR title | Dependencies | Unit test | Owner |
| --- | --- | --- | --- |
| [202205] Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |
| [sonic-sairedis] [202205] Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |
| [sonic-utilities] [202205] Use warm-boot infrastructure for fast-boot | None | Done, part of the PR | @arfeigin |

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>

paulmenzel

[I haven’t followed SONiC a lot (outside the Linux kernel changes) lately, so please excuse my ignorance and ignore my comments, if they do not apply.]

Thank you for drafting this. Unfortunately, it’s quite hard to review, as the sentences do not line-wrap, the text still has some style issues, and the wording is often bumpy.

A commit message should also be added with a very short summary of the goal. Maybe with an example of the analysis of the state of a current system, so it’s clear what needs to be improved.

paulmenzel · 2022-04-19T07:17:04Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+# 1 Overview
+
+The goal of SONiC fast-reboot is to be able to restart and upgrade SONiC software with a data plane disruption less than 30 seconds and control plane less than 90 seconds. Today we don't have any indication of the fast-reboot status and some flows are delayed with a timer because of it, like enablement of flex counters. In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.


Where do 30 s and 90 s come from? That does not sound very fast at all. This should probably also be explained further below, with data on the current timings.

Please link to the timer for the flex counters.

These are MSFT requirements. We can discuss them in the design review and clarify the motivation behind these numbers.

Added

paulmenzel · 2022-04-19T07:23:51Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+The goal of SONiC fast-reboot is to be able to restart and upgrade SONiC software with a data plane disruption less than 30 seconds and control plane less than 90 seconds. Today we don't have any indication of the fast-reboot status and some flows are delayed with a timer because of it, like enablement of flex counters. In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.
+
+Each network application will experience similar processing flow.


I have a hard time understanding how the sentences in this paragraph relate to each other. Can you rephrase?

Done. Is it clear now?
If not, can you specify what is not clear?

paulmenzel · 2022-04-19T07:24:19Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+The goal of SONiC fast-reboot is to be able to restart and upgrade SONiC software with a data plane disruption less than 30 seconds and control plane less than 90 seconds. Today we don't have any indication of the fast-reboot status and some flows are delayed with a timer because of it, like enablement of flex counters. In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.
+
+Each network application will experience similar processing flow.
+Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC.


Applications?

neighsyncd, fpmsyncd etc...

Minor - instead of original data, prefer preboot state

paulmenzel · 2022-04-19T07:24:56Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+Each network application will experience similar processing flow.
+Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC.
+Take neighbor as an example, upon restart operation every neighbor we had prior the reboot should be created again after resetting the ASIC.


paulmenzel · 2022-04-19T07:25:52Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC.
+Take neighbor as an example, upon restart operation every neighbor we had prior the reboot should be created again after resetting the ASIC.
+We should also synchronize the actual neighbor state after recovering it, the MAC of the neighbor could have changed, went down for some reason etc.
+In this case, restore_neighbors.py script will align the network state with the switch state by sending ARP/NDP to all known neighbors prior the reboot.


Please mark up and link to restore_neighbors.py.

paulmenzel · 2022-04-19T07:30:39Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+- Recover all applications state with the new image to the previous state prior the reboot.
+- Recover ASIC state after reset to the previous state prior the reboot.
+- Recover the Kernel state after reset to the previous state prior the reboot.
+- Synd the Kernel and ASIC with changes on the network which happen during fast-reboot.


paulmenzel · 2022-04-19T07:31:42Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+### SWSS docker
+
+When swss docker start with the new kernel, all the port/LAG, vlan, interface, arp and route data should be restored from CONFIG DB, APP DB, Linux Kernel and other reliable sources. There could be ARP, FDB changes during the restart window, proper sync processing should be performed.


… SWSS Docker starts …

paulmenzel · 2022-04-19T07:32:07Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+### Syncd docker
+
+The restart of syncd docker should leave data plane intact until it starts again with the new kernel. After restart, syncd configure the HW with the state prior the reboot by all network applications.


paulmenzel · 2022-04-19T07:32:42Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+## 4.1 Orchagent Point Of View
+
+When orchagent start with the new SONiC image, the same infrastructure we use to reconsile fastfast-boot will start.


starts

reconcile

paulmenzel · 2022-04-19T07:35:01Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+### NOTICE
+
+'warmRestoreValidation' might fail the operation just like in fastfast-reboot case, if the way orchagent process an event from the DB is handled differently with the new software version the task will fail to execute and fast-reboot will fail along with it.


fastfast is another reboot type we support.

shlomibitton · 2022-04-19T09:58:55Z

> [I haven’t followed SONiC a lot (outside the Linux kernel changes) lately, so please excuse my ignorance and ignore my comments, if they do not apply.]
>
> Thank you for drafting this. Unfortunately, it’s quite hard to review, as the sentences do not line-wrap, the text still has some style issues, and the wording is often bumpy.
>
> A commit message should also be added with a very short summary of the goal. Maybe with an example of the analysis of the state of a current system, so it’s clear what needs to be improved.

Hi @paulmenzel,

You can read the rendered .md file, which is clearer, at this link:
https://github.com/Azure/SONiC/blob/23ff3186a05e8de7c3d725a69e9518801ae0b4db/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md
Currently, when we upload an HLD, we track the open PRs related to the feature in the description. See this as an example:
DHCP relay for IPv6 HLD #765
The main idea here is to use the infrastructure we already have for other advanced reboot scenarios, so we only have to maintain it once.
In addition, it gives us a reconciliation flag that we can use to trigger other parts of SONiC, which is not available today.

paulmenzel · 2022-04-19T10:05:58Z

[…]

1. You can read the file as .md format to make it more clear, you can use this link:
   https://github.com/Azure/SONiC/blob/23ff3186a05e8de7c3d725a69e9518801ae0b4db/doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

Thank you. I know, but then I cannot comment directly. :(

2. Currently when we upload a HLD, we track open PR's related to the feature on the description part, check this as an example:
   [DHCP relay for IPv6 HLD #765](https://github.com/Azure/SONiC/pull/765)

I did not know. Thank you. (Maybe the commit message could have that added though.)

3. The main idea here is to use the infrastructure we have for other advanced reboot scenarios, so we can maintain it once.
   In addition, to have a reconciliation flag we can use to trigger other parts of SONiC, this is currently not available.

Can you elaborate on the “advanced reboot scenarios”? My gut says, switches shouldn’t do a lot of “advanced” stuff, as this complicates things, is more error-prone, and often slower.

shlomibitton · 2022-04-19T10:31:24Z

> Can you elaborate on the “advanced reboot scenarios”? My gut says, switches shouldn’t do a lot of “advanced” stuff, as this complicates things, is more error-prone, and often slower.

Currently we support three types of advanced reboot:

1. Fast-reboot: the type we are discussing in this PR.
2. Warm-reboot: reboots the switch with no data plane disruption but 90 seconds of control plane disruption.
3. Fastfast-reboot: another flavor of warm-reboot, currently implemented only by Nvidia. It is implemented in the lower layers: the resources (TCAM) are divided into two banks, and each time a warm-boot is executed it operates on one bank.

liat-grozovik · 2022-04-25T09:05:22Z

@vaibhavhd could you please review?

vaibhavhd

Minor comments.

vaibhavhd · 2022-04-28T02:59:42Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+The goal of SONiC fast-reboot is to be able to restart and upgrade SONiC software with a data plane disruption less than 30 seconds and control plane less than 90 seconds.
+With current implementation there is no indication of the fast-reboot status, meaning we don't have a way to determine if the flow has finished or not.
+Some feature flows in SONiC are delayed with a timer to keep the CPU dedicated to the fast-reboot init flow for best perforamnce, like enablement of flex counters.
+In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.


Minor: can we just call it warm-reboot instead of fastfast-reboot? Warm-reboot is generic.

Warm-reboot is indeed generic, but the approach is different.
Warm-reboot is split into two implementations: the normal warm-reboot and fastfast-reboot.
Although the CLI command is the same ("warm-reboot"), the flow under the hood is different.
Calling it warm-reboot might be confusing, which is why I refer to it as fastfast-reboot, to be more accurate.
Do you agree?

vaibhavhd · 2022-04-28T03:00:58Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+The goal of SONiC fast-reboot is to be able to restart and upgrade SONiC software with a data plane disruption less than 30 seconds and control plane less than 90 seconds. Today we don't have any indication of the fast-reboot status and some flows are delayed with a timer because of it, like enablement of flex counters. In order to have such indicator, re-use of the fastfast-reboot infrastructure can be used.
+
+Each network application will experience similar processing flow.
+Application and corresponding orchagent sub modules need to work together to restore the original data and push it to the ASIC.


Minor - instead of original data, prefer preboot state

vaibhavhd · 2022-04-28T03:13:56Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+In addition to the recover mechanism, the warmboot-finalizer can be enhanced to finalize fast-reboot as well and introduce a new flag indicating the process is done.
+This new flag can be used later on for any functionality which we want to start only after init flow finished in case of fast-reboot.
+This is to prevent interference in the fast-reboot reconciliation process and impair the performance, for example enablement of flex counters.


Minor: link the known issues and past patch fixes that motivated this better design of a fast-reboot indicator.

sonic-net/sonic-buildimage#7140

sonic-net/sonic-buildimage#7987
sonic-net/sonic-buildimage#8157
sonic-net/sonic-buildimage#7965
sonic-net/sonic-buildimage#8117
sonic-net/sonic-swss#1803
sonic-net/sonic-swss#1804
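
To make the idea of gating delayed features on the new flag concrete, here is a minimal sketch, not the actual implementation. It polls a "fast-reboot finished" indicator and falls back to a timeout, so a feature such as flex counter enablement starts as soon as reconciliation is done instead of after a fixed timer. The function name, the `is_done` callback and the 180-second default are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_fast_reboot_done(
    is_done: Callable[[], bool],
    timeout_s: float = 180.0,   # assumed fallback, not taken from the HLD
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a hypothetical 'fast-reboot finished' flag.

    Returns True if the flag was seen before the timeout expired,
    False otherwise. The caller decides what to do on timeout.
    """
    waited = 0.0
    while waited < timeout_s:
        if is_done():
            return True
        sleep(poll_s)
        waited += poll_s
    # One last check so a flag set during the final sleep is not missed.
    return is_done()
```

A caller would pass a function that reads the real flag and would enable the delayed feature once this returns, whether because the flag was set or because the fallback timeout expired.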

vaibhavhd · 2022-04-28T03:19:16Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+- Recover the Kernel state after reset to the previous state prior the reboot.
+- Sync the Kernel and ASIC with changes on the network which happen during fast-reboot.
+- Control plane downtime will not exceed 90 seconds.
+- Data plane downtime will not exceed 30 seconds.


All these requirements are already mostly being met.

Do you want to emphasize only the new requirements that the new design is expected to meet instead? For example: "Do not exceed CPU/mem consumption during the bootup path until the fast-reboot flow is finished."

I thought it would be better to gather all requirements, even those already satisfied by the previous implementation, since we are replacing it with a new one.
I think it is important to list them all when the implementation is different.
What do you think?

vaibhavhd · 2022-04-28T06:42:33Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+### NOTICE
+
+'warmRestoreValidation' might fail the operation just like in fastfast-reboot case, if the way orchagent process an event from the DB is handled differently with the new software version the task will fail to execute and fast-reboot will fail along with it.
+This is solvable by the db migrator.


This isn't true: DB_MIGRATOR cannot solve all the issues. We still have failures when unexpected APPLY_VIEW operations are performed, when SAI objects (or attributes) change between the old and new software versions, or when OIDs mismatch.

Will this design create potential for new issues in fast-reboot?

First, this is based on the fastfast-reboot infra and not the warm-reboot infra, so the temp view is not used at all and INIT/APPLY view basically does nothing.
DB migration and OID mismatches could be resolved by modifying the way we back up the DB.
This should also address the problem of fast rebooting from EOS to SONiC, as we discussed by mail.
From the mail thread:

"If I understand correctly, we basically want to keep the JSON dump files used in the current design instead of a Redis RDB backup file, which will be backward compatible and cover this scenario as well.
So I am thinking that, instead of taking a backup copy of the Redis DB, we keep the current approach and push these dumps to the DB with the swssconfig tool via swssconfig.sh.
We can still use the fastfast-reboot reconciliation logic in each network application and in orchagent, since they start afterwards and all data is present in the DB at that point."
What do you think?
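
For readers unfamiliar with the dump files mentioned here, the following is a rough sketch of what such a file could look like. It assumes the dumps are JSON lists of `SET` operations keyed by `TABLE:key`, which is the layout swssconfig consumes. The helper names and the exact field names (`neigh`, `family`, `type`, `port`) are illustrative assumptions and should be checked against the real dump script.

```python
import json
from typing import Dict, List

Entry = Dict[str, object]


def neighbor_entry(vlan: str, ip: str, mac: str, family: str = "IPv4") -> Entry:
    """Build one neighbor operation (field names are assumptions)."""
    return {f"NEIGH_TABLE:{vlan}:{ip}": {"neigh": mac, "family": family}, "OP": "SET"}


def fdb_entry(vlan: str, mac: str, port: str) -> Entry:
    """Build one FDB operation (field names are assumptions)."""
    return {f"FDB_TABLE:{vlan}:{mac}": {"type": "dynamic", "port": port}, "OP": "SET"}


def dump_entries(entries: List[Entry]) -> str:
    """Serialize the operations as the JSON list a loader would read."""
    return json.dumps(entries, indent=4)


if __name__ == "__main__":
    print(dump_entries([
        neighbor_entry("Vlan1000", "192.168.0.2", "00:11:22:33:44:55"),
        fdb_entry("Vlan1000", "00:11:22:33:44:55", "Ethernet0"),
    ]))
```

Because the file is plain JSON rather than a Redis RDB snapshot, another NOS could in principle produce it too, which is the backward-compatibility argument made above.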

vaibhavhd · 2022-04-28T06:56:01Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+
+## 4.2 Syncd Point Of View - INIT/APPLY view framework
+
+Syncd starts with the fast-reboot flag, trigger the ASIC reset when create_switch is requested from orchagent.


Is there anything changing in the new design with respect to the syncd and orchagent flow?

The only change here is to disable temp view when running syncd since it is not needed.
The ASIC DB is empty at this point and we have only one view since we reset the ASIC.

vaibhavhd · 2022-04-28T07:04:44Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+## 4.5 Reboot finalizer
+
+Today we have a tool used for warm-reboot to collect all reconsiliation flags from the different network applications.
+This tool can be enhanced to consider fast-reboot as well and introduce a new flag indicating the end of the process.


Do you need new flag? We already have a flag to detect fast-reboot. Example:
https://github.com/Azure/sonic-buildimage/blob/bc305283417a1c305bce4035547191e1d480c1d5/files/scripts/swss.sh#L65

This flag is a dummy flag, and I am not sure it is actually used anywhere.
It is a flag with an expiry timer of 180 seconds, whereas the one referred to in this HLD is in sync with all network applications going through the reconciliation process.
I am referring to the WARM_RESTART_TABLE in State DB, which we can see by running `show warm_restart state`.
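
As a sketch of the difference from a timer-based flag, the finalizer logic can be thought of as below. It assumes the per-application state table (what `show warm_restart state` displays) can be read into a plain dictionary, and that an application reports the state `reconciled` when it is done. The function names and the list of expected applications are hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical snapshot of the per-application restart states. On a real
# switch this would be read from State DB instead of being hard-coded.
RestartTable = Dict[str, Dict[str, str]]


def pending_apps(table: RestartTable, expected: Iterable[str]) -> List[str]:
    """Return the expected applications that have not reported 'reconciled'."""
    return [
        app for app in expected
        if table.get(app, {}).get("state") != "reconciled"
    ]


def fast_reboot_finished(table: RestartTable, expected: Iterable[str]) -> bool:
    """True only when every expected application has reconciled."""
    return not pending_apps(table, expected)
```

The "done" indication then follows the actual reconciliation progress of every application, rather than expiring after a fixed 180 seconds regardless of state.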

Address the scenario of upgrading to SONiC from a vendor NOS.

liat-grozovik · 2022-05-21T07:53:11Z

@vaibhavhd could you please help to review the recent changes?
The idea is to have it approved, then work on providing the PRs for it and have it as part of 202205.

vaibhavhd · 2022-05-23T15:50:43Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md

+   - On this scenario all should work exacly the same as the switch rebooted from SONiC to SONiC.
+
+ - Dump files of default gateway, neighbors and fdb tables are not provided to the new image as SONiC does prior the reboot.
+   - On this scenario fast-reboot will finish successfully, but with low performance since all neighbors and fdb entries will be created by the slow path.


The claim here is that performance will be degraded without dump files. Which metrics are you referring to?

Additionally, I see that without dump files the downtime is >30 s and the test failed. I believe that is unexpected.

2022-05-18 15:56:33 : FAILED:dut:Total downtime period must be less then 0:00:30 seconds. It was 48.5044789314

What platform was tested here? Even in the good-case scenario the downtime is 28 s, which is concerningly close to the 30 s SLA.

@vaibhavhd what do you mean by metrics?
These are the results we have with the current implementation. The test results were added to the HLD as you requested, but they are not actually related to this new feature.
The new design will support the flow you are requesting. As for the performance of the new flow, if it does not suffice it should be handled as a separate feature; we can discuss it and add it as an FR.

If you want, we can have a call today to discuss it.

By metrics, I meant to ask what you meant by low performance in this line: "On this scenario fast-reboot will finish successfully, but with low performance".

Regarding the failure case (48 s downtime), I did not notice that it was from the current implementation. I assumed you had tested it with your local changes for the new design.
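
For reference, the pass/fail criterion discussed in this thread boils down to a comparison against the two limits stated in the HLD (30 seconds for the data plane, 90 seconds for the control plane). The sketch below is only an illustration of that check; the function name and the returned message format are made up and do not reproduce the sonic-mgmt test code.

```python
from typing import List, Optional

DATA_PLANE_LIMIT_S = 30.0     # from the HLD requirements
CONTROL_PLANE_LIMIT_S = 90.0  # from the HLD requirements


def downtime_violations(
    data_plane_s: float,
    control_plane_s: Optional[float] = None,
) -> List[str]:
    """Return a message for each limit that the measured downtime exceeds."""
    violations = []
    if data_plane_s >= DATA_PLANE_LIMIT_S:
        violations.append(
            f"data plane downtime {data_plane_s:.1f}s >= {DATA_PLANE_LIMIT_S:.0f}s"
        )
    if control_plane_s is not None and control_plane_s >= CONTROL_PLANE_LIMIT_S:
        violations.append(
            f"control plane downtime {control_plane_s:.1f}s >= {CONTROL_PLANE_LIMIT_S:.0f}s"
        )
    return violations
```

With the numbers quoted above, the run without dump files (48.5 s) violates the data plane limit, while the good-case run (28 s) passes with only a 2-second margin.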

vaibhavhd · 2022-05-23T15:52:45Z

doc/fast-reboot/Fast-reboot_Flow_Improvements_HLD.md


+If dump files provided by the previous image prior the reboot, all tables should be pushed to APP DB for reconciling orchagent.
+If no dumps are provided, orchagent will reconcile with no information from prior the reboot.
+On this case all ARP and FDB entries will be created by the slow path.


Please also describe what is meant by slow-path here for better clarity.

zhangyanzhao · 2022-06-01T22:50:01Z

@shlomibitton can you please add the code PRs, following "EVPN VxLAN update for platforms using P2MP tunnel based L2 forwarding" by dgsudharsan (Pull Request #806, Azure/SONiC) as an example?

zhangyanzhao · 2022-08-08T15:12:16Z

The code PRs need to be updated. @liat-grozovik

Fast-reboot Flow Improvements HLD

23ff318

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>

yxieca force-pushed the master branch 2 times, most recently from 8498931 to 8837dc2 Compare April 15, 2022 16:51

liat-grozovik mentioned this pull request Apr 18, 2022

Add counters enabling redesign document #918

Closed

paulmenzel suggested changes Apr 19, 2022

Fix review comments

d2629eb

liat-grozovik requested a review from vaibhavhd April 25, 2022 09:05

vaibhavhd reviewed Apr 28, 2022

Shlomi Bitton added 2 commits May 1, 2022 15:33

Fix more review comments

316979e

Fix HLD to use dump files as we use today instead of a redis rdb backup.

d63b27f

Address the scenario of upgrading to SONiC from a vendor NOS.

vaibhavhd reviewed May 23, 2022

Clarify what slow path mean

436cb6a

vaibhavhd approved these changes May 24, 2022

liat-grozovik merged commit a8596f6 into sonic-net:master May 25, 2022


shlomibitton commented Apr 13, 2022 • edited by liat-grozovik
