Routing-stack warm-reboot feature. by rodnymolina · Pull Request #602 · sonic-net/sonic-swss

rodnymolina · 2018-08-29T17:25:07Z

[ dependencies with PR/600: https://github.com//pull/600 ]

Please refer to the corresponding design document for more details: sonic-net/SONiC#239

The following manual UT plan has been successfully executed. UT automation pending.

Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested.

Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger.
=============

IPv4 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Testcase Suite 2: Making use of sudo service bgp restart” as trigger.
=============

IPv4 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv4 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv6 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Signed-off-by: Rodny Molina rmolina@linkedin.com

nikos-github

Couple of initial comments. More to come.

nikos-github · 2018-08-29T20:25:29Z

fpmsyncd/fpmsyncd.cpp

+                    s.addSelectable(&warmStartTimer);
+                }
+            }
+


I believe there is work to be done if there is a timer running or a previous reconciliation effort is in progress so likely you need an else part. While the connection was severed (not due to fpmsyncd restart) the warmStartEnabled may have changed state and lots of things won't work after that.

@nikos-github thanks for your comments.

You have a point here. For the above to happen we need an unlikely sequence of events to take place:

two or more zebra crashes happening in a small window of time, and

a user deactivating routing-warm-reboot feature right between the two zebra crashes

Unlikely but possible. I'll adjust my code to deal with this scenario.

If warm restart is found to be disabled, we should still recover and cleanup. There should be no requirement to disrupt the lower sonic layers now that we are adding such infra/functionality.

Please amend the code to address initial comment.

@nikos-github i missed this comment. Yes, i'm about to send one (hopefully final) update that explicitly clears the state of the data-structures used within WarmStartHelper class. That should take care of your previous concerns for the bgp/zebra restarting case.

nikos-github · 2018-08-29T20:26:10Z

fpmsyncd/fpmsyncd.cpp

+            if (warmStartEnabled)
+            {
+                /* Obtain warm-restart timer defined for routing application */
+                uint32_t warmRestartIval = sync.m_warmStartHelper.getRestartTimer();


I believe some cleanup is required if there is a timer already running.

Please address the comment.

Yes, i'm taking care of it, see my other comment on this matter.

nikos-github · 2018-09-06T07:50:25Z

fpmsyncd/fpmsyncd.cpp

+                }
+
+                /* Execute recovery instruction and kick off warm-restart timer */
+                if (sync.m_warmStartHelper.runRecovery())


Before recovering the appdb state, you need to either clear or check that the producer queue is empty.

Yep, already took care of that, my next patch will have it.

jipanyang

Will do further review once the dependee PR is settled down and code updated.

jipanyang · 2018-09-06T18:34:23Z

fpmsyncd/fpmsyncd.cpp

+                else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
+                {
+                    pipeline.flush();
+                    SWSS_LOG_NOTICE("Pipeline flushed");


Will there be too many logs for pipeline flush in normal running?

@jipanyang I'm just adding one extra "Pipeline flushed" log, which will be executed only once upon restart-timer expiration. During normal operation we will continue generating this log the same way we are doing it today -- see that i'm not adding this.

The original log level is DEBUG, now it is changed to NOTICE.

Sure, i can set all "Pipeline flushed" logs to debug.

jipanyang · 2018-09-06T18:39:08Z

fpmsyncd/routesync.cpp

-        m_routeTable.del(destipprefix);
-        return;
+        if (warmState == WarmStart::INITIALIZED ||
+            warmState == WarmStart::RECONCILED)


"warmState == WarmStart::INITIALIZED", It is confusing, while warm state is INITIALIZED and route is injected to appDB directly. Do you mean warm start has been disabled?

My understanding, based on a recent discussion where we discussed/agreed on this issue, is that INITIALIZED state will be utilized for applications with warm-reboot feature disabled -- so these ones will permanently display INITIALIZED state. Based on that discussion i had to re-adjust my code and add this INITIALIZED checkpoint. My original code was setting all applications to RECONCILED state, regardless of WarmReboot being enable or not, as this is the last FSM stage. Looks like our problem here is simply a consequence of INITIALIZED state being slightly overloaded. The simplest solution is to provided another FSM state (DEACTIVATED). Will ping you offline to discuss FSM semantics.

Maybe I missed some of the discussions. My understanding for WarmStart state is that it is only meaningful when warm start has been enabled. If warm start is disabled, people won't care about WarmStart state.
Putting too much state change seems to make the problem unnecessary complicated.

As you know, applications like fpmSyncd need to be able to invoke WarmRestart logic on-demand, and not only when when it boots-up. So the initialization of WarmRestart logic is not necessarily connected to its activation: you could have WarmRestart being turned on at any given point in the future without having to restart fpmSyncd. To fix this problem there are two possible solutions: 1) we overload one of the current FSM states (INITIALIZED), or 2) we create a new UNINITIALIZED/DEACTIVATED FSM state.

For fpmsyncd case, whenever warm restore/reconcile is needed and warm start has been enabled, the INITIALIZED state could be set before starting restore to indicate that warm start is initiated, after restore, RESTORED; then RECONCILED.

I don't see the need to overload INITIALIZED or add new DEACTIVATED state.

Actually "INITIATED" looks like a better name than "INITIALIZED" if we are talking about warm restart state.

I suggest you use RECONCILED always. I agree with Jipan that INITIALIZED is confusing and at the same time UNINITIALIZED/DEACTIVATED will be equally confusing if warm restart is turned on after fpmsyncd is up because the state won't match with what's configured.

Nikos, you are agreeing with me then. This is precisely my original approach: all apps end up in RECONCILED state. As i said above, i had to re-adjust my code to address the 'requirement' of having apps (with warm-reboot OFF) using the INITIALIZED state.

Jipan's approach is not using any state to represent apps with warmReboot turned off, which is something that i'm struggling with. Btw, the approach that you're suggesting wouldn't add a new state, but RECONCILED state would be conveying two semantics: 1) application with warm-reboot ON going through all FSM states, and 2) application with warm-reboot OFF skipping all FSM states. Either way, we are slightly overloading the FSM, which i'm totally fine with. But there's no free lunch here: either we overload or we add a new FSM state.

I'm trying to point out a distinction between FSM state and whether the feature is enabled or not. Trying to infer what's the feature's state (on or off) from the FSM isn't the right thing in my opinion. If you want to convey both then you either need 2 variables, no overloading of FSM state and introduce another state in the FSM, OR you need 1 variable that has states plus feature enabled/disabled (via a bitmask), no overloading of FSM state or use reconciled. If you don't want any of the above, then from my point of view reconciled is the right state whether the feature is disabled or later enabled.

@nikos-github just to close the loop here. As discussed with the rest of the folks working in warm-reboot, we will rely on the assumption that no FSM state implicitly indicates the lack of warm-reboot functionality, or the fact that this one hasn't been turned on. The three FSM states (initialized, restored, reconciled) will be exclusively used for apps that are warm-reboot capable.

I have no problem with that. The state should be reconciled though when the app is warm reboot capable but warm reboot isn't enabled or is enabled later.

nikos-github · 2018-09-06T22:59:06Z

fpmsyncd/routesync.cpp

-        m_routeTable.del(destipprefix);
-        return;
+        if (warmState == WarmStart::INITIALIZED ||
+            warmState == WarmStart::RECONCILED)


I suggest you use RECONCILED always. I agree with Jipan that INITIALIZED is confusing and at the same time UNINITIALIZED/DEACTIVATED will be equally confusing if warm restart is turned on after fpmsyncd is up because the state won't match with what's configured.

nikos-github · 2018-09-07T21:07:30Z

warmrestart/warmRestartHelper.cpp

+ *   from AppDB.
+ *
+ */
+void WarmStartHelper::reconciliate(void)


reconcile not reconciliate

nikos-github · 2018-09-07T21:08:37Z

warmrestart/warmRestartHelper.cpp

+    }
+
+    /* Recovery map should be entirely empty by now */
+    assert(m_recoveryMap.size() == 0);


We should recover gracefully here by falling back to cold restart.

Not sure what do you mean by 'gracefully'. I understand that we should do something about it, but that's something the process-manager (supervisord) should take care of upon the app being cold-booted after the failed assertion.

How will supervisord achieve a cold boot since warm restart is enabled? In any case, asserting is shouldn't be here in my opinion - that's way too harsh. Flushing the db and re-establishing the connection will cause the routes to re-download. That's why I mean by gracefully.

lguohan · 2018-09-12T18:39:50Z

depend on #600

jipanyang

How are existing AppRestartAssist class and neighsyncd reconciliation to be handled?
There should not be two classes performing similar tasks.

jipanyang · 2018-09-13T01:44:08Z

fpmsyncd/fpmsyncd.cpp

+                else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
+                {
+                    pipeline.flush();
+                    SWSS_LOG_NOTICE("Pipeline flushed");


The original log level is DEBUG, now it is changed to NOTICE.

jipanyang · 2018-09-13T01:47:36Z

fpmsyncd/fpmsyncd.cpp

+                    pipeline.flush();
+                    SWSS_LOG_NOTICE("Pipeline flushed");
+                }
+                else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())


Is there "else" to take care of remaining conditions?

if() { } else if { }

Which other conditions are you referring to? There are only three conditions that we care about here: 1) warmStartTimer expired, which triggers reconciliation, 2) warm-restart disabled -- to deal with legacy/current scenarios, and 3) scenarios with warm-restart reconciliation completed. In 2) and 3) the actions are exactly the same, that's why i'm bundling them together. What would be the use case for the 'else'?

I was not checking the internal logic. If the "else if" covers everything, why don't you just change it to else {}.

Based on your current implementation, it looks some default processing is missing.

if (condition 1) { } else if (condition2) { } else { // default processing; }

Got your point now. The "else if" is there to make it more explicit to the reader that those two conditions could take place. Performance-wise there's no doubt the "else" is slightly cheaper. Have no strong opinion here, i can make use of an "else" if everyone agrees.

jipanyang

Another general comment, VS test cases should be prepared for the changes.

rodnymolina · 2018-09-13T02:39:13Z

I'm working on VS test-cases right now.

Answering your other question: the goal is to replace the existing reconciliation logic with the one i'm presenting here (which offers more flexibility). That will require some refactoring work on neighsyncd which i'm planning to carry out once done with the routing portion. At the end there will be a single reconciliation class.

jipanyang · 2018-09-13T05:49:54Z

warmrestart/warmRestartHelper.cpp

+            splitValues.push_back("");
+        }
+
+        for (int j = 0; j < static_cast<int>(splitValues.size()); ++j)


could you get away from static_cast? Using something like size_type j = 0, or auto j = splitValues.size();

jipanyang · 2018-09-13T05:50:15Z

warmrestart/warmRestartHelper.cpp

+
+        for (int j = 0; j < static_cast<int>(splitValues.size()); ++j)
+        {
+            if (j < static_cast<int>(fvVector.size()))


same as previous comment.

jipanyang · 2018-09-13T06:25:22Z

warmrestart/warmRestartHelper.h

+         * The sizes of both tuple-vectors should always match within any given
+         * application, otherwise we are running into some form of bug.
+         */
+        assert(left.size() == right.size());


When comparing two vectors, how to ensure they are of same size?
Say, a route has two fields in usual case: nexthop and intf, what if for some routes, the third fields "blackhole" is set?

### ROUTE_TABLE ;Stores a list of routes ;Status: Mandatory key = ROUTE_TABLE:prefix nexthop = *prefix, ;IP addresses separated “,” (empty indicates no gateway) intf = ifindex? PORT_TABLE.key ; zero or more separated by “,” (zero indicates no interface) blackhole = BIT ; Set to 1 if this route is a blackhole (or null0)

Good point. For routing-app there's this "blackhole" corner-case which brakes the uniformity for routing schema. Basically we are allowing a route to be defined as: "prefix/len | blackhole = true". No other app/scenarios that i'm aware of allow this inconsistency. In my humble opinion, we should keep uniformity by allowing the common schema to be re-utilized as this: "prefix/len | nh: blackhole | if: null". This approach will also simplify the addition of new attributes in the future (AFI, SAFI, etc).

Also, if you look at orchagent you'll see that we are not passing down "blackhole" routes -- they are being eliminated right there in routeOrch. The only blackhole route being pushed down to ASIC_DB is the default route, and this one is not even exposed to AppDB. So in reality, no one is really consuming these black-hole schema.

In my opinion, I think this assert() should be left here as it prevents real problems down the road. And concerning the blackhole routes, i would like to make a very minor adjustment in its representation to avoid this uniformity i talked about.

jipanyang · 2018-09-13T06:29:33Z

fpmsyncd/fpmsyncd.cpp

+                    pipeline.flush();
+                    SWSS_LOG_NOTICE("Pipeline flushed");
+                }
+                else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())


I was not checking the internal logic. If the "else if" covers everything, why don't you just change it to else {}.

Based on your current implementation, it looks some default processing is missing.

if (condition 1) { } else if (condition2) { } else { // default processing; }

jipanyang · 2018-09-13T06:50:08Z

warmrestart/warmRestartHelper.h

+            {
+                if (fvField(fvRight) == fvField(fvLeft))
+                {
+                    return fvLeft < fvRight;


I had a difficult time understanding the logic here.
What if there are multiple fields? It looks whenever the field is checked, it return immediately without checking other fields.

It doesn't really matter how many fields we have as long as the schema they utilize keeps its consistency. The purpose of this function is to enforce an order within the map, so technically there's no real need to compare all the elements within a tuple. For example, if you are comparing these two tuples corresponding to a particular prefix...

left -> nh: 10.1.1.1, if: eth0
right-> nh: 10.1.1.2, if: eth1

... you can exit() the 'comparator' method as long as you have found a common-field within each tuple that can be used for comparison. Please let me know if you can think of a use-case that would require the complete iteration of the tuple.

As i said before, the above can only work if the schema is kept uniform, which is the case for all apps so far (with the exception of the 'blackhole' route).

And yes, i'm planning to add some comments to document what this function does.

jipanyang · 2018-09-13T06:58:26Z

warmrestart/warmRestartHelper.cpp

+
+        std::string newVal = fvValue(fvVector[i]) + "," + fvValue(data[i]);
+
+        fvVector[i].second = newVal;


"fvVector[i].second" and "fvValue(fvVector[i])" should be equal. Use one format.

jipanyang · 2018-09-13T07:02:33Z

warmrestart/warmRestartHelper.cpp

+            else if (recElemState == CLEAN)
+            {
+                cleanRecElems++;
+                transformKFV(recElem, fvVector);


Is the order of fields in fvRestorationMap guaranteed?
I saw you put comment in fvComparator

/* * Notice that we are forced to iterate through all the tuples in the * right-vector given that the order of fields within a tuple is not * fully deterministic (i.e. 'left' could hold 'nh: 1.1.1.1 / if: eth0' * and 'right' could be 'if: eth0, nh: 1.1.1.1'). */

jipanyang · 2018-09-13T23:51:33Z

My general feeling about the implementation is that it might have been focusing too much on the specific syntax of multiple sub-value of kfv: 1.1.1.1/30, vector{nexthop: 10.1.1.1, 10.1.1.2, ifname: eth1, eth2}, it added a lot of logic processing complexity.

Some wild thoughts, could we simplify the logic like this:

Read old data from appDB table.
Start with accumulating KFV data from Applications
Keep them separate from the old data, do not touch the old data.
If same key arrives for the new data, just overwrite the previous one.
Timer expires.
4 Do the comparison in one shot.
yes, we should compare field sub values, but there is no need to mark state for each of them.

What we care about is the whole entry.

rodnymolina · 2018-09-14T03:12:00Z

@jipanyang thanks for your comments.

There are good reasons for this added complexity, which may also provide extra flexibility to other applications in the future:

We need to deal with multi-field values (e.g. 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
We need to deal with different orders in multi-field values (e.g. 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)
And we need to deal with partial updates coming from the north-app (e.g. first we get "1.1.1.1/30 nh: 10.1.1.1, if: eth1", and then we get the complete prefix: "1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2")

I'm not sure how the solution that you proposed would simplify my code while addressing the above requirements. Let's say you keep new/refreshed state in a separated location from the one collected from AppDB. How would you compare these two without doing any splitting:

AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)

I do understand though, that what we care about is the whole entry. But given that we have to split the tuples anyway, why assigning a STATE to them represents an issue? That gives us the possibility of making use of a single data-structure to run the entire reconciliation logic.

jipanyang · 2018-09-14T04:41:58Z

@rodnymolina What is really needed to compare
"
AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)
"
and Also handle the case of different number of fields.

That is different from spitting the fields, assigning state to each of them, and making a lot of hard to understand processing.

If it makes it clear, I could create a temporary pr to your branch which will be based on your current framework and major logic flow but with reduced complexity to show how it works.

nikos-github · 2018-09-14T05:54:15Z

@rodnymolina

One way to compare

AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
and
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)

is to create a nexthopgroup object and you have a pointer from each prefix to a nexthopgroup. For 2 identical prefixes, one recovered from the appdb and the other created after recovery, all you need to do is compare if they point to the same object.

nikos-github · 2018-09-14T21:36:51Z

@rodnymolina @jipanyang @lguohan

Design Comments

I always viewed sonic's *syncd parts (fpmsyncd, neighoborsyncd etc.) as thin stateless layers that provide intermediate yet very simple functionality and means of communication, between major sonic parts. Their functionality only goes as far as knowing the format of the data they are manipulating but not necessarily the meaning of it. This view is also reinforced if one looks at the current code and understands the functionality.

With this and other PRs I see that we are diverging from that paradigm. At the same time I haven't seen compelling technical justification to do so or some existing technical debt that needs to be addressed which pushes us in that direction.

We seem to be converting *syncd parts into a stateful layer with additional needless complexity, code duplication and exploding memory requirements just to satisfy the needs of warm restart.

A divergence from the current design will allow us to get the functionality we want without the drawbacks exhibited here. I will point out the issues I see in the current design and then I will outline solutions to counter them in order of preference.

Current Design

Keeping state per path is needless and results in convoluted comparison code. Routes always get implicitly updated rather than incremental per path updates. Eliminating this will probably result in a much simpler design with likely half the code changes.
Path sharing is not happening. This means every prefix will point to its own set of paths duplicating information on a per prefix basis. For a 4-way ecmp prefix, the memory increase will be roughly 8x. 12K prefixes in fpmsyncd will feel like 100K when doing warm restart in terms of memory consumption. This is too much to ignore.
We have now turned fpmsyncd into a stateful part of sonic. Adding state always increases interaction and complexity by definition.

Alternative Designs

Keep fpmsyncd stateless
This personally is my preferred approach and aligns best with the existing sonic design. During warm restart, fpmsyncd clears the appdb queue, starts a timer and informs orchagent (through appdb). Upon timer expiration it tells orchagent again. The orchagent will be the one responsible for marking prefixes as stale and cleaning up when the timer expires. Currently orchagent has code that suppresses routes where nothing has changed. So it has all the comparison logic in place to achieve that already. Quagga and FRR don't keep quiet upon restarts and churn the prefixes as they are learned. The appdb queue can suppress that if we want to or we can fix the issue in frr with minimum effort and quagga with moderate effort.

Pros:

fpmsyncd remains stateless and with no memory increase.
orchagent is the one doing the route reconciliation (which is where it should be IMO).
With new path attributes, only the format knowledge part of fpmsyncd changes so that it understands what it's reading and how to push that info to the db. So fpmsyncd's warm restart part doesn't need to be retested.

Cons:

Slight increase in complexity from the signalling of the start/stop timer through the appdb.

Stateful fpmsyncd and with path sharing.
Read routes from appdb and partially use the code from routeorch to reconstruct prefixes, ecmp groups etc. and share ecmp groups across prefixes. Read from fpm socket and construct a different set of prefixes but reuse the ecmp groups. On timer expiration, go through the prefixes and compare if they point to the same ecmp group and send only the delta down. Alternatively we can make this even simpler and just download the whole table paying attention only to the stale prefixes which need to be deleted. As I said above, routeorch already has code that suppresses prefixes where nothing has changed so sai/asic won't be notified needlessly.

Pros:

Can minimize to only the delta on what is sent to the db.
Partially share code with routeorch for reconstructing view.
Need to change 2 different components and retest both when new path attributes are introduced (orchagent and fpmsyncd) and retest warm restart.
Path sharing with this solution.

Cons:

fpmsyncd is now stateful and a bit more complex with increased memory needs during a restart.

Stateful fpmsyncd and no path sharing.
Similar to the above but without any path sharing.

Pros:

Can minimize to only the delta on what is sent to the db.

Cons:

fpmsyncd is now stateful and more complex.
Exploding memory consumption requirements.
No code sharing with orchagent for reconstructing view.
Need to change 2 different components and retest both when new path attributes are introduced (orchagent and fpmsyncd) and retest warm restart.

jipanyang · 2018-10-04T19:00:03Z

warmrestart/warmRestartHelper.h

+    bool compareAllFV(const std::vector<FieldValueTuple> &left,
+                      const std::vector<FieldValueTuple> &right);
+
+    bool compareOneFV(const std::string &v1, const std::string &v2);


Should they be private?

sure, i can make them private.

jipanyang · 2018-10-04T19:02:20Z

warmrestart/warmRestartHelper.cpp

+         * also present within v2. Otherwise we are running into some form of
+         * bug or an unsupported functionality.
+         */
+        assert(v1Iter != v1Map.end());


What if v1 is superset of v2?

If by 'superset' you're referring to v1 having more fields than v2, that shouldn't ever happen as you commented further below. I'll add more details to my code-comments about this expectation.

jipanyang · 2018-10-04T20:14:32Z

So the current reconciliation implementation has the assumption that application will always have fix number of FieldValueTuple. Ex. for route BGP always provide two fields "nexthop" and "ifname", for neighbor, they are "family" and "neigh".

If there is any violation of this assumption, the reconciliation won't work. It is also related to the ProducerStateTable issue: sonic-net/sonic-swss-common#236

I think it is ok for now. Further update could be done when more flexible cases emerge. Could you put a little more information about this in the code comment field?

tests/test_warm_reboot.py

jipanyang

Some minor comments. Overall the changes look good to me.

jipanyang · 2018-10-10T23:13:35Z

tests/test_warm_reboot.py

+        db,
+        swsscommon.CFG_WARM_RESTART_TABLE_NAME, app_name,
+        [
+            ("enable", "true"),


Could you use CLI to config it as other test cases?
config warm_restart enable/disable [APP]

Agree, that's the way to go.

jipanyang · 2018-10-10T23:16:11Z

tests/test_warm_reboot.py

+        db,
+        swsscommon.CFG_WARM_RESTART_TABLE_NAME, app_name,
+        [
+            (app_name + "_timer", value),


"config warm_restart bgp_timer" should be prepared as that for arp:

root@sonic:/home/admin# config warm_restart --help Usage: config warm_restart [OPTIONS] COMMAND [ARGS]... warm_restart-related configuration tasks Options: -s, --redis-unix-socket-path TEXT unix socket path for redis connection --help Show this message and exit. Commands: disable enable neighsyncd_timer root@sonic:/home/admin#

To avoid deferring this PR merge, i'll make this small adjustment right after i write this addition to sonic-utils. Then i'll come back to this UT code and will adjust accordingly.

jipanyang · 2018-10-10T23:27:01Z

warmrestart/warmRestartHelper.cpp

+                                 const std::string  &syncTableName,
+                                 const std::string  &dockerName,
+                                 const std::string  &appName) :
+    m_restorationTable(pipeline, syncTableName, false),


What is the reason for using external pipeline for m_restorationTable here? m_restorationTable is for get only. Asking the caller to provide pipeline brings no benefit.

Not sure i got this one. How can we create the table without its pipeline? You mean, using the other table constructor? -- Table(const DBConnector *db, const std::string &tableName). Even if that's what you imply, we would need to create a DBConnector if we don't want the caller to provide it, right?

Please clarify...

Sorry, my previous comment is confusing. WarmStartHelper() has both pipeline and syncTableName as parameters, the only purpose of them is to prepare a Table handle in WarmStartHelper class. Could a Table handle be prepared by the callers directly and remove pipeline/syncTableName, just like what has been done for "ProducerStateTable *syncTable"?

Never mind, I noticed that it is more convenient to have pipeline and table name as parameter with current routesync and neighsync.

jipanyang · 2018-10-10T23:37:00Z

warmrestart/warmRestartHelper.cpp

+ * To be called by each application to obtain the active/inactive state of
+ * warm-restart functionality, and proceed to initialize the FSM accordingly.
+ */
+bool WarmStartHelper::isEnabled(void)


I could not think of a good name for this function. isEnabled() makes people feel it is safe to call it anytime they like to check the state, similar to isReconciled(void) or inProgress(void) , which is actually not true. How about bool WarmStartHelper::checkAndStart(void)?

Yes, i see your point. CheckAndStart() sounds good to me, will change.

heidinet2007 · 2018-10-12T05:13:53Z

fpmsyncd/fpmsyncd.cpp

+                {
+                    warmStartTimer.setInterval(timespec{warmRestartIval, 0});
+                }
+


I have concern on assigning a static value to the warmStartTimer, because it's hard for an admin to identify a good number for such timer:

if it's too short, obviously some routes will be missing.

if it's too long, black hole may occur, as shown in the following example: a new prefix is advertised to BGP peers, peers have the route/path installed immediately and start to forward traffic to the warm rebooted box. However, the prefix is NOT yet installed in warm rebooted box's forwarding table due to the running warmStartTimer (not expire yet), hence packet drop. It looks worse, if the peers do have other paths to reach the prefix, which is true in most of DC setup with ECMP. I think loop may also occur, as long as the rebooted box's control plane is inconsistent with its forwarding plane. The longer the inconsistency, the higher chance the problems happen.

In order to reduce the chance, the forwarding path needs to coordinate with routing/zebra stack and set timer value accordingly.

@heidinet2007 absolutely, we covered these cases that you are describing in my design doc (find link below -- search for Known-Limitations section and also look at the CLI portion).

As mentioned in this doc, for the time being user will need to set both timers (bgp and warm-restart one) to similar/same values to avoid inconsistencies. And yes, i'm already working in an FRR extension that will allow bgp/zebra to notify fpmsyncd whenever all BGP EoR's are received. That should address all these issues.

https://github.com/Azure/SONiC/blob/dcac72377f23521a394694214678ea4450f6f70a/doc/routing-warm-reboot/Routing_Warm_Reboot.md)

It would be hard to deploy routing warm reboot without the auto sync between zebra and fpmsyncd (even though the update-delay timer and warmstarttimer can be tuned), because admins have alternative ways to achieve zero packet loss during routing stack upgrade. When will the auto sync be done? Is it possible to include it in this release?

Second to this request.

Once we have current reconciliation processing merged, the channel for notification of EoR (all) received from BGP to fpmsyncd is necessary for a production quality BGP docker warm reboot feature.

@heidinet2007 Can you please elaborate on this: "because admins have alternative ways to achieve zero packet loss during routing stack upgrade".

I do realize that there's room for suboptimal routing (or even drops) in the interval comprehended between the bgp-peers sending their EoRs, and the warmrestart-timer kicking off, but this is expected to be a very short interval if the timers are properly set.

I have already started to prototype the EoR-to-fpmsyncd solution but that won't be ready for this coming release (only two weeks).

@nikos-github Zooming out a little bit, the fundamental problem that we are trying to fix here is how to close the 'gap' between sonic's control and fwd plane to expedite the reconciliation process. So in my mind, given that SONiC is AFI/SAFI agnostic, there's no real need to treat each AFI/SAFI independently; as long as there's one missing EoR for a given AFI/SAFI, we know that no reconciliation process should take place (obviously, within a certain time boundary).

Of course, this imply that we will be penalizing faster AFI/SAFIs, but the alternative to this approach would be to introduce AFI/SAFI semantics in SONiC, which as you know, is a longer-term project.

And yep, i'm aware of EoR being used outside GR context, it's just a requirement.

I think we'll have AFI info up to fpmsyncd, it is just up to how/when we could use it. Yes, multiple alternatives exist to utilized that.

I agree with @rodnymolina, currently the most urgent issue is to close the 'gap' between sonic's control and fwd plane to expedite the reconciliation process.

Event without explicit EoR, there is implicit EoR, we may refer to bgp update-delay processing.

@jipanyang I believe you are confusing things. There is no notion of an implicit EoR through the expiration of update-delay. The update-delay timer (not processing) is a catch all mechanism for BGP that controls the maximum time BGP will wait after the first neighbor is established and until it starts calculating bestpaths and sending out updates. You can certainly implement an EoR message based on stopping or upon expiration of the update-delay timer and while it is simple, it would also be wrong since you are not guaranteed that bestpath selection is finished, RIB installation is done, etc. In other words sending an EoR to RIB when the update-delay timer expires or is stopped, can most certainly be premature.

@rodnymolina What you are describing can potentially lead you to wait until the update-delay timer expires. Per my comment above, this has a good chance it will be premature in those cases.

It might be confusing to use EoR here, as it's more like a BGP term. What we need here is a RIB/ZEBRA CONVERGED signal to fpm. RIB convergence time has dependancy on routing protocols (here we consider only BGP), current BGP implementation already has a scheme to tell RIB when it receives EoR from all its neighbours. We should be able to utilize it.

Agree, I vote for using something more generic, like FRR's existing EOIU terminology, which can be potentially extended to other protocols in the future too.

jipanyang · 2018-10-19T16:54:22Z

@rodnymolina While we should continue the discussion, the implementation in this PR needs to be closed as soon as possible. It is the base for further optimization.

@nikos-github We have some work to be done to make use of the explicit EoR/Implicit EoR. I'm not saying that update-delay should be used, but the implementation of it could be a reference point. There is more processing needed at zebra. Having more to be done should not be the reason stating it is wrong. I prefer seeing actual effort trying the change and refine it step by step.

All the configured peers, except the shutdown peers, have sent explicit EOR
      (End-Of-RIB) or an implicit-EOR. The first keep-alive after BGP has reached
      Established is considered an implicit-EOR.

nikos-github · 2018-10-19T18:14:05Z

@jipanyang "We have some work to be done to make use of the explicit EoR/Implicit EoR. I'm not saying that update-delay should be used, but the implementation of it could be a reference point." i

It can not - that's precisely the point I was making. You can not infer anything from the explicit EoR/implicit EoR that the nbrs sent as to when the EoR to zebra should be sent. All that tell you is that you can not send earlier than them but that is useless anyway since we didn't want to do that in the first place.

"I prefer seeing actual effort trying the change and refine it step by step."

First you will need to write this up so that it's clear what you proposing. There is no processing needed at zebra other than to make sure the EoR goes through. So for zebra this will be a passthrough because it doesn't hold on to the BGP routes.

We can discuss more offline.

rodnymolina · 2018-10-19T18:40:41Z

@nikos-github @jipanyang i'll have a design-doc ready next week. Let's discuss this EoR feature outside this PR.

lguohan · 2018-10-23T09:09:56Z

@rodnymolina , can you rebase your pr against the latest master and fix the vstest issue?

rodnymolina · 2018-10-24T18:41:42Z

@lguohan, i don't see that VS failure in my setup, which btw is unrelated to my feature. I just rebased and re-tested everything locally, and again all testcases passed. Will send another patch to force our VS tests to run again.

tests/test_warm_reboot.py

lguohan · 2018-10-25T22:15:10Z

@rodnymolina , do not worry about the test_interfacesetmtu, in this case, teamsyncd crashed due to a race condition, we are investigating it.

tests/test_warm_reboot.py

warmrestart/warmRestartHelper.cpp

nikos-github · 2018-10-29T07:50:50Z

Unit tests are missing for some very important cases:

Withdrawals generated for prefixes not in the db.
Withdrawals and then changes for prefixes not in the db.

nikos-github · 2018-10-31T04:16:45Z

warmrestart/warmRestartHelper.cpp

+         * 'delete' from being pushed down to AppDB, so we are handling this case
+         * differently than the 'add' one.
+         */
+        if(refreshedOp == DEL_COMMAND)


Minor thing - please add a space after if and before the (.

lguohan · 2018-10-31T08:36:32Z

tests/conftest.py

-        self.ctn.exec_run("logger {}".format(marker))
+
+        if file:
+            self.runcmd(['sh', '-c', "echo \"{}\" >> {}".format(marker, file)])


there is no guarantee that this will not cut in the middle of a line. I prefer not to add log marker for swss or sairedis file. Instead, we should use SubscribeAsicDbObject to detect changes on asic db or app db instead of relying on the log file.

lguohan · 2018-10-31T09:43:49Z

please refer to https://github.com/Azure/sonic-swss/pull/664/files

Please refer to the corresponding design document for more details: sonic-net/SONiC#239 The following manual UT plan has been successfully executed. UT automation pending. Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested. Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger. ============= IPv4 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED Testcase Suite 2: Making use of sudo service bgp restart” as trigger. ============= IPv4 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED RB= G=lnos-reviewers R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu A= Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation. Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

…case from passing As part of these changes i'm also modifying the ip-addresses of my UT setup to avoid clashes with existing/previous testcases. Signed-off-by: Rodny Molina <rmolina@linkedin.com>

rodnymolina · 2018-11-09T19:01:31Z

@lguohan I just finished refactoring my UTs as per your suggestion. Please take a look at this one when have a chance.

lguohan · 2018-11-09T20:55:56Z

retest this please

lguohan

* Routing-stack warm-reboot feature. Please refer to the corresponding design document for more details: sonic-net/SONiC#239 The following manual UT plan has been successfully executed. UT automation pending. Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested. Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger. ============= IPv4 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED Testcase Suite 2: Making use of sudo service bgp restart” as trigger. ============= IPv4 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED RB= G=lnos-reviewers R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu A= Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Renaming 'restoration' code and making minor adjustments to fpmsyncd Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Eliminating 'state' associated to field-value-tuples Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Adding UTs to cover routing-warm-reboot logic. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Fixing an issue with warm-reboot UTs that prevented an existing test-case from passing As part of these changes i'm also modifying the ip-addresses of my UT setup to avoid clashes with existing/previous testcases. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Making some small adjustments * Addressing review comments for my UT code * Addressing more review comments. * Refactoring UTs to rely on pubsub notifications instead of logs

- dump: dumps the entire state of all the units - blame: provides startup time for each service - plot: generate a svg of the boot order Co-authored-by: lguohan <lguohan@gmail.com>

* [sairedis] [meta] Support connect to existing switch * Remove comment

* Routing-stack warm-reboot feature. Please refer to the corresponding design document for more details: sonic-net/SONiC#239 The following manual UT plan has been successfully executed. UT automation pending. Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested. Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger. ============= IPv4 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart zebra/bgpd: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED Testcase Suite 2: Making use of sudo service bgp restart” as trigger. ============= IPv4 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv4 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED IPv6 prefixes ========== * Restart bgp service/docker: - Verify that all prefixes are properly stale-marked and that no change is pushed to AppDB during reconciliation. - Result: PASSED * Restart bgp service/docker and add one new non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, add prefix in adjacent node before bgp sessions are re-established. - Verify that new-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix. - To produce a route-state-inconsistency, add prefix-path in adjacent node before bgp sessions are re-established. - Verify that new prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Result: PASSED * Restart bgp service/docker and withdraw one ecmp-path from an existing ecmp IPv6 prefix. - To produce a route-state-inconsistency, remove prefix-path in adjacent node before bgp sessions are re-established. - Verify that deleted-prefix-path msg is NOT pushed down to AppDB till reconciliation takes place. - Eventually, during reconciliation, this path will be eliminated as it’s not being refreshed by remote-peer. - Result: PASSED RB= G=lnos-reviewers R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu A= Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Renaming 'restoration' code and making minor adjustments to fpmsyncd Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Eliminating 'state' associated to field-value-tuples Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Adding UTs to cover routing-warm-reboot logic. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Fixing an issue with warm-reboot UTs that prevented an existing test-case from passing As part of these changes i'm also modifying the ip-addresses of my UT setup to avoid clashes with existing/previous testcases. Signed-off-by: Rodny Molina <rmolina@linkedin.com> * Making some small adjustments * Addressing review comments for my UT code * Addressing more review comments. * Refactoring UTs to rely on pubsub notifications instead of logs

stcheng added the Enhancement ➕ label Aug 29, 2018

nikos-github suggested changes Aug 29, 2018

View reviewed changes

nikos-github suggested changes Sep 6, 2018

View reviewed changes

jipanyang reviewed Sep 6, 2018

View reviewed changes

nikos-github suggested changes Sep 6, 2018

View reviewed changes

nikos-github reviewed Sep 7, 2018

View reviewed changes

jipanyang reviewed Sep 13, 2018

View reviewed changes

rodnymolina force-pushed the routing_warmreboot branch from 0852d8e to ba145db Compare October 4, 2018 00:05

jipanyang reviewed Oct 4, 2018

View reviewed changes

rodnymolina commented Oct 6, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

jipanyang reviewed Oct 10, 2018

View reviewed changes

heidinet2007 reviewed Oct 12, 2018

View reviewed changes

rodnymolina force-pushed the routing_warmreboot branch from d8b58e6 to f1dcaa1 Compare October 20, 2018 07:47

lguohan reviewed Oct 25, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

lguohan reviewed Oct 25, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

lguohan reviewed Oct 25, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

lguohan reviewed Oct 25, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

nikos-github suggested changes Oct 29, 2018

View reviewed changes

tests/test_warm_reboot.py Outdated Show resolved Hide resolved

tests/test_warm_reboot.py Show resolved Hide resolved

nikos-github suggested changes Oct 29, 2018

View reviewed changes

warmrestart/warmRestartHelper.cpp Outdated Show resolved Hide resolved

nikos-github approved these changes Oct 31, 2018

View reviewed changes

lguohan reviewed Oct 31, 2018

View reviewed changes

Rodny Molina and others added 9 commits November 8, 2018 17:16

Renaming 'restoration' code and making minor adjustments to fpmsyncd

0493490

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Adding UTs to cover routing-warm-reboot logic.

da4b861

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Fixing an issue with warm-reboot UTs that prevented an existing test-…

8f3a4d3

…case from passing As part of these changes i'm also modifying the ip-addresses of my UT setup to avoid clashes with existing/previous testcases. Signed-off-by: Rodny Molina <rmolina@linkedin.com>

Making some small adjustments

e63c762

Addressing review comments for my UT code

4298647

Addressing more review comments.

6089403

Refactoring UTs to rely on pubsub notifications instead of logs

976803a

rodnymolina force-pushed the routing_warmreboot branch from 88ea718 to 976803a Compare November 9, 2018 07:17

lguohan approved these changes Nov 9, 2018

View reviewed changes

jipanyang approved these changes Nov 9, 2018

View reviewed changes

lguohan merged commit f380685 into sonic-net:master Nov 10, 2018

oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023

[sairedis] [meta] Support connect to existing switch (sonic-net#602)

a77fc0a

* [sairedis] [meta] Support connect to existing switch * Remove comment


		std::string newVal = fvValue(fvVector[i]) + "," + fvValue(data[i]);

		fvVector[i].second = newVal;

Conversation

rodnymolina commented Aug 29, 2018

Uh oh!

nikos-github left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jipanyang left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jipanyang Sep 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rodnymolina Sep 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

lguohan commented Sep 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jipanyang left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

jipanyang Sep 6, 2018 •

edited

Loading

rodnymolina Sep 6, 2018 •

edited

Loading

lguohan commented Sep 12, 2018 •

edited

Loading