Skip to content

Routing-stack warm-reboot feature.#602

Merged
lguohan merged 9 commits intosonic-net:masterfrom
rodnymolina:routing_warmreboot
Nov 10, 2018
Merged

Routing-stack warm-reboot feature.#602
lguohan merged 9 commits intosonic-net:masterfrom
rodnymolina:routing_warmreboot

Conversation

@rodnymolina
Copy link
Copy Markdown
Contributor

[ dependencies with PR/600: https://github.com//pull/600 ]

Please refer to the corresponding design document for more details: sonic-net/SONiC#239

The following manual UT plan has been successfully executed. UT automation pending.

Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested.

Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger.
=============

IPv4 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Testcase Suite 2: Making use of sudo service bgp restart” as trigger.
=============

IPv4 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv4 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv6 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Signed-off-by: Rodny Molina rmolina@linkedin.com

Copy link
Copy Markdown
Contributor

@nikos-github nikos-github left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of initial comments. More to come.

s.addSelectable(&warmStartTimer);
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is work to be done if there is a timer running or a previous reconciliation effort is in progress so likely you need an else part. While the connection was severed (not due to fpmsyncd restart) the warmStartEnabled may have changed state and lots of things won't work after that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github thanks for your comments.

You have a point here. For the above to happen we need an unlikely sequence of events to take place:

  1. two or more zebra crashes happening in a small window of time, and
  2. a user deactivating routing-warm-reboot feature right between the two zebra crashes

Unlikely but possible. I'll adjust my code to deal with this scenario.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If warm restart is found to be disabled, we should still recover and cleanup. There should be no requirement to disrupt the lower sonic layers now that we are adding such infra/functionality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please amend the code to address initial comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github i missed this comment. Yes, i'm about to send one (hopefully final) update that explicitly clears the state of the data-structures used within WarmStartHelper class. That should take care of your previous concerns for the bgp/zebra restarting case.

if (warmStartEnabled)
{
/* Obtain warm-restart timer defined for routing application */
uint32_t warmRestartIval = sync.m_warmStartHelper.getRestartTimer();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe some cleanup is required if there is a timer already running.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i'm taking care of it, see my other comment on this matter.

}

/* Execute recovery instruction and kick off warm-restart timer */
if (sync.m_warmStartHelper.runRecovery())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before recovering the appdb state, you need to either clear or check that the producer queue is empty.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, already took care of that, my next patch will have it.

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do further review once the dependee PR is settled down and code updated.

else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
{
pipeline.flush();
SWSS_LOG_NOTICE("Pipeline flushed");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be too many logs for pipeline flush in normal running?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jipanyang I'm just adding one extra "Pipeline flushed" log, which will be executed only once upon restart-timer expiration. During normal operation we will continue generating this log the same way we are doing it today -- see that i'm not adding this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original log level is DEBUG, now it is changed to NOTICE.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, i can set all "Pipeline flushed" logs to debug.

m_routeTable.del(destipprefix);
return;
if (warmState == WarmStart::INITIALIZED ||
warmState == WarmStart::RECONCILED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"warmState == WarmStart::INITIALIZED", It is confusing, while warm state is INITIALIZED and route is injected to appDB directly. Do you mean warm start has been disabled?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding, based on a recent discussion where we discussed/agreed on this issue, is that INITIALIZED state will be utilized for applications with warm-reboot feature disabled -- so these ones will permanently display INITIALIZED state. Based on that discussion i had to re-adjust my code and add this INITIALIZED checkpoint. My original code was setting all applications to RECONCILED state, regardless of WarmReboot being enable or not, as this is the last FSM stage. Looks like our problem here is simply a consequence of INITIALIZED state being slightly overloaded. The simplest solution is to provided another FSM state (DEACTIVATED). Will ping you offline to discuss FSM semantics.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I missed some of the discussions. My understanding for WarmStart state is that it is only meaningful when warm start has been enabled. If warm start is disabled, people won't care about WarmStart state.
Putting too much state change seems to make the problem unnecessary complicated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you know, applications like fpmSyncd need to be able to invoke WarmRestart logic on-demand, and not only when when it boots-up. So the initialization of WarmRestart logic is not necessarily connected to its activation: you could have WarmRestart being turned on at any given point in the future without having to restart fpmSyncd. To fix this problem there are two possible solutions: 1) we overload one of the current FSM states (INITIALIZED), or 2) we create a new UNINITIALIZED/DEACTIVATED FSM state.

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fpmsyncd case, whenever warm restore/reconcile is needed and warm start has been enabled, the INITIALIZED state could be set before starting restore to indicate that warm start is initiated, after restore, RESTORED; then RECONCILED.

I don't see the need to overload INITIALIZED or add new DEACTIVATED state.

Actually "INITIATED" looks like a better name than "INITIALIZED" if we are talking about warm restart state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you use RECONCILED always. I agree with Jipan that INITIALIZED is confusing and at the same time UNINITIALIZED/DEACTIVATED will be equally confusing if warm restart is turned on after fpmsyncd is up because the state won't match with what's configured.

Copy link
Copy Markdown
Contributor Author

@rodnymolina rodnymolina Sep 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nikos, you are agreeing with me then. This is precisely my original approach: all apps end up in RECONCILED state. As i said above, i had to re-adjust my code to address the 'requirement' of having apps (with warm-reboot OFF) using the INITIALIZED state.

Jipan's approach is not using any state to represent apps with warmReboot turned off, which is something that i'm struggling with. Btw, the approach that you're suggesting wouldn't add a new state, but RECONCILED state would be conveying two semantics: 1) application with warm-reboot ON going through all FSM states, and 2) application with warm-reboot OFF skipping all FSM states. Either way, we are slightly overloading the FSM, which i'm totally fine with. But there's no free lunch here: either we overload or we add a new FSM state.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to point out a distinction between FSM state and whether the feature is enabled or not. Trying to infer what's the feature's state (on or off) from the FSM isn't the right thing in my opinion. If you want to convey both then you either need 2 variables, no overloading of FSM state and introduce another state in the FSM, OR you need 1 variable that has states plus feature enabled/disabled (via a bitmask), no overloading of FSM state or use reconciled. If you don't want any of the above, then from my point of view reconciled is the right state whether the feature is disabled or later enabled.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github just to close the loop here. As discussed with the rest of the folks working in warm-reboot, we will rely on the assumption that no FSM state implicitly indicates the lack of warm-reboot functionality, or the fact that this one hasn't been turned on. The three FSM states (initialized, restored, reconciled) will be exclusively used for apps that are warm-reboot capable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with that. The state should be reconciled though when the app is warm reboot capable but warm reboot isn't enabled or is enabled later.

m_routeTable.del(destipprefix);
return;
if (warmState == WarmStart::INITIALIZED ||
warmState == WarmStart::RECONCILED)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you use RECONCILED always. I agree with Jipan that INITIALIZED is confusing and at the same time UNINITIALIZED/DEACTIVATED will be equally confusing if warm restart is turned on after fpmsyncd is up because the state won't match with what's configured.

* from AppDB.
*
*/
void WarmStartHelper::reconciliate(void)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcile not reconciliate

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

/* Recovery map should be entirely empty by now */
assert(m_recoveryMap.size() == 0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should recover gracefully here by falling back to cold restart.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what do you mean by 'gracefully'. I understand that we should do something about it, but that's something the process-manager (supervisord) should take care of upon the app being cold-booted after the failed assertion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will supervisord achieve a cold boot since warm restart is enabled? In any case, asserting is shouldn't be here in my opinion - that's way too harsh. Flushing the db and re-establishing the connection will cause the routes to re-download. That's why I mean by gracefully.

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Sep 12, 2018

depend on #600

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are existing AppRestartAssist class and neighsyncd reconciliation to be handled?
There should not be two classes performing similar tasks.

else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
{
pipeline.flush();
SWSS_LOG_NOTICE("Pipeline flushed");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original log level is DEBUG, now it is changed to NOTICE.

pipeline.flush();
SWSS_LOG_NOTICE("Pipeline flushed");
}
else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there "else" to take care of remaining conditions?

if()
    {
    }
else if
    {
    }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which other conditions are you referring to? There are only three conditions that we care about here: 1) warmStartTimer expired, which triggers reconciliation, 2) warm-restart disabled -- to deal with legacy/current scenarios, and 3) scenarios with warm-restart reconciliation completed. In 2) and 3) the actions are exactly the same, that's why i'm bundling them together. What would be the use case for the 'else'?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not checking the internal logic. If the "else if" covers everything, why don't you just change it to else {}.

Based on your current implementation, it looks some default processing is missing.

if (condition 1)
{
}
else if (condition2)
{
}
else {
// default processing;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got your point now. The "else if" is there to make it more explicit to the reader that those two conditions could take place. Performance-wise there's no doubt the "else" is slightly cheaper. Have no strong opinion here, i can make use of an "else" if everyone agrees.

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another general comment, VS test cases should be prepared for the changes.

@rodnymolina
Copy link
Copy Markdown
Contributor Author

I'm working on VS test-cases right now.

Answering your other question: the goal is to replace the existing reconciliation logic with the one i'm presenting here (which offers more flexibility). That will require some refactoring work on neighsyncd which i'm planning to carry out once done with the routing portion. At the end there will be a single reconciliation class.

splitValues.push_back("");
}

for (int j = 0; j < static_cast<int>(splitValues.size()); ++j)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you get away from static_cast? Using something like size_type j = 0, or auto j = splitValues.size();


for (int j = 0; j < static_cast<int>(splitValues.size()); ++j)
{
if (j < static_cast<int>(fvVector.size()))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as previous comment.

* The sizes of both tuple-vectors should always match within any given
* application, otherwise we are running into some form of bug.
*/
assert(left.size() == right.size());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When comparing two vectors, how to ensure they are of same size?
Say, a route has two fields in usual case: nexthop and intf, what if for some routes, the third fields "blackhole" is set?

### ROUTE_TABLE
    ;Stores a list of routes
    ;Status: Mandatory
    key           = ROUTE_TABLE:prefix
    nexthop       = *prefix, ;IP addresses separated “,” (empty indicates no gateway)
    intf          = ifindex? PORT_TABLE.key  ; zero or more separated by “,” (zero indicates no interface)
    blackhole     = BIT ; Set to 1 if this route is a blackhole (or null0)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. For routing-app there's this "blackhole" corner-case which brakes the uniformity for routing schema. Basically we are allowing a route to be defined as: "prefix/len | blackhole = true". No other app/scenarios that i'm aware of allow this inconsistency. In my humble opinion, we should keep uniformity by allowing the common schema to be re-utilized as this: "prefix/len | nh: blackhole | if: null". This approach will also simplify the addition of new attributes in the future (AFI, SAFI, etc).

Also, if you look at orchagent you'll see that we are not passing down "blackhole" routes -- they are being eliminated right there in routeOrch. The only blackhole route being pushed down to ASIC_DB is the default route, and this one is not even exposed to AppDB. So in reality, no one is really consuming these black-hole schema.

In my opinion, I think this assert() should be left here as it prevents real problems down the road. And concerning the blackhole routes, i would like to make a very minor adjustment in its representation to avoid this uniformity i talked about.

pipeline.flush();
SWSS_LOG_NOTICE("Pipeline flushed");
}
else if (!warmStartEnabled || sync.m_warmStartHelper.isReconciled())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not checking the internal logic. If the "else if" covers everything, why don't you just change it to else {}.

Based on your current implementation, it looks some default processing is missing.

if (condition 1)
{
}
else if (condition2)
{
}
else {
// default processing;
}

{
if (fvField(fvRight) == fvField(fvLeft))
{
return fvLeft < fvRight;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a difficult time understanding the logic here.
What if there are multiple fields? It looks whenever the field is checked, it return immediately without checking other fields.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't really matter how many fields we have as long as the schema they utilize keeps its consistency. The purpose of this function is to enforce an order within the map, so technically there's no real need to compare all the elements within a tuple. For example, if you are comparing these two tuples corresponding to a particular prefix...

left -> nh: 10.1.1.1, if: eth0
right-> nh: 10.1.1.2, if: eth1

... you can exit() the 'comparator' method as long as you have found a common-field within each tuple that can be used for comparison. Please let me know if you can think of a use-case that would require the complete iteration of the tuple.

As i said before, the above can only work if the schema is kept uniform, which is the case for all apps so far (with the exception of the 'blackhole' route).

And yes, i'm planning to add some comments to document what this function does.


std::string newVal = fvValue(fvVector[i]) + "," + fvValue(data[i]);

fvVector[i].second = newVal;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"fvVector[i].second" and "fvValue(fvVector[i])" should be equal. Use one format.

else if (recElemState == CLEAN)
{
cleanRecElems++;
transformKFV(recElem, fvVector);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of fields in fvRestorationMap guaranteed?
I saw you put comment in fvComparator

/*
             * Notice that we are forced to iterate through all the tuples in the
             * right-vector given that the order of fields within a tuple is not
             * fully deterministic (i.e. 'left' could hold 'nh: 1.1.1.1 / if: eth0'
             * and 'right' could be 'if: eth0, nh: 1.1.1.1').
             */

@jipanyang
Copy link
Copy Markdown
Contributor

jipanyang commented Sep 13, 2018

My general feeling about the implementation is that it might have been focusing too much on the specific syntax of multiple sub-value of kfv: 1.1.1.1/30, vector{nexthop: 10.1.1.1, 10.1.1.2, ifname: eth1, eth2}, it added a lot of logic processing complexity.

Some wild thoughts, could we simplify the logic like this:

  1. Read old data from appDB table.

  2. Start with accumulating KFV data from Applications
    Keep them separate from the old data, do not touch the old data.
    If same key arrives for the new data, just overwrite the previous one.

  3. Timer expires.
    4 Do the comparison in one shot.
    yes, we should compare field sub values, but there is no need to mark state for each of them.

What we care about is the whole entry.

@rodnymolina
Copy link
Copy Markdown
Contributor Author

@jipanyang thanks for your comments.

There are good reasons for this added complexity, which may also provide extra flexibility to other applications in the future:

  • We need to deal with multi-field values (e.g. 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
  • We need to deal with different orders in multi-field values (e.g. 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)
  • And we need to deal with partial updates coming from the north-app (e.g. first we get "1.1.1.1/30 nh: 10.1.1.1, if: eth1", and then we get the complete prefix: "1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2")

I'm not sure how the solution that you proposed would simplify my code while addressing the above requirements. Let's say you keep new/refreshed state in a separated location from the one collected from AppDB. How would you compare these two without doing any splitting:

AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)

I do understand though, that what we care about is the whole entry. But given that we have to split the tuples anyway, why assigning a STATE to them represents an issue? That gives us the possibility of making use of a single data-structure to run the entire reconciliation logic.

@jipanyang
Copy link
Copy Markdown
Contributor

@rodnymolina What is really needed to compare
"
AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)
"
and Also handle the case of different number of fields.

That is different from spitting the fields, assigning state to each of them, and making a lot of hard to understand processing.

If it makes it clear, I could create a temporary pr to your branch which will be based on your current framework and major logic flow but with reduced complexity to show how it works.

@nikos-github
Copy link
Copy Markdown
Contributor

@rodnymolina

One way to compare

AppDB state: 1.1.1.1/30 nh: 10.1.1.1, 10.2.2.2, if: eth1, eth2)
and
Zebra state: 1.1.1.1/30 nh: 10.2.2.2, 10.1.1.1, if: eth2, eth1)

is to create a nexthopgroup object and you have a pointer from each prefix to a nexthopgroup. For 2 identical prefixes, one recovered from the appdb and the other created after recovery, all you need to do is compare if they point to the same object.

@nikos-github
Copy link
Copy Markdown
Contributor

nikos-github commented Sep 14, 2018

@rodnymolina @jipanyang @lguohan

Design Comments

I always viewed sonic's *syncd parts (fpmsyncd, neighoborsyncd etc.) as thin stateless layers that provide intermediate yet very simple functionality and means of communication, between major sonic parts. Their functionality only goes as far as knowing the format of the data they are manipulating but not necessarily the meaning of it. This view is also reinforced if one looks at the current code and understands the functionality.

With this and other PRs I see that we are diverging from that paradigm. At the same time I haven't seen compelling technical justification to do so or some existing technical debt that needs to be addressed which pushes us in that direction.

We seem to be converting *syncd parts into a stateful layer with additional needless complexity, code duplication and exploding memory requirements just to satisfy the needs of warm restart.

A divergence from the current design will allow us to get the functionality we want without the drawbacks exhibited here. I will point out the issues I see in the current design and then I will outline solutions to counter them in order of preference.

Current Design

  • Keeping state per path is needless and results in convoluted comparison code. Routes always get implicitly updated rather than incremental per path updates. Eliminating this will probably result in a much simpler design with likely half the code changes.
  • Path sharing is not happening. This means every prefix will point to its own set of paths duplicating information on a per prefix basis. For a 4-way ecmp prefix, the memory increase will be roughly 8x. 12K prefixes in fpmsyncd will feel like 100K when doing warm restart in terms of memory consumption. This is too much to ignore.
  • We have now turned fpmsyncd into a stateful part of sonic. Adding state always increases interaction and complexity by definition.

Alternative Designs

  1. Keep fpmsyncd stateless
    This personally is my preferred approach and aligns best with the existing sonic design. During warm restart, fpmsyncd clears the appdb queue, starts a timer and informs orchagent (through appdb). Upon timer expiration it tells orchagent again. The orchagent will be the one responsible for marking prefixes as stale and cleaning up when the timer expires. Currently orchagent has code that suppresses routes where nothing has changed. So it has all the comparison logic in place to achieve that already. Quagga and FRR don't keep quiet upon restarts and churn the prefixes as they are learned. The appdb queue can suppress that if we want to or we can fix the issue in frr with minimum effort and quagga with moderate effort.

Pros:

  • fpmsyncd remains stateless and with no memory increase.
  • orchagent is the one doing the route reconciliation (which is where it should be IMO).
  • With new path attributes, only the format knowledge part of fpmsyncd changes so that it understands what it's reading and how to push that info to the db. So fpmsyncd's warm restart part doesn't need to be retested.

Cons:

  • Slight increase in complexity from the signalling of the start/stop timer through the appdb.
  1. Stateful fpmsyncd and with path sharing.
    Read routes from appdb and partially use the code from routeorch to reconstruct prefixes, ecmp groups etc. and share ecmp groups across prefixes. Read from fpm socket and construct a different set of prefixes but reuse the ecmp groups. On timer expiration, go through the prefixes and compare if they point to the same ecmp group and send only the delta down. Alternatively we can make this even simpler and just download the whole table paying attention only to the stale prefixes which need to be deleted. As I said above, routeorch already has code that suppresses prefixes where nothing has changed so sai/asic won't be notified needlessly.

Pros:

  • Can minimize to only the delta on what is sent to the db.
  • Partially share code with routeorch for reconstructing view.
  • Need to change 2 different components and retest both when new path attributes are introduced (orchagent and fpmsyncd) and retest warm restart.
  • Path sharing with this solution.

Cons:

  • fpmsyncd is now stateful and a bit more complex with increased memory needs during a restart.
  1. Stateful fpmsyncd and no path sharing.
    Similar to the above but without any path sharing.

Pros:

  • Can minimize to only the delta on what is sent to the db.

Cons:

  • fpmsyncd is now stateful and more complex.
  • Exploding memory consumption requirements.
  • No code sharing with orchagent for reconstructing view.
  • Need to change 2 different components and retest both when new path attributes are introduced (orchagent and fpmsyncd) and retest warm restart.

bool compareAllFV(const std::vector<FieldValueTuple> &left,
const std::vector<FieldValueTuple> &right);

bool compareOneFV(const std::string &v1, const std::string &v2);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should they be private?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i can make them private.

* also present within v2. Otherwise we are running into some form of
* bug or an unsupported functionality.
*/
assert(v1Iter != v1Map.end());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if v1 is superset of v2?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If by 'superset' you're referring to v1 having more fields than v2, that shouldn't ever happen as you commented further below. I'll add more details to my code-comments about this expectation.

@jipanyang
Copy link
Copy Markdown
Contributor

So the current reconciliation implementation has the assumption that application will always have fix number of FieldValueTuple. Ex. for route BGP always provide two fields "nexthop" and "ifname", for neighbor, they are "family" and "neigh".

If there is any violation of this assumption, the reconciliation won't work. It is also related to the ProducerStateTable issue: sonic-net/sonic-swss-common#236

I think it is ok for now. Further update could be done when more flexible cases emerge. Could you put a little more information about this in the code comment field?

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. Overall the changes look good to me.

db,
swsscommon.CFG_WARM_RESTART_TABLE_NAME, app_name,
[
("enable", "true"),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use CLI to config it as other test cases?
config warm_restart enable/disable [APP]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, that's the way to go.

db,
swsscommon.CFG_WARM_RESTART_TABLE_NAME, app_name,
[
(app_name + "_timer", value),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"config warm_restart bgp_timer" should be prepared as that for arp:

root@sonic:/home/admin# config warm_restart --help
Usage: config warm_restart [OPTIONS] COMMAND [ARGS]...

  warm_restart-related configuration tasks

Options:
  -s, --redis-unix-socket-path TEXT
                                  unix socket path for redis connection
  --help                          Show this message and exit.

Commands:
  disable
  enable
  neighsyncd_timer
root@sonic:/home/admin# 

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid deferring this PR merge, i'll make this small adjustment right after i write this addition to sonic-utils. Then i'll come back to this UT code and will adjust accordingly.

const std::string &syncTableName,
const std::string &dockerName,
const std::string &appName) :
m_restorationTable(pipeline, syncTableName, false),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for using external pipeline for m_restorationTable here? m_restorationTable is for get only. Asking the caller to provide pipeline brings no benefit.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure i got this one. How can we create the table without its pipeline? You mean, using the other table constructor? -- Table(const DBConnector *db, const std::string &tableName). Even if that's what you imply, we would need to create a DBConnector if we don't want the caller to provide it, right?

Please clarify...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, my previous comment is confusing. WarmStartHelper() has both pipeline and syncTableName as parameters, the only purpose of them is to prepare a Table handle in WarmStartHelper class. Could a Table handle be prepared by the callers directly and remove pipeline/syncTableName, just like what has been done for "ProducerStateTable *syncTable"?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, I noticed that it is more convenient to have pipeline and table name as parameter with current routesync and neighsync.

* To be called by each application to obtain the active/inactive state of
* warm-restart functionality, and proceed to initialize the FSM accordingly.
*/
bool WarmStartHelper::isEnabled(void)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not think of a good name for this function. isEnabled() makes people feel it is safe to call it anytime they like to check the state, similar to isReconciled(void) or inProgress(void) , which is actually not true. How about bool WarmStartHelper::checkAndStart(void)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, i see your point. CheckAndStart() sounds good to me, will change.

{
warmStartTimer.setInterval(timespec{warmRestartIval, 0});
}

Copy link
Copy Markdown

@heidinet2007 heidinet2007 Oct 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have concern on assigning a static value to the warmStartTimer, because it's hard for an admin to identify a good number for such timer:

  • if it's too short, obviously some routes will be missing.
  • if it's too long, black hole may occur, as shown in the following example: a new prefix is advertised to BGP peers, peers have the route/path installed immediately and start to forward traffic to the warm rebooted box. However, the prefix is NOT yet installed in warm rebooted box's forwarding table due to the running warmStartTimer (not expire yet), hence packet drop. It looks worse, if the peers do have other paths to reach the prefix, which is true in most of DC setup with ECMP. I think loop may also occur, as long as the rebooted box's control plane is inconsistent with its forwarding plane. The longer the inconsistency, the higher chance the problems happen.

In order to reduce the chance, the forwarding path needs to coordinate with routing/zebra stack and set timer value accordingly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heidinet2007 absolutely, we covered these cases that you are describing in my design doc (find link below -- search for Known-Limitations section and also look at the CLI portion).

As mentioned in this doc, for the time being user will need to set both timers (bgp and warm-restart one) to similar/same values to avoid inconsistencies. And yes, i'm already working in an FRR extension that will allow bgp/zebra to notify fpmsyncd whenever all BGP EoR's are received. That should address all these issues.

https://github.com/Azure/SONiC/blob/dcac72377f23521a394694214678ea4450f6f70a/doc/routing-warm-reboot/Routing_Warm_Reboot.md)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be hard to deploy routing warm reboot without the auto sync between zebra and fpmsyncd (even though the update-delay timer and warmstarttimer can be tuned), because admins have alternative ways to achieve zero packet loss during routing stack upgrade. When will the auto sync be done? Is it possible to include it in this release?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second to this request.

Once we have current reconciliation processing merged, the channel for notification of EoR (all) received from BGP to fpmsyncd is necessary for a production quality BGP docker warm reboot feature.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heidinet2007 Can you please elaborate on this: "because admins have alternative ways to achieve zero packet loss during routing stack upgrade".

I do realize that there's room for suboptimal routing (or even drops) in the interval comprehended between the bgp-peers sending their EoRs, and the warmrestart-timer kicking off, but this is expected to be a very short interval if the timers are properly set.

I have already started to prototype the EoR-to-fpmsyncd solution but that won't be ready for this coming release (only two weeks).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nikos-github Zooming out a little bit, the fundamental problem that we are trying to fix here is how to close the 'gap' between sonic's control and fwd plane to expedite the reconciliation process. So in my mind, given that SONiC is AFI/SAFI agnostic, there's no real need to treat each AFI/SAFI independently; as long as there's one missing EoR for a given AFI/SAFI, we know that no reconciliation process should take place (obviously, within a certain time boundary).

Of course, this imply that we will be penalizing faster AFI/SAFIs, but the alternative to this approach would be to introduce AFI/SAFI semantics in SONiC, which as you know, is a longer-term project.

And yep, i'm aware of EoR being used outside GR context, it's just a requirement.

Copy link
Copy Markdown
Contributor

@jipanyang jipanyang Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll have AFI info up to fpmsyncd, it is just up to how/when we could use it. Yes, multiple alternatives exist to utilized that.

I agree with @rodnymolina, currently the most urgent issue is to close the 'gap' between sonic's control and fwd plane to expedite the reconciliation process.

Event without explicit EoR, there is implicit EoR, we may refer to bgp update-delay processing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jipanyang I believe you are confusing things. There is no notion of an implicit EoR through the expiration of update-delay. The update-delay timer (not processing) is a catch all mechanism for BGP that controls the maximum time BGP will wait after the first neighbor is established and until it starts calculating bestpaths and sending out updates. You can certainly implement an EoR message based on stopping or upon expiration of the update-delay timer and while it is simple, it would also be wrong since you are not guaranteed that bestpath selection is finished, RIB installation is done, etc. In other words sending an EoR to RIB when the update-delay timer expires or is stopped, can most certainly be premature.

@rodnymolina What you are describing can potentially lead you to wait until the update-delay timer expires. Per my comment above, this has a good chance it will be premature in those cases.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be confusing to use EoR here, as it's more like a BGP term. What we need here is a RIB/ZEBRA CONVERGED signal to fpm. RIB convergence time has dependancy on routing protocols (here we consider only BGP), current BGP implementation already has a scheme to tell RIB when it receives EoR from all its neighbours. We should be able to utilize it.

Copy link
Copy Markdown
Contributor Author

@rodnymolina rodnymolina Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I vote for using something more generic, like FRR's existing EOIU terminology, which can be potentially extended to other protocols in the future too.

@jipanyang
Copy link
Copy Markdown
Contributor

@rodnymolina While we should continue the discussion, the implementation in this PR needs to be closed as soon as possible. It is the base for further optimization.

@nikos-github We have some work to be done to make use of the explicit EoR/Implicit EoR. I'm not saying that update-delay should be used, but the implementation of it could be a reference point. There is more processing needed at zebra. Having more to be done should not be the reason stating it is wrong. I prefer seeing actual effort trying the change and refine it step by step.

All the configured peers, except the shutdown peers, have sent explicit EOR
      (End-Of-RIB) or an implicit-EOR. The first keep-alive after BGP has reached
      Established is considered an implicit-EOR.

@nikos-github
Copy link
Copy Markdown
Contributor

nikos-github commented Oct 19, 2018

@jipanyang "We have some work to be done to make use of the explicit EoR/Implicit EoR. I'm not saying that update-delay should be used, but the implementation of it could be a reference point." i

It can not - that's precisely the point I was making. You can not infer anything from the explicit EoR/implicit EoR that the nbrs sent as to when the EoR to zebra should be sent. All that tell you is that you can not send earlier than them but that is useless anyway since we didn't want to do that in the first place.

"I prefer seeing actual effort trying the change and refine it step by step."

First you will need to write this up so that it's clear what you proposing. There is no processing needed at zebra other than to make sure the EoR goes through. So for zebra this will be a passthrough because it doesn't hold on to the BGP routes.

We can discuss more offline.

@rodnymolina
Copy link
Copy Markdown
Contributor Author

@nikos-github @jipanyang i'll have a design-doc ready next week. Let's discuss this EoR feature outside this PR.

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Oct 23, 2018

@rodnymolina , can you rebase your pr against the latest master and fix the vstest issue?

@rodnymolina
Copy link
Copy Markdown
Contributor Author

@lguohan, i don't see that VS failure in my setup, which btw is unrelated to my feature. I just rebased and re-tested everything locally, and again all testcases passed. Will send another patch to force our VS tests to run again.

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Oct 25, 2018

@rodnymolina , do not worry about the test_interfacesetmtu, in this case, teamsyncd crashed due to a race condition, we are investigating it.

@nikos-github
Copy link
Copy Markdown
Contributor

Unit tests are missing for some very important cases:

  • Withdrawals generated for prefixes not in the db.
  • Withdrawals and then changes for prefixes not in the db.

* 'delete' from being pushed down to AppDB, so we are handling this case
* differently than the 'add' one.
*/
if(refreshedOp == DEL_COMMAND)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing - please add a space after if and before the (.

self.ctn.exec_run("logger {}".format(marker))

if file:
self.runcmd(['sh', '-c', "echo \"{}\" >> {}".format(marker, file)])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no guarantee that this will not cut in the middle of a line. I prefer not to add log marker for swss or sairedis file. Instead, we should use SubscribeAsicDbObject to detect changes on asic db or app db instead of relying on the log file.

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Oct 31, 2018

Rodny Molina and others added 9 commits November 8, 2018 17:16
Please refer to the corresponding design document for more details: sonic-net/SONiC#239

The following manual UT plan has been successfully executed. UT automation pending.

Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested.

Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger.
=============

IPv4 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Testcase Suite 2: Making use of sudo service bgp restart” as trigger.
=============

IPv4 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv4 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv6 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

RB=
G=lnos-reviewers
R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu
A=

Signed-off-by: Rodny Molina <rmolina@linkedin.com>
Signed-off-by: Rodny Molina <rmolina@linkedin.com>
Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>
Signed-off-by: Rodny Molina <rmolina@linkedin.com>
…case from passing

As part of these changes i'm also modifying the ip-addresses of my UT setup to
avoid clashes with existing/previous testcases.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>
@rodnymolina
Copy link
Copy Markdown
Contributor Author

@lguohan I just finished refactoring my UTs as per your suggestion. Please take a look at this one when have a chance.

@lguohan
Copy link
Copy Markdown
Contributor

lguohan commented Nov 9, 2018

retest this please

Copy link
Copy Markdown
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@lguohan lguohan merged commit f380685 into sonic-net:master Nov 10, 2018
batmancn pushed a commit to batmancn/sonic-swss that referenced this pull request Feb 13, 2019
* Routing-stack warm-reboot feature.

Please refer to the corresponding design document for more details: sonic-net/SONiC#239

The following manual UT plan has been successfully executed. UT automation pending.

Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested.

Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger.
=============

IPv4 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Testcase Suite 2: Making use of sudo service bgp restart” as trigger.
=============

IPv4 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv4 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv6 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

RB=
G=lnos-reviewers
R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu
A=

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Renaming 'restoration' code and making minor adjustments to fpmsyncd

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Eliminating 'state' associated to field-value-tuples

Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Adding UTs to cover routing-warm-reboot logic.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Fixing an issue with warm-reboot UTs that prevented an existing test-case from passing

As part of these changes i'm also modifying the ip-addresses of my UT setup to
avoid clashes with existing/previous testcases.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Making some small adjustments

* Addressing review comments for my UT code

* Addressing more review comments.

* Refactoring UTs to rely on pubsub notifications instead of logs
EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
- dump: dumps the entire state of all the units
 - blame: provides startup time for each service
 - plot: generate a svg of the boot order

Co-authored-by: lguohan <lguohan@gmail.com>
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
* [sairedis] [meta] Support connect to existing switch

* Remove comment
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
* Routing-stack warm-reboot feature.

Please refer to the corresponding design document for more details: sonic-net/SONiC#239

The following manual UT plan has been successfully executed. UT automation pending.

Physical topology formed by various BGP peers connecting to the DUT. Both non-ecmp and ecmp prefixes are learned/tested.

Testcase Suite 1: Making use of “pkill -9 bgpd && pkill -9 zebra” as trigger.
=============

IPv4 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv4
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart zebra/bgpd:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart zebra/bgpd and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart zebra/bgpd and withdraw one ecmp-path from an existing ecmp IPv6
      prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

Testcase Suite 2: Making use of sudo service bgp restart” as trigger.
=============

IPv4 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv4 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv4 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv4 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

IPv6 prefixes
==========

    * Restart bgp service/docker:
            - Verify that all prefixes are properly stale-marked and that no
    change is pushed to AppDB during reconciliation.
            - Result: PASSED

    * Restart bgp service/docker and add one new non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, add prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that new-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one non-ecmp IPv6 prefix.
            - To produce a route-state-inconsistency, remove prefix in adjacent
    node before bgp sessions are re-established.
            - Verify that deleted-prefix msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and add one new path to an IPv6 ecmp-prefix.
            - To produce a route-state-inconsistency, add prefix-path in
    adjacent node before bgp sessions are re-established.
            - Verify that new prefix-path msg is NOT pushed down to AppDB till
    reconciliation takes place.
            - Result: PASSED

    * Restart bgp service/docker and withdraw one ecmp-path from an existing
      ecmp IPv6 prefix.
              - To produce a route-state-inconsistency, remove prefix-path in
      adjacent node before bgp sessions are re-established.
              - Verify that deleted-prefix-path msg is NOT pushed down to AppDB
      till reconciliation takes place.
              - Eventually, during reconciliation, this path will be eliminated
      as it’s not being refreshed by remote-peer.
              - Result: PASSED

RB=
G=lnos-reviewers
R=pchaudhary,pmao,rmolina,samaity,sfardeen,zxu
A=

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Renaming 'restoration' code and making minor adjustments to fpmsyncd

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Eliminating 'state' associated to field-value-tuples

Code has been refactored to reduce the complexity associated to carrying 'state' for every received FV-tuple. Obviously, there's no free-lunch here, this is only possible at the expense of more memory utilization: we are now using two different buffers to store 'old' and 'new' state. Yet, this is a relatively-small price to pay for a much simpler implementation.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Adding UTs to cover routing-warm-reboot logic.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Fixing an issue with warm-reboot UTs that prevented an existing test-case from passing

As part of these changes i'm also modifying the ip-addresses of my UT setup to
avoid clashes with existing/previous testcases.

Signed-off-by: Rodny Molina <rmolina@linkedin.com>

* Making some small adjustments

* Addressing review comments for my UT code

* Addressing more review comments.

* Refactoring UTs to rely on pubsub notifications instead of logs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants