[ warm-reboot ] adapted advanced-reboot for warm-reboot #776
liat-grozovik merged 9 commits into sonic-net:master
Conversation
pavel-shirshov
left a comment
Looks good, but please fix my comments
| sender_start = datetime.datetime.now() |
| self.log("Sender started at %s" % str(sender_start)) |
| for entry in packets_list: |
| time.sleep(interval) |
I'd remove this. Python is slow enough.
The sleep interval is quite small (0.0035 s); I suspect you won't get that granularity. Plus, as Pavel mentioned, your send rate might already be slower than that. Curious to know if you have checked the sending rate without the sleep?
Python is slow, but it is able to send ~500 packets/second.
That's good for precision, but not for scapy.
The advised time window to detect a warm-reboot disruption is ~180 sec.
In this case:
180 x 500 x 2 = 180,000 packets (in and out).
Besides, after the warm-reboot the switch becomes flat and floods packets for ~10 seconds out of 32 ports:
- 10 x 500 x 30 = 150,000 packets more.
Total: 330,000 packets in the capture.
In my builds, the biggest capture scapy.sniff produced was 200,000 packets.
Having added an inter-packet interval, I reduced the sending rate to ~250 packets/second:
180 x 250 x 2 = 90,000 (in and out).
- 10 x 250 x 30 = 75,000 flooded packets.
Total: 165,000 packets in the capture.
I added the inter-packet interval to slow down the rate, sacrificing precision (to ~5-10 ms).
It halved the capture size.
To overcome this case I would consider another approach:
make the Sniffer sniff only on the 5 selected ports (1 VLAN + 4 LAG), which would get rid of the floods.
But in that case the sniff must be split into 5 threads (as scapy.sniff accepts one port), and then all captures combined and analyzed.
But I think that is a matter of a further upgrade of the sender and sniffer.
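The pacing trade-off described above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual sender: `send_fn` is a placeholder for something like scapy's `sendp`, and the rates quoted in the thread come from the author's measurements, not from this code.

```python
import time

def send_paced(packets, interval=0.0035, send_fn=None):
    """Send packets with a fixed inter-packet sleep to cap the rate.

    With interval=0.0035 the nominal target is ~285 pkt/s, but
    time.sleep granularity is platform-dependent (~5-10 ms), so the
    achieved rate ends up lower (~250 pkt/s in the thread above).
    Returns (packets sent, achieved rate in pkt/s).
    """
    sent = 0
    start = time.monotonic()
    for pkt in packets:
        if send_fn is not None:
            send_fn(pkt)          # e.g. scapy's sendp(pkt, iface=...)
        sent += 1
        time.sleep(interval)      # the inter-packet interval in question
    elapsed = time.monotonic() - start
    return sent, sent / elapsed
```

A sniffer that only has to keep up with ~250 pkt/s per direction is what keeps the total capture under scapy's practical limit discussed above.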
| self.sender_thr = threading.Thread(target = self.send_in_background) |
| self.sniff_thr = threading.Thread(target = self.sniff_in_background) |
| self.sniff_thr.start() |
| time.sleep(1) # Let the listener initialize completely. |
Probably it's better to use some sync primitive to be sure you run your sender when your sniffer is ready.
scapy.sniff is slow and unstable on start (it misses packets),
and stable capturing actually begins ~1 s after Python has called scapy.sniff.
I offloaded scapy.sniff() to a separate thread
and implemented the explicit time.sleep() inside sniff_in_background(), not in the main thread.
Now there is a wait event between the sender and the sniffer.
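The hand-off described here can be sketched with a `threading.Event`. This is a minimal illustration, not the PR's actual code: `sniff_in_background` below is a stub standing in for the real scapy sniffer, and the warm-up sleep is shortened.

```python
import threading
import time

def sniff_in_background(ready, results):
    # Stand-in for scapy.sniff(): signal readiness only after the
    # warm-up delay the capture start needs (~1 s in the real test).
    time.sleep(0.1)           # shortened stand-in for the warm-up
    ready.set()               # the sender may start now
    results.append("sniffing")

def run():
    ready = threading.Event()
    results = []
    sniffer = threading.Thread(target=sniff_in_background,
                               args=(ready, results))
    sniffer.start()
    # Block the sender until the sniffer reports it is capturing,
    # instead of an unconditional time.sleep(1) in the main thread.
    ready.wait(timeout=5)
    # ... start sending here ...
    sniffer.join()
    return results
```

The advantage over a fixed sleep is that the sender starts as soon as the sniffer is actually ready, no earlier and not much later.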
| # Pre-generate list of packets to be sent in send_in_background method. |
| generate_start = datetime.datetime.now() |
| self.generate_bidirectional() |
| self.log("%s packets are ready after: %s" % (len(self.packets_list), str(datetime.datetime.now() - generate_start))) |
Probably it's better to use %d instead of the first %s.
Improved a few more log messages.
| ip_src = self.from_server_src_addr,\ |
| ip_dst = self.from_server_dst_addr,\ |
| udp_sport = 1234,\ |
| udp_dport = 5000,\ |
You don't need '\' here.
Removed the backslashes.
I took your change for a quick test, and I got the following error: ok: [str-7260cx3-acs-2] => { Am I missing something? Can you make the matching change in the test script?
yxieca
left a comment
Please fix the test failure
Hi Ying,
Here is an example (from advanced-reboot.py, line 2) of how to call the test:
And here is how we call it from Ansible:
Made to 201811 branch on 1/25/2019
Description of PR
Summary: Support of warm-reboot for the advanced-reboot test.
Fixes # (issue)
Type of change
Approach
How did you do it?
Advanced reboot may now be called from an Ansible playbook as follows:
ansible-playbook ... -e testcase_name=[ warm-reboot | fast-reboot ] [ -e reboot_limit=1 ]
warm-reboot.yml and fast-reboot.yml now both call advanced-reboot.yml with the values:
Default values (if not given explicitly to ansible-playbook) are 30 for fast-reboot and 0 for warm-reboot (defined in fast-reboot.yml and warm-reboot.yml, respectively).
advanced-reboot.yml calls advanced-reboot.py with the above values (reboot_type, reboot_limit).
advanced-reboot.py now has a different flow for warm-reboot (selected from the previous fast-reboot flow by an if branch).
The warm-reboot flow is as follows:
Line 577: during the setUp phase, pre-generate a bidirectional list of packets (t1<->vlan) to be sent later in the background. IP address randomization is the same as in generate_from_t1.
Pre-generating beforehand saves time for the PTF docker later and lets it focus on fast sending of the ready packets.
The functionality is implemented at line 722.
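As a rough illustration of the pre-generation step, the sketch below builds a bidirectional descriptor list with randomized host octets. The function name mirrors `generate_bidirectional`, but the subnets and the tuple format are hypothetical; the real code in advanced-reboot.py builds actual scapy frames.

```python
import random

def generate_bidirectional(n_pairs, vlan_net="10.0.0.", t1_net="192.168.0."):
    """Pre-build a bidirectional (t1<->vlan) packet descriptor list.

    Hypothetical stand-in: each entry is (direction, src_ip, dst_ip)
    with a randomized last octet, so the send loop later only has to
    transmit ready-made entries instead of constructing packets.
    """
    packets = []
    for _ in range(n_pairs):
        vlan_host = vlan_net + str(random.randint(2, 254))
        t1_host = t1_net + str(random.randint(2, 254))
        packets.append(("t1->vlan", t1_host, vlan_host))   # inbound leg
        packets.append(("vlan->t1", vlan_host, t1_host))   # outbound leg
    return packets
```

Doing this once in setUp moves all the per-packet construction cost out of the timed send loop.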
RunTest phase:
The fast-reboot implementation has not been affected or changed.
One thread sends the pre-generated packet list (implemented at line 1032).
Another captures in/out traffic on the PTF and dumps it to a pcap file for later debugging (implemented at line 1047).
Identify the longest disruption (by continuous packet losses and duration).
Calculate the total packet losses and the total duration of all disruptions.
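The disruption analysis in the last two steps can be sketched as follows. This is a simplified model, not the PR's implementation: it assumes a sorted list of arrival timestamps for the received packets, and the gap threshold is illustrative.

```python
def longest_disruption(timestamps, gap_threshold=0.05):
    """Analyze gaps between consecutive received packets.

    timestamps: sorted arrival times (seconds) of received packets.
    Returns ((start, duration) of the longest gap exceeding
    gap_threshold, total disrupted time over all such gaps).
    """
    longest = (None, 0.0)
    total = 0.0
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > gap_threshold:
            total += gap               # accumulate every disruption
            if gap > longest[1]:
                longest = (prev, gap)  # remember the worst one
    return longest, total
```

With a known constant send rate, the packet loss for each disruption can then be estimated as duration times the rate.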
How did you verify/test it?
(to check if warm-reboot flow is suitable for fast reboot as well).
(to check if the test shows no disruptions when a reboot actually does not happen).
Any platform specific information?
Supported testbed topology if it's a new test case?
I ran the test only on T0 topology.
Documentation