[ warm-reboot ] adapted advanced-reboot for warm-reboot #776
liat-grozovik merged 9 commits into sonic-net:master
Conversation
pavel-shirshov
left a comment
Looks good, but please fix my comments
| sender_start = datetime.datetime.now() |
| self.log("Sender started at %s" % str(sender_start)) |
| for entry in packets_list: |
| time.sleep(interval) |
I'd remove this. Python is slow enough.
The sleep interval is quite small (0.0035 s); I suspect you won't get that granularity. Plus, as Pavel mentioned, your send rate might already be slower than that. Curious to know if you have checked the sending rate without the sleep?
Python is slow, but it is able to send ~500 packets/second.
That's good for precision, but not for scapy.
The advised time window to detect a warm-reboot disruption is ~180 sec.
In this case:
180 x 500 x 2 = 180,000 packets (in and out).
Besides, after the warm-reboot the switch becomes flat and floods packets for ~10 seconds out of 32 ports:
- 10 x 500 x 30 = 150,000 packets more.
Total: 330,000 packets in the capture.
In my builds, the biggest capture scapy.sniff produced was 200,000 packets.
Having added an inter-packet interval, I reduced the sending rate to ~250 packets/second:
180 x 250 x 2 = 90,000 (in and out).
- 10 x 250 x 30 = 75,000 flooded packets.
Total: 165,000 packets in the capture.
I added the inter-packet interval to slow down the rate, sacrificing precision (to ~5-10 ms).
It halved the capture size.
To overcome this case I would consider another approach:
make the Sniffer sniff only on the 5 selected ports (1 VLAN + 4 LAG), which would get rid of the floods.
But in that case the sniff must be split into 5 threads (as scapy.sniff accepts one port), and then all captures combined and analyzed.
But I think that is a matter of a further upgrade of the sender and sniffer.
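The pacing trade-off described above can be sketched roughly as follows. This is a minimal stand-in, not the PR's actual sender: `send_fn` is a placeholder for something like scapy's `sendp`, and the rates quoted in the thread come from the author's measurements, not from this code.

```python
import time

def send_paced(packets, interval=0.0035, send_fn=None):
    """Send packets with a fixed inter-packet sleep to cap the rate.

    With interval=0.0035 the nominal target is ~285 pkt/s, but
    time.sleep granularity is platform-dependent (~5-10 ms), so the
    achieved rate ends up lower (~250 pkt/s in the thread above).
    Returns (packets sent, achieved rate in pkt/s).
    """
    sent = 0
    start = time.monotonic()
    for pkt in packets:
        if send_fn is not None:
            send_fn(pkt)          # e.g. scapy's sendp(pkt, iface=...)
        sent += 1
        time.sleep(interval)      # the inter-packet interval in question
    elapsed = time.monotonic() - start
    return sent, sent / elapsed
```

A sniffer that only has to keep up with ~250 pkt/s per direction is what keeps the total capture under scapy's practical limit discussed above.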
| self.sender_thr = threading.Thread(target = self.send_in_background) |
| self.sniff_thr = threading.Thread(target = self.sniff_in_background) |
| self.sniff_thr.start() |
| time.sleep(1) # Let the listener initialize completely. |
Probably it's better to use some sync primitive to be sure you run your sender when your sniffer is ready.
scapy.sniff is slow and unstable on start (it misses packets),
and stable capturing actually begins ~1 s after Python has called scapy.sniff.
I offloaded scapy.sniff() to a separate thread
and implemented the explicit time.sleep() inside sniff_in_background(), not in the main thread.
Now there is a wait event between the sender and the sniffer.
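The hand-off described here can be sketched with a `threading.Event`. This is a minimal illustration, not the PR's actual code: `sniff_in_background` below is a stub standing in for the real scapy sniffer, and the warm-up sleep is shortened.

```python
import threading
import time

def sniff_in_background(ready, results):
    # Stand-in for scapy.sniff(): signal readiness only after the
    # warm-up delay the capture start needs (~1 s in the real test).
    time.sleep(0.1)           # shortened stand-in for the warm-up
    ready.set()               # the sender may start now
    results.append("sniffing")

def run():
    ready = threading.Event()
    results = []
    sniffer = threading.Thread(target=sniff_in_background,
                               args=(ready, results))
    sniffer.start()
    # Block the sender until the sniffer reports it is capturing,
    # instead of an unconditional time.sleep(1) in the main thread.
    ready.wait(timeout=5)
    # ... start sending here ...
    sniffer.join()
    return results
```

The advantage over a fixed sleep is that the sender starts as soon as the sniffer is actually ready, no earlier and not much later.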
| # Pre-generate list of packets to be sent in send_in_background method. |
| generate_start = datetime.datetime.now() |
| self.generate_bidirectional() |
| self.log("%s packets are ready after: %s" % (len(self.packets_list), str(datetime.datetime.now() - generate_start))) |
Probably it's better to use %d instead of the first %s.
Improved a few more log messages.
| ip_src = self.from_server_src_addr,\ |
| ip_dst = self.from_server_dst_addr,\ |
| udp_sport = 1234,\ |
| udp_dport = 5000,\ |
You don't need '\' here.
Removed the backslashes.
I took your change for a quick test, and I got the following error: ok: [str-7260cx3-acs-2] => { Am I missing something? Can you make the matching change in the test script?
yxieca
left a comment
Please fix the test failure
Hi Ying,
Here is an example (from advanced-reboot.py, line 2) of how to call the test:
And here is how we call it from Ansible:
Made to 201811 branch on 1/25/2019
Description of PR
Summary: Support of warm-reboot for the advanced-reboot test.
Fixes # (issue)
Type of change
Approach
How did you do it?
Advanced reboot may now be called from an Ansible playbook as follows:
ansible-playbook ... -e testcase_name=[ warm-reboot | fast-reboot ] [ -e reboot_limit=1 ]
warm-reboot.yml and fast-reboot.yml now both call advanced-reboot.yml with the values:
Default values (if not given explicitly to ansible-playbook) are 30 for fast-reboot and 0 for warm-reboot (defined in fast-reboot.yml and warm-reboot.yml, respectively).
advanced-reboot.yml calls advanced-reboot.py with the above values (reboot_type, reboot_limit).
advanced-reboot.py now has a different flow for warm-reboot (selected from the previous fast-reboot flow by an if branch).
The warm-reboot flow is as follows:
Line 577: during the setUp phase, pre-generate a bidirectional list of packets (t1<->vlan) to be sent later in the background. IP address randomization is the same as in generate_from_t1.
Pre-generating beforehand saves time for the PTF docker later and lets it focus on fast sending of the ready packets.
The functionality is implemented at line 722.
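As a rough illustration of the pre-generation step, the sketch below builds a bidirectional descriptor list with randomized host octets. The function name mirrors `generate_bidirectional`, but the subnets and the tuple format are hypothetical; the real code in advanced-reboot.py builds actual scapy frames.

```python
import random

def generate_bidirectional(n_pairs, vlan_net="10.0.0.", t1_net="192.168.0."):
    """Pre-build a bidirectional (t1<->vlan) packet descriptor list.

    Hypothetical stand-in: each entry is (direction, src_ip, dst_ip)
    with a randomized last octet, so the send loop later only has to
    transmit ready-made entries instead of constructing packets.
    """
    packets = []
    for _ in range(n_pairs):
        vlan_host = vlan_net + str(random.randint(2, 254))
        t1_host = t1_net + str(random.randint(2, 254))
        packets.append(("t1->vlan", t1_host, vlan_host))   # inbound leg
        packets.append(("vlan->t1", vlan_host, t1_host))   # outbound leg
    return packets
```

Doing this once in setUp moves all the per-packet construction cost out of the timed send loop.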
RunTest phase:
The fast-reboot implementation has not been affected or changed.
One thread sends the pre-generated packet list (implemented at line 1032).
Another captures in/out traffic on the PTF and dumps it to a pcap file for later debugging (implemented at line 1047).
Identify the longest disruption (by continuous packet losses and duration).
Calculate the total packet losses and the total duration of all disruptions.
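The disruption analysis in the last two steps can be sketched as follows. This is a simplified model, not the PR's implementation: it assumes a sorted list of arrival timestamps for the received packets, and the gap threshold is illustrative.

```python
def longest_disruption(timestamps, gap_threshold=0.05):
    """Analyze gaps between consecutive received packets.

    timestamps: sorted arrival times (seconds) of received packets.
    Returns ((start, duration) of the longest gap exceeding
    gap_threshold, total disrupted time over all such gaps).
    """
    longest = (None, 0.0)
    total = 0.0
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > gap_threshold:
            total += gap               # accumulate every disruption
            if gap > longest[1]:
                longest = (prev, gap)  # remember the worst one
    return longest, total
```

With a known constant send rate, the packet loss for each disruption can then be estimated as duration times the rate.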
How did you verify/test it?
(to check if warm-reboot flow is suitable for fast reboot as well).
(to check if the test shows no disruptions when a reboot actually does not happen).
Any platform specific information?
Supported testbed topology if it's a new test case?
I ran the test only on T0 topology.
Documentation