Skip to content

[ warm-reboot ] adapted advanced-reboot for warm-reboot #776

Merged
liat-grozovik merged 9 commits intosonic-net:masterfrom
romankachur-mlnx:master
Jan 24, 2019
Merged

[ warm-reboot ] adapted advanced-reboot for warm-reboot #776
liat-grozovik merged 9 commits intosonic-net:masterfrom
romankachur-mlnx:master

Conversation

@romankachur-mlnx
Copy link
Copy Markdown
Contributor

Description of PR

Summary: Support of warm-reboot for the advanced-reboot test.
Fixes # (issue)

Type of change

  • [] Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

How did you do it?

Advanced reboot now may be called from ansible playbook as next:
ansible-playbook ... -e testcase_name=[ warm-reboot | fast-reboot ] [ -e reboot_limit=1 ]

warm-reboot.yml and fast-reboot.yml now both call advanced-reboot.yml with values:

  • reboot_type (fast-reboot or warm-reboot, defined in fast-reboot.yml and warm-reboot.yml, accordingly).
  • reboot_limit (time in seconds, which defines the limit in which the chosen reboot must fit),
    Default value (if not given explicitly by ansible-playbook) are: 30 for fast_reboot, 0 for warm-reboot (defined in fast-reboot.yml and warm-reboot.yml, accordingly).

advanced-reboot.yml calls advanced-reboot.py with the above values (reboot_type, reboot_limit).

advanced-reboot.py for the warm-reboot has now different flow (switched from the previous fast-reboot flow by if).

The flow of warm-reboot is next:
577: During the setUp phase, pre-generate a bidirectional list of packets (t1<->vlan) to be sent later in background. IP address randomization is the same as in generate_from_t1.
The beforehand pre-generation will save a time for the PTF docker later, and focus it on fast sending of ready packets.
Functionality implemented in line 722.

RunTest phase:

  1. 823: Switch the generic flow to fast-reboot only.
    Fast-reboot implementation has not been affected neither changed.
  2. 856: Switch the generic flow to warm-reboot:
  • Stop the watcher (it was used to detect Control Plane down, and not needed further in warm-reboot flow).
  • Start two background threads for 180 seconds:
    One sending pre-generated packet list (implementation in line 1032).
    Another - captures in/out traffic on PTF, and dump it to a pcap file for later debugging (implementation in line 1047).
  • Examine the captured traffic (implementation in line 1105):
    Identify the longest (by continuous packets losses and duration) disruption.
    Calculate total packets looses and total duration of all disruptions.
  1. 875: back to generic flow.

How did you verify/test it?

  1. I ran the playbook with -e testcase_name=warm-reboot
  2. I ran the playbook with -e testcase_name=fast-reboot
  3. I ran the playbook with -e testcase_name=fast-reboot being directed to warm-reboot flow
    (to check if warm-reboot flow is suitable for fast reboot as well).
  4. I ran the playbook with -e testcase_name=warm-reboot being redirected to the temporary flow without reboot
    (to check if the test shows no disruptions when a reboot actually does not happen).

Any platform specific information?

Supported testbed topology if it's a new test case?

I ran the test only on T0 topology.

Documentation

@romankachur-mlnx romankachur-mlnx changed the title warm-reboot test [ warm-reboot ] adapted advanced-reboot for warm-reboot Jan 16, 2019
Copy link
Copy Markdown
Contributor

@pavel-shirshov pavel-shirshov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but please fix my comments

sender_start = datetime.datetime.now()
self.log("Sender started at %s" % str(sender_start))
for entry in packets_list:
time.sleep(interval)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd remove this. python is slow enough

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sleep interval is quite small (0.0035s), I suspect you won't get that granularity. Plus as Pavel mentioned, you send rate might be already slower that that. Curious to know if you have checked the sending rate without sleep?

Copy link
Copy Markdown
Contributor Author

@romankachur-mlnx romankachur-mlnx Jan 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python is slow, but is able to send ~500 packets/second.
Its good for precision, but not for scapy.

The advisable time to detect warm-reboot disrupt is ~180 sec.
In this case:
180 x 500 x 2 = 180 000 packets (in and out).

Besides, after the warm-reboot the swtich becomes flat and floods packets for ~10 seconds out of 32 ports:

  • 10 x 500 x 30 = 150 000 packets more.
    Total 330 000 packets in capture.

In my builds, the biggest capture which scapy.sniff produced, was 200 000 packets.

Having added inter-packet interval, I reduced packet speed to ~250 packets/second
180 x 250 x 2 = 90 000 (in and out).

  • 10 x 250 x 30 = 75 000 flooded packets
    Total 165 000 packets in capture.

I added inter-packet interval to slow down the speed, sacrificing precision to ~5-10ms).
It made capture twice less.

To overcome this case I would consider another approach:
Make the Sniffer sniff only on selected 5 ports (1 vlan + 4 Lag), it will get rid of floods.
But in this case Sniff must be split to 5 threads (as scapy.sniff accepts one port). Then combine all captures and analize them.

But I think its a matter of further upgrade of sender and sniffer.

self.sender_thr = threading.Thread(target = self.send_in_background)
self.sniff_thr = threading.Thread(target = self.sniff_in_background)
self.sniff_thr.start()
time.sleep(1) # Let the listener initialize completely.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably it's better to use some sync. primitive to be sure you run your sender, when your sniffer is ready.

Copy link
Copy Markdown
Contributor Author

@romankachur-mlnx romankachur-mlnx Jan 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scapy.sniff is slow and unstable on start (it misses packets),
and the stable capturing actually happens ~1s later after python has called scapy.sniff.

I offloaded scapy.sniff() to a separate thread,
and implemented explicit time.sleep() inside the sniff_in_background(), not in the main thread.

Now there is a waiter event between sender and sniffer.

# Pre-generate list of packets to be sent in send_in_background method.
generate_start = datetime.datetime.now()
self.generate_bidirectional()
self.log("%s packets are ready after: %s" % (len(self.packets_list), str(datetime.datetime.now() - generate_start)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it's better to use %d, instead of the first %s.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Imroved few more loggings.

ip_src = self.from_server_src_addr,\
ip_dst = self.from_server_dst_addr,\
udp_sport = 1234,\
udp_dport = 5000,\
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need '' here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed slashes.

Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a great direction you are moving towards. Solves quite a few issues I was pondering! :-)

Thanks,
Ying

yxieca
yxieca previously approved these changes Jan 17, 2019
@yxieca
Copy link
Copy Markdown
Collaborator

yxieca commented Jan 17, 2019

I get your change for a quick test. and I have following error:

ok: [str-7260cx3-acs-2] => {
"out.stdout_lines": [
"WARNING: No route found for IPv6 destination :: (no default route?)",
"Traceback (most recent call last):",
" File "/usr/bin/ptf", line 580, in ",
" test_suite.append(test())",
" File "ptftests/advanced-reboot.py", line 416, in init",
" self.check_param('dut_username', '', required = True)",
" File "ptftests/advanced-reboot.py", line 502, in check_param",
" raise Exception("Test parameter '%s' is required" % param)",
"Exception: Test parameter 'dut_username' is required"
]
}

Am I missing something? Can you make the matching change in test script?

@yxieca yxieca dismissed their stale review January 17, 2019 23:25

test failure observed

Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the test failure

@romankachur-mlnx
Copy link
Copy Markdown
Contributor Author

Hi Ying,
I see that the error you got, references to the required parameter being passed to the test:
dut_username
Its interesting that is always has been mandatory.
And I see that no changes were done lately at that point.

Here is an example (from advanced-reboot.py, line 2) of how to call the test:

#ptf --test-dir ptftests fast-reboot --qlen=1000 --platform remote -t 'verbose=True;dut_username="admin";dut_hostname="10.0.0.243";reboot_limit_in_seconds=30;portchannel_ports_file="/tmp/portchannel_interfaces.json";vlan_ports_file="/tmp/vlan_interfaces.json";ports_file="/tmp/ports.json";dut_mac="4c:76:25:f5:48:80";default_ip_range="192.168.0.0/16";vlan_ip_range="172.0.0.0/22";arista_vms="["10.0.0.200","10.0.0.201","10.0.0.202","10.0.0.203"]"' --platform-dir ptftests --disable-vxlan --disable-geneve --disable-erspan --disable-mpls --disable-nvgre

And here is how we call it from Ansible:

- include: ptf_runner.yml
  vars:
    ptf_test_name: Advanced-reboot test
    ptf_test_dir: ptftests
    ptf_test_path: advanced-reboot.ReloadTest
    ptf_platform: remote
    ptf_platform_dir: ptftests
    ptf_qlen: 1000
    ptf_test_params:
    - verbose=False
    - dut_username=\"{{ ansible_ssh_user }}\"
    - dut_hostname=\"{{ ansible_host }}\"
    - reboot_limit_in_seconds={{ reboot_limit }}
    - reboot_type=\"{{ reboot_type }}\"
    - portchannel_ports_file=\"/tmp/portchannel_interfaces.json\"
    - vlan_ports_file=\"/tmp/vlan_interfaces.json\"
    - ports_file=\"/tmp/ports.json\"
    - dut_mac='{{ ansible_Ethernet0['macaddress'] }}'
    - dut_vlan_ip='192.168.0.1'
    - default_ip_range='192.168.0.0/16'
    - vlan_ip_range=\"{{ minigraph_vlan_interfaces[0]['subnet'] }}\"
    - lo_v6_prefix=\"{{ minigraph_lo_interfaces | map(attribute='addr') | ipv6 | first | ipsubnet(64) }}\"
    - arista_vms=\"['{{ vm_hosts | list | join("','") }}']\"

@liat-grozovik liat-grozovik merged commit e6299ed into sonic-net:master Jan 24, 2019
@yxieca
Copy link
Copy Markdown
Collaborator

yxieca commented Jan 25, 2019

Made to 201811 branch on 1/25/2019

kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…sonic-net#2679)

Submodule src/sonic-utilities f95da07..2fe01fe:
  > neighbor advertiser script (sonic-net#469)
  > [aclshow] restore PRIO column and sort entries by priority (sonic-net#476)
  > Update watermark default polling interval to 10s (sonic-net#470)
  > show interface status <interface-name> throws error (fixes sonic-net#427) (sonic-net#440)

Submodule src/sonic-swss 90eb25d..91171b6:
  > fix a unstable swss egress acl test (sonic-net#776)
  > [aclorch] Remove  L4 port range support limitation on egress ACL table and add new SWSS virtual test. (sonic-net#741)
  > Fix orchagent SEGV when PortConfigDone not set (sonic-net#803)

Submodule src/sonic-swss-common 2592b0c..5f4abd9:
  > Force only supported commands on consumer table (sonic-net#261)
  > Add multiple fields hdel support (sonic-net#267)


Signed-off-by: Ying Xie <ying.xie@microsoft.com>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
af0d084 2021-02-08 [sairedis] Add get response timeout knob (sonic-net#776)

Signed-off-by: liora <liora@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants