Conversation
Signed-off-by: marian-pritsak <marianp@mellanox.com>
There was a problem hiding this comment.
Have some general comments about the function test:
- When testing the detection time and recovery time, I would suggest setting each one to both a small value (e.g. 100ms) and a large value (e.g. 20s). For phase 1, setting only a small value should be fine.
- Can we also test that the other priority classes are not affected during the recovery? I.e., their packets should not be dropped.
- Can we also test that SONiC doesn't crash when pfcwd is enabled on all ports and there is a PFC storm on all ports? (No need to do the traffic test.)
| ``` | ||
| ansible-playbook test_sonic.yml -i inventory --limit {DUT_NAME} --tags pfc_wd --extra-vars "testbed_type={TESTBED_TYPE}" | ||
| ``` | ||
| PFC WD test assumes that Fanout switch has [PFC generator](https://github.com/marian-pritsak/pfctest/blob/master/pfctest.py) available. |
There was a problem hiding this comment.
I don't think it is a good assumption. Shall we check-in the script and copy it to fanout during the test?
There was a problem hiding this comment.
It should be part of the fanout deploy. As I have no access to Arista fanout switches, I have to make this assumption until the script is added to the deploy.
There was a problem hiding this comment.
So we don't need to copy the script to mlnx fanouts beforehand?
There was a problem hiding this comment.
We don't need to do it on every test run, only during deploy.
| @@ -0,0 +1,19 @@ | |||
| - set_fact: | |||
There was a problem hiding this comment.
Consider selecting the port using the minigraph_fact output?
There was a problem hiding this comment.
agree, ports should be taken from minigraph
There was a problem hiding this comment.
They are ports, not IP interfaces like VLANs or LAGs and they never change. This kind of assumption is made in all tests.
There was a problem hiding this comment.
Using minigraph to get port information is the common approach in our tests. The benefit is that it hides the details of the test topology and the port naming scheme.
There was a problem hiding this comment.
I need to use only ports here so that I know which port packets will be routed through. Regarding names, there's a convention to name them that way (in templates, tests, etc.). Would it be OK if I choose ports from the list of ports that are IP interfaces (minigraph_interfaces)?
There was a problem hiding this comment.
choosing ports from minigraph_interfaces won't work for lag topo. I am fine with the current method if it's not easy to choose from minigraph.
| - name: Set timers | ||
| set_fact: | ||
| pfc_wd_poll_time: 200 |
There was a problem hiding this comment.
rename to detection time?
| - name: Set timers | ||
| set_fact: | ||
| pfc_wd_poll_time: 200 | ||
| pfc_wd_penalty: 30000 |
There was a problem hiding this comment.
Rename to restoration time? Setting it to 30s looks too large to me. Can we also test the case where it is set to a relatively small value, e.g. 200ms?
There was a problem hiding this comment.
Renaming is nice, although I see no issue with the current names.
For testing the normal flow we should use the values provided by Azure production.
For testing different flows (if any) we can use any value.
What is the purpose of changing 30000 to 200ms? This is not the case in production.
There was a problem hiding this comment.
I would prefer to rename, as it will be easier for others to read when the variable names exactly match their meaning. For the timer setting, my suggestion is to test both a large value and a small value. When the time is set in the millisecond range, it poses a harder challenge to our system than a large value does.
BTW, according to the shared requirement doc, the recovery time should be <200ms. Let me know if I have missed something.
There was a problem hiding this comment.
I will rename them. As to requirements, the recovery time here is set to 30 seconds to verify that traffic is forwarded/dropped while the queue is in the stormed state. I will use a smaller value for the timer test.
| - router_mac='{{ansible_Ethernet0['macaddress']}}' | ||
| - queue_index='{{pfc_queue_index}}' | ||
| - pkt_count='10000' | ||
| - port_src='{{pfc_wd_test_port_id}}' |
There was a problem hiding this comment.
Is port_src pfc_wd_rx_port_id or test_port_id? I think you are going to send the data packets to rx_port_id?
There was a problem hiding this comment.
It's the other way. Rx port on ptf docker is Tx port on DUT.
There was a problem hiding this comment.
I am lost here. Could you elaborate a little bit more? I was thinking the packet goes to rx_port on the DUT and the DUT will try to forward the packet to the test_port neighbor via test_port. If this is the case, why does src_port equal test_port? Let me know if I am missing something.
There was a problem hiding this comment.
You're correct. I made a mistake in the port indexes. It's rx_port instead of test_port.
| - name: Verify that real restoration time is not less than configured | ||
| fail: | ||
| msg: Real restoration time is less than configured X2 | ||
| when: "{{(restore_time_list | sum) / 3 < pfc_wd_penalty }}" |
There was a problem hiding this comment.
Should fail if not in the range [restoration_time, restoration_time + polling_time].
There was a problem hiding this comment.
Added check of both bounds
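The two-sided check requested above can be sketched as follows. This is a minimal illustration, not the playbook's actual logic; the function name and arguments are hypothetical, and the 30000 ms / 200 ms values are taken from the timers shown earlier in this diff.

```python
def restoration_time_in_range(measured_ms, restoration_ms, polling_ms):
    """Return True when the measured restoration time lies in
    [restoration_ms, restoration_ms + polling_ms] (both bounds inclusive).

    All names here are hypothetical; they only mirror the reviewer's wording.
    """
    return restoration_ms <= measured_ms <= restoration_ms + polling_ms


# Values from the diff: 30000 ms restoration time, 200 ms polling interval.
ok = restoration_time_in_range(30100, 30000, 200)
too_early = restoration_time_in_range(29999, 30000, 200)
too_late = restoration_time_in_range(30201, 30000, 200)
```

A measurement below the configured time means the queue was restored too early; one above the upper bound means the watchdog took more than one extra polling interval to react.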
| - name: Calculate detection and restoration timings | ||
| include: roles/test/tasks/pfc_wd/functional_test/timer_test.yml | ||
| with_sequence: start=1 end=3 |
There was a problem hiding this comment.
3 iterations might be too few. Consider setting a variable n for the number of iterations? Can we also check the maximum and minimum? And I just want to understand why we need to take the average instead of requiring every run to pass. What might cause random errors?
There was a problem hiding this comment.
3 iterations is what is usually used to verify such stuff. Why add a configurable parameter that will not be used?
There was a problem hiding this comment.
I doubt 3 is enough if we are calculating the average. In my opinion, using "set_fact" to set an iteration count variable is more intuitive than hard-coding start=1 end=3.
There was a problem hiding this comment.
The PFC generator is the weak link. I cannot guarantee that it is 100% reliable. But we can come up with other criteria if you have a suggestion.
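To make the average-versus-every-run question concrete, here is a small sketch with made-up sample values (they are not measurements from this test). It shows how a mean can fall inside the allowed window while individual samples do not. The helper names are hypothetical.

```python
from statistics import mean


def summarize_samples(samples_ms):
    """Return min, max and mean of a non-empty list of timings in ms."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    return {"min": min(samples_ms), "max": max(samples_ms), "mean": mean(samples_ms)}


def all_samples_within(samples_ms, lower_ms, upper_ms):
    """Return True only if every single sample lies in [lower_ms, upper_ms]."""
    return all(lower_ms <= s <= upper_ms for s in samples_ms)


# Illustrative values only: the mean is inside [30000, 30200],
# but the first and last samples are outside it.
samples = [29000, 30100, 31000]
summary = summarize_samples(samples)
mean_ok = 30000 <= summary["mean"] <= 30200
every_run_ok = all_samples_within(samples, 30000, 30200)
```

With only three iterations a single bad run from an unreliable generator shifts the mean considerably, which is one argument for making the iteration count a variable and also reporting the minimum and maximum.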
| shell: "date -d {{storm_detect.stdout.replace(' ',' ').split(' ')[2]}} +'%s%3N'" | ||
| register: storm_detect_millis | ||
| - name: Find PFC storm end marker |
| - block: | ||
| - name: Apply forward config to {{ pfc_wd_test_port }}. | ||
| vars: | ||
| config_file: pfc_wd_fwd_action.json |
There was a problem hiding this comment.
Can we apply drop action here? It is our key feature.
There was a problem hiding this comment.
Both the drop and fwd actions should be verified.
If we have no other case for drop, we should add one so that both options are tested.
There was a problem hiding this comment.
It's a timer test. We don't get as far as applying the actual action here; we only verify the time difference between the moment the fanout starts sending PFC frames and the moment the WD detects it. For verifying that traffic is dropped / forwarded while a queue is being mitigated against received PFC frames, there are separate tests above in this file.
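The time difference described above can be sketched like this. It assumes, hypothetically, that both syslog markers carry a time-of-day field with fractional seconds (`HH:MM:SS.ffffff`) and fall on the same day; the function names are illustrative and are not taken from the playbook.

```python
from datetime import datetime


def time_of_day_to_millis(stamp):
    """Convert 'HH:MM:SS.ffffff' to milliseconds since midnight."""
    t = datetime.strptime(stamp, "%H:%M:%S.%f")
    seconds = (t.hour * 60 + t.minute) * 60 + t.second
    return seconds * 1000 + t.microsecond // 1000


def detection_time_ms(storm_start, storm_detect):
    """Milliseconds between the storm start marker and the detection marker.

    Assumes both timestamps belong to the same day (no midnight rollover).
    """
    return time_of_day_to_millis(storm_detect) - time_of_day_to_millis(storm_start)


elapsed = detection_time_ms("12:00:01.000000", "12:00:01.215000")
```

The result is what gets compared against the configured detection time; the same subtraction applies to the restoration markers.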
| @@ -0,0 +1,5 @@ | |||
| enable | |||
| configure terminal | |||
| docker exec scapy "./pfcgen.py --p{{pfc_queue_index}} --q{{pfc_queue_index}}=65535 -i {{pfc_frames_number}} -d {{pfc_fanout_interface | replace("ernet 1/", "")}} -r {{ansible_eth0['ipv4']['address']}}" | |||
There was a problem hiding this comment.
Is this the script we are going to use? I thought we will use pfctest.py. Where is the code to send syslog packets?
There was a problem hiding this comment.
What do you mean by code to send syslog packets? I am confused by the syslog and packets part.
There was a problem hiding this comment.
We are supposed to send syslog messages along with the PFC frames to validate the timers, per our discussion, right? Just wondering where that logic is.
There was a problem hiding this comment.
Thanks for the clarification. Since this template is fanout-hwsku dependent, can we check the hwsku when using it?
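For readers unfamiliar with what the generator puts on the wire, the sketch below builds a single PFC frame following the standard IEEE 802.1Qbb layout (MAC control EtherType 0x8808, opcode 0x0101, a class-enable vector, and eight pause-time fields). It is not the code of pfcgen.py or pfctest.py; the function name and the source MAC are hypothetical, and the 65535 quanta value mirrors the `=65535` argument in the template above.

```python
import struct

PFC_DEST_MAC = bytes.fromhex("0180c2000001")  # reserved MAC control address
MAC_CONTROL_ETHERTYPE = 0x8808
PFC_OPCODE = 0x0101
MIN_FRAME_LEN = 60  # minimum Ethernet frame length without FCS


def build_pfc_frame(src_mac, queue_index, quanta=65535):
    """Build one PFC frame that pauses a single priority queue."""
    if len(src_mac) != 6:
        raise ValueError("src_mac must be 6 bytes")
    if not 0 <= queue_index <= 7:
        raise ValueError("queue_index must be in 0..7")
    if not 0 <= quanta <= 0xFFFF:
        raise ValueError("quanta must fit in 16 bits")
    class_enable = 1 << queue_index      # one bit per priority
    pause_times = [0] * 8                # one 16-bit timer per priority
    pause_times[queue_index] = quanta
    payload = struct.pack("!HHH8H", MAC_CONTROL_ETHERTYPE, PFC_OPCODE,
                          class_enable, *pause_times)
    frame = PFC_DEST_MAC + src_mac + payload
    return frame.ljust(MIN_FRAME_LEN, b"\x00")  # zero padding


# Hypothetical locally administered source MAC, queue 3.
frame = build_pfc_frame(bytes.fromhex("020000000001"), 3)
```

Sending such frames continuously on one priority is what produces the storm that the watchdog is expected to detect.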
sihuihan88
left a comment
There was a problem hiding this comment.
It might be better to put the .yml and .json files in separate folders.
Signed-off-by: marian-pritsak <marianp@mellanox.com>
| - pfc_wd_timer.json | ||
| - pfc_wd_del_action.json | ||
| - set_fact: pfc_wd_test_neighbor_addr="{{ pfc_wd_test_port_addr | regex_replace('(^.*\.).*$', '\\1') }}{{pfc_wd_test_port_addr.split('.')[3] | int + 1 }}" |
There was a problem hiding this comment.
This information can also be obtained from minigraph facts (minigraph_bgp).
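For reference, the set_fact above derives the neighbor address by adding one to the last octet of the port address. A rough Python equivalent, assuming (as the template does) that the peer is simply the next address, might look like this; the function name is hypothetical.

```python
import ipaddress


def neighbor_address(port_addr):
    """Return the address immediately after port_addr as a string.

    Assumption: the neighbor is the next IPv4 address, as in the template.
    Unlike the template, this carries over when the last octet is 255.
    """
    return str(ipaddress.ip_address(port_addr) + 1)


peer = neighbor_address("10.0.0.56")
```

Taking the peer address from minigraph_bgp, as suggested, avoids relying on this addressing assumption at all.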
| args: | ||
| host: "{{peer_device}}" | ||
| login: "{{peer_login}}" | ||
| connection: switch |
There was a problem hiding this comment.
pfc_storm.j2 will depend on the fanout hwsku. Can we select the template based on the fanout sku?
Signed-off-by: marian-pritsak <marianp@mellanox.com>
| register: errors_found | ||
| - name: Check if loganalyzer missed expected messages | ||
| shell: grep "TOTAL EXPECTED MISSING MATCHES" "{{ test_out_dir }}/{{ summary_file }}" | sed -n "s/TOTAL EXPECTED MISSING MATCHES:[[:space:]]*//p" |
There was a problem hiding this comment.
Why grep for this phrase? It seems it is missing anyway, so it will fail the test case, right?
There was a problem hiding this comment.
It's not missing. The loganalyzer summary file contains the total number of expected matches that are missing from the logs. We should fail if any expected match was not found.
There was a problem hiding this comment.
I checked the code: https://github.com/Azure/sonic-mgmt/blob/master/ansible/roles/test/files/tools/loganalyzer/loganalyzer.py#L526
It seems you should grep for “TOTAL EXPECTED MATCHES”, right?
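The grep/sed pipeline under discussion extracts the number that follows the phrase in the summary file. A minimal Python sketch of the same extraction is shown below. The summary text used here is a made-up example whose layout is inferred only from the sed expression in the diff, and the function name is hypothetical.

```python
import re

_MISSING_RE = re.compile(r"TOTAL EXPECTED MISSING MATCHES:\s*(\d+)")


def expected_missing_matches(summary_text):
    """Return the count after 'TOTAL EXPECTED MISSING MATCHES:', or None."""
    match = _MISSING_RE.search(summary_text)
    return int(match.group(1)) if match else None


# Hypothetical summary contents, for illustration only.
sample_summary = "TOTAL MATCHES: 0\nTOTAL EXPECTED MISSING MATCHES:   2\n"
missing = expected_missing_matches(sample_summary)
should_fail = missing is None or missing > 0
```

Treating an absent line as a failure keeps the test from silently passing when the summary file does not contain the phrase at all.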