[202205] Improve the stability of fib/decap test #10416

Merged
wangxin merged 1 commit into sonic-net:202205 from congh-nvidia:fib_202205
Oct 30, 2023
Conversation

@congh-nvidia
Contributor

Description of PR

Summary:
This is a manual cherry-pick of PR #10415, done by hand to resolve conflicts.

The fib/decap tests occasionally fail on our testbeds because a random packet is not received by the ptf. After debugging, we found that there can be a large delay (max 0.3s observed on our testbed) between the packet being sent by the ptf src port and being received by the ptf dst port.
We suspect the delay is introduced by lag on the server side, because we see this failure mostly on the t1-lag/t1-lag-64 topologies.
In the fib hash test, it also looks like the packet is sometimes lost entirely on the server (we have checked that there was no packet loss on the dut or the fanout).
We have found a way to prevent this failure:

  1. Add a timeout to the ptf verify_packet_any_port call. With the timeout, the ptf adapter keeps polling the dataplane until the timeout elapses.
  2. Decrease BALANCING_TEST_TIMES in the fib_hash test. With this change, we no longer see any packet loss in the hash test, and 250 times is enough for the test. (Without this change, we still observed packet loss even with the timeout set to 5s.)
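The effect of item 1 can be illustrated with a small, self-contained sketch. It is not the ptf implementation: `FakeDataplane` and `wait_for_packet_on_any_port` are hypothetical names invented for this example, and the 0.3s delay simply mirrors the worst case reported above. The point is only that a check which gives up quickly misses a delayed packet, while one that keeps polling until a deadline still sees it.

```python
import time


class FakeDataplane:
    """Hypothetical stand-in for a dataplane: a packet becomes visible only after a delay."""

    def __init__(self):
        self._pending = []  # list of (ready_time, port, packet)

    def send(self, port, packet, delay):
        # Schedule the packet to "arrive" on `port` after `delay` seconds.
        self._pending.append((time.monotonic() + delay, port, packet))

    def poll_once(self):
        """Return (port, packet) for one packet that has already arrived, or None."""
        now = time.monotonic()
        for index, (ready_time, port, packet) in enumerate(self._pending):
            if ready_time <= now:
                del self._pending[index]
                return port, packet
        return None


def wait_for_packet_on_any_port(dataplane, expected, ports, timeout, interval=0.01):
    """Poll until `expected` arrives on one of `ports`, or `timeout` seconds elapse.

    Returns the receiving port, or None if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = dataplane.poll_once()
        if result is not None:
            port, packet = result
            if port in ports and packet == expected:
                return port
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Usage: a packet delayed by 0.3s is missed with a short timeout
# and found once the caller is willing to wait longer.
dataplane = FakeDataplane()
dataplane.send(port=3, packet=b"probe", delay=0.3)
missed = wait_for_packet_on_any_port(dataplane, b"probe", ports=[2, 3], timeout=0.05)
found = wait_for_packet_on_any_port(dataplane, b"probe", ports=[2, 3], timeout=1.0)
```

In the real tests the timeout is passed to the ptf verification call rather than implemented by hand; the sketch only shows why waiting longer than the observed delay makes the check stable.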

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205
  • 202305

Approach

What is the motivation for this PR?

To fix the occasional packet loss failures in the fib/decap tests.

How did you do it?

See the summary above.

How did you verify/test it?

We have run the regression with this fix for one week, and the failure is no longer seen.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

1. Add a timeout to the packet verification, because there can sometimes be a large delay (max 0.3s on our testbed) between the packet arriving at the hypervisor and being received by the ptf adapter.
2. Decrease BALANCING_TEST_TIMES in the fib_hash test. If too many packets are sent in the test, there can be packet loss in the hypervisor server. 250 times is enough for the test.
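To make the trade-off in item 2 concrete, here is a minimal sketch of how a balancing check might relate to BALANCING_TEST_TIMES. Everything except the value 250 is an assumption made for illustration: the helper names, the idea that the packet count scales with the number of next-hop members, the 25% tolerance, and the hit counts are not taken from the fib_hash test itself.

```python
BALANCING_TEST_TIMES = 250  # value from this PR
ASSUMED_TOLERANCE = 0.25    # assumption: allowed relative deviation per member


def packets_to_send(balancing_test_times, member_count):
    """Assumed scaling: total packets grow with the number of next-hop members."""
    return balancing_test_times * member_count


def is_balanced(hit_counts, tolerance):
    """True if every member's hit count is within `tolerance` of the mean."""
    expected = sum(hit_counts.values()) / len(hit_counts)
    return all(
        abs(count - expected) <= tolerance * expected
        for count in hit_counts.values()
    )


# Made-up hit counts for a four-member group, for illustration only.
even_counts = {"port0": 262, "port1": 241, "port2": 255, "port3": 242}
skewed_counts = {"port0": 400, "port1": 200, "port2": 200, "port3": 200}

total_packets = packets_to_send(BALANCING_TEST_TIMES, member_count=4)
even_ok = is_balanced(even_counts, ASSUMED_TOLERANCE)
skewed_ok = is_balanced(skewed_counts, ASSUMED_TOLERANCE)
```

Under these assumptions, fewer iterations mean fewer packets pushed through the server per test, while each member still receives enough packets for the per-member comparison to be meaningful.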

3 participants