Fix flakiness in pfcwd/test_pfcwd_cli.py #19969

Merged
StormLiangMS merged 8 commits into sonic-net:master from vivekverma-arista:fix-test-pfcwd-cli
Aug 12, 2025

@vivekverma-arista
Contributor

Description of PR

Summary:
Fixes #714, #18496

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

Recent fix: #17411

The test was flaky before that fix and remains flaky. When the test picks an egress interface that is a member of a multi-member LAG, only that member is stormed. Some of the traffic still egresses through the other LAG members, so fewer packets are dropped than expected when PFCWD is triggered with the DROP action. The earlier fix shut down all but one LAG member and reduced min_links to match. The same configuration was missing on the cEOS neighbor, however, so the LAG does not come up after the other LAG members are shut down.

This is being rectified in this change for cEOS neighbors.
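
The drop shortfall described above can be illustrated with a minimal sketch. It assumes a simplified LAG that spreads flows evenly across active members using a modulo on a flow ID; real LAG hashing is platform specific, and the function and interface names here are hypothetical, not taken from the test.

```python
from typing import List


def flows_hitting_stormed_member(num_flows: int, members: List[str], stormed: str) -> int:
    """Count flows that egress through the stormed LAG member.

    Assumption: flows are spread evenly with a modulo on the flow ID.
    Real LAG hashing is platform specific.
    """
    if stormed not in members:
        raise ValueError(f"{stormed} is not an active LAG member")
    return sum(
        1 for flow_id in range(num_flows)
        if members[flow_id % len(members)] == stormed
    )


# Four active members: only a quarter of the flows reach the stormed port,
# so PFCWD with the DROP action drops far fewer packets than the test expects.
four_members = ["Ethernet0", "Ethernet4", "Ethernet8", "Ethernet12"]
print(flows_hitting_stormed_member(1000, four_members, "Ethernet0"))  # 250

# One active member: every flow reaches the stormed port.
print(flows_hitting_stormed_member(1000, ["Ethernet0"], "Ethernet0"))  # 1000
```

With a single active member the drop count matches the offered traffic, which is why the test needs the LAG to stay up with only one member on both ends of the link.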

How did you do it?

The fix also changes the min_links setting of the affected port channel on the cEOS neighbor, so that both ends of the LAG accept a single active member.
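
As a rough sketch of the neighbor-side change, the helper below builds the EOS configuration lines that set min-links on a port channel. It assumes the EOS interface command `port-channel min-links`; the helper name and the port channel name are hypothetical, and the PR's actual implementation may apply the setting differently.

```python
from typing import List


def eos_min_links_commands(port_channel: str, min_links: int) -> List[str]:
    """Build EOS CLI lines that set min-links on one port channel.

    Hypothetical helper for illustration; not the code from this PR.
    """
    if min_links < 0:
        raise ValueError("min_links must be non-negative")
    return [
        "configure",
        f"interface {port_channel}",
        f"port-channel min-links {min_links}",
        "end",
    ]


# Allow the LAG to come up with a single active member.
for line in eos_min_links_commands("Port-Channel1", 1):
    print(line)
```

The original min-links value would need to be restored during test teardown so that later tests see the neighbor's normal configuration.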

How did you verify/test it?

Ran the test 10 times each on dualtor-120 and t0-116 topologies with the Arista 7260CX3 platform.
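
A repeated run like this can be scripted. The sketch below is one possible approach, not the procedure used for this PR: it runs an arbitrary command a fixed number of times and counts the failures. `EXAMPLE_COMMAND` is a placeholder; a real sonic-mgmt run needs testbed and inventory options that are omitted here.

```python
import subprocess
import sys
from typing import List

# Placeholder command (assumption): real runs need testbed-specific options.
EXAMPLE_COMMAND = [sys.executable, "-m", "pytest", "pfcwd/test_pfcwd_cli.py"]


def stress(command: List[str], runs: int = 10) -> int:
    """Run `command` `runs` times and return the number of failed runs."""
    failures = 0
    for attempt in range(1, runs + 1):
        result = subprocess.run(command)
        if result.returncode != 0:
            failures += 1
            print(f"run {attempt}/{runs} failed with exit code {result.returncode}")
    return failures
```

Calling `stress(EXAMPLE_COMMAND, runs=10)` returns 0 only if every run passes, which makes an intermittent failure easy to spot.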

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lipxu lipxu requested a review from StormLiangMS August 1, 2025 06:40
StormLiangMS previously approved these changes Aug 4, 2025
Collaborator

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS StormLiangMS reopened this Aug 7, 2025
Collaborator

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS StormLiangMS merged commit f5dbe21 into sonic-net:master Aug 12, 2025
20 checks passed
@mssonicbld
Collaborator

@vivekverma-arista PR conflicts with 202411 branch

@mssonicbld
Collaborator

@vivekverma-arista PR conflicts with 202505 branch

@vivekverma-arista vivekverma-arista deleted the fix-test-pfcwd-cli branch August 14, 2025 06:11
@vivekverma-arista
Contributor Author

202505 cherry pick: #20247

@vivekverma-arista
Contributor Author

202411 cherry pick: #20248

ashutosh-agrawal pushed a commit to ashutosh-agrawal/sonic-mgmt that referenced this pull request Aug 14, 2025
vidyac86 pushed a commit to vidyac86/sonic-mgmt that referenced this pull request Oct 23, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
Signed-off-by: opcoder0 <110003254+opcoder0@users.noreply.github.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
Signed-off-by: Aharon Malkin <amalkin@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026
5 participants