
Fix/nonlinear dataplane downtime #21936

Merged
bingwang-ms merged 6 commits into sonic-net:master from PriyanshTratiya:fix/nonlinear-dataplane-downtime
Jan 20, 2026


@PriyanshTratiya
Contributor

Description of PR

Summary:
Fixes # (issue)
This PR addresses non‑linear dataplane downtime behavior observed in high‑scale BGP IPv6 scenarios when running the port and session flapping tests. When the number of connections to flap doubled, the dataplane downtime increased by 450x.

This change refines the tests and helper logic to ensure that downtime measurements:

  • More accurately reflect real control‑plane and data‑plane outage intervals,
  • Scale more predictably with load and iterations, and
  • Avoid over‑counting or under‑counting downtime due to measurement artifacts and overlapping events.
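One way to see how far the reported figure is from linear scaling is to compute an empirical scaling exponent from the two ratios quoted in the summary. The following is a minimal sketch, not part of the PR; `scaling_exponent` is a hypothetical helper name introduced only for illustration.

```python
import math


def scaling_exponent(load_ratio: float, downtime_ratio: float) -> float:
    """Return k such that downtime_ratio == load_ratio ** k.

    k close to 1 means linear scaling; k much greater than 1 means the
    measurement grows far faster than the load does.
    """
    if load_ratio <= 0 or downtime_ratio <= 0 or load_ratio == 1:
        raise ValueError("ratios must be positive and load_ratio must not be 1")
    return math.log(downtime_ratio) / math.log(load_ratio)


# Ratios quoted in this PR: 2x the flapped connections, 450x the downtime.
observed = scaling_exponent(load_ratio=2, downtime_ratio=450)
print(f"observed exponent: {observed:.2f}")  # linear scaling would give 1.00
```

An exponent of roughly 8.8, against an expected value near 1, is consistent with a measurement artifact rather than a real change in convergence behavior.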

Type of change

  • [x] Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

While validating high‑scale BGP convergence, flap, and route‑programming tests, we observed that:

  • Dataplane downtime did not scale linearly with:
    • The number of flap iterations,
    • The number of routes or neighbors.

These issues were traced to the tests being executed sequentially while the PTF dataplane packet‑filtering/counter state was never cleared between runs. As a result, masks and counters kept accumulating, so each subsequent run, especially one with a larger number of ports to flap, saw an artificially inflated dataplane downtime.

In other words, the measured non‑linear increase in downtime was caused by PTF dataplane state rather than actual BGP control‑plane behavior. The goal of this PR is to:

  • Properly reset/clean relevant PTF dataplane state between runs,
  • Ensure that measured dataplane downtime reflects only the real BGP and data‑plane behavior,
  • Restore a linear and predictable relationship between test scale (routes/neighbors/iterations) and observed downtime.
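The accumulation effect described above can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyDataplane` and `run_once` are hypothetical stand-ins invented for this example and do not reproduce the PTF API or the actual test code. It only shows that, without a reset, the state a run sees depends on every earlier run and not just on its own scale.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataplane:
    """Hypothetical long-lived filter/counter state; NOT the PTF API."""
    masks: list = field(default_factory=list)
    matched: int = 0

    def add_mask(self, name: str) -> None:
        self.masks.append(name)

    def count(self, packets: int) -> None:
        self.matched += packets

    def clear(self) -> None:
        self.masks.clear()
        self.matched = 0


def run_once(dp: ToyDataplane, ports: int, packets_per_port: int, reset: bool) -> tuple:
    """Simulate one run; return (active masks, counter value) seen by that run."""
    if reset:
        dp.clear()
    for port in range(ports):
        dp.add_mask(f"port-{port}")
        dp.count(packets_per_port)
    return len(dp.masks), dp.matched


# Three sequential runs at increasing scale, sharing one dataplane object.
stale = ToyDataplane()
without_reset = [run_once(stale, ports, 10, reset=False) for ports in (2, 4, 8)]

clean = ToyDataplane()
with_reset = [run_once(clean, ports, 10, reset=True) for ports in (2, 4, 8)]

print(without_reset)  # state grows with history as well as with scale
print(with_reset)     # state tracks only the current run's scale
```

With the reset in place, the mask count and counter value are proportional to the number of ports in the current run, which is the property the PR aims to restore for downtime measurements.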

How did you do it?

  • Added logic to explicitly clear PTF dataplane state between runs, including:
    • Flushing or re‑initializing PTF packet filters used for counting traffic to the prefixes under test.
    • Resetting relevant PTF counters so that each run starts with a clean environment.
  • Updated the test flow so that:
    • Each scale/iteration configuration first ensures PTF dataplane state is clean before starting flaps and dataplane measurements.
    • Dataplane downtime is computed only from counters and observations collected within the current run, avoiding any contamination from previous runs.
  • Adjusted/factored helper utilities (where appropriate) so that the PTF cleanup is:
    • Centralized and reusable across the convergence, flap, and route‑programming tests,
    • Invoked consistently whenever a new test scenario or iteration is started.
  • Enhanced logging around:
    • When PTF dataplane state is cleared,
    • Per‑iteration dataplane downtime measurements after the fix, so it is easy to verify that:
      • Counters are reset when expected, and
      • The resulting downtime scales linearly with the number of ports/routes/iterations, reflecting actual BGP and dataplane behavior.
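The centralized cleanup and per-iteration flow listed above might look roughly like the sketch below. All names here (`StubDataplane`, `clear_filters`, `reset_counters`, `clean_dataplane_state`, `run_iteration`) are assumptions made for illustration; the real helpers and the PTF calls they wrap are defined in the PR diff, which is not reproduced here.

```python
import logging

logger = logging.getLogger("dataplane_cleanup")


class StubDataplane:
    """Hypothetical object exposing the two cleanup hooks assumed here."""

    def __init__(self):
        self.filters = ["stale-filter"]
        self.counters = {"rx": 123, "tx": 456}

    def clear_filters(self):
        self.filters.clear()

    def reset_counters(self):
        for key in self.counters:
            self.counters[key] = 0


def clean_dataplane_state(dataplane, reason: str) -> None:
    """Single entry point so every test clears state the same way."""
    dataplane.clear_filters()
    dataplane.reset_counters()
    logger.info("Cleared dataplane state (%s)", reason)


def run_iteration(dataplane, label: str, body):
    """Clean first, then run `body` so it only sees current-run counters."""
    clean_dataplane_state(dataplane, f"before {label}")
    result = body(dataplane)
    logger.info("%s: measured %s", label, result)
    return result


dp = StubDataplane()
# The lambda stands in for the flap-and-measure step of a real test.
lost = run_iteration(dp, "flap 4 ports", lambda d: d.counters["tx"] - d.counters["rx"])
```

Routing every scenario through one function keeps the cleanup consistent across the convergence, flap, and route‑programming tests, and the log lines make it easy to confirm when the reset happened.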

How did you verify/test it?

  • Re‑ran the high‑scale BGP convergence, flap, and route‑programming tests with the fixes applied:
    • Topology: t0-isolated-d2u510s2
    • Platform: Broadcom Arista-7060X6-64PE-B-C512S2
  • Verified that:
    • Measured downtime per iteration is stable and scales predictably with load and iteration count.
    • Spurious spikes caused by measurement artifacts are eliminated: measured downtime now stays in the millisecond range, compared with tens of seconds previously.
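A check for "scales predictably" could be automated along the following lines. This is a minimal sketch, assuming per-iteration results are available as `(scale, downtime_seconds)` pairs; `is_roughly_linear` and its `tolerance` parameter are hypothetical and are not taken from the PR.

```python
def is_roughly_linear(samples, tolerance=0.25):
    """Return True when downtime per unit of scale stays near the first sample.

    samples: iterable of (scale, downtime_seconds) pairs with scale > 0.
    tolerance: allowed relative deviation from the first per-unit value.
    """
    per_unit = [downtime / scale for scale, downtime in samples]
    if not per_unit:
        raise ValueError("need at least one sample")
    baseline = per_unit[0]
    if baseline == 0:
        # A zero baseline only counts as linear if every sample is zero.
        return all(value == 0 for value in per_unit)
    return all(abs(value - baseline) / baseline <= tolerance for value in per_unit)
```

A test could collect one pair per scale step and assert on the result, so that a regression in dataplane state cleanup shows up as a failed linearity check rather than as an unexplained spike.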

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

PriyanshTratiya requested review from r12f and removed request for cyw233, lolyu and sanjair-git January 14, 2026 14:38
PriyanshTratiya marked this pull request as ready for review January 14, 2026 17:50
@r12f
Collaborator

r12f commented Jan 15, 2026

straightforward change. will sign off, once comments are addressed.

@r12f
Collaborator

r12f commented Jan 29, 2026

Picking to 202412: Azure/sonic-mgmt.msft#982

@mssonicbld
Collaborator

@PriyanshTratiya PR conflicts with 202511 branch

ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 6, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Anirudh-nokia pushed a commit to Anirudh-nokia/sonic-mgmt-fork that referenced this pull request Feb 6, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: ayya <anirudh.ayya@nokia.com>
@weiguo-nvidia
Contributor

Hi @PriyanshTratiya

Could you help cherry-pick the fix to 202511 branch? Thanks

PriyanshTratiya added a commit to PriyanshTratiya/sonic-mgmt that referenced this pull request Feb 9, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Feb 12, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: nnelluri-cisco <nnelluri@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Feb 13, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
PriyanshTratiya added a commit to PriyanshTratiya/sonic-mgmt that referenced this pull request Feb 14, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
@PriyanshTratiya
Contributor Author

@PriyanshTratiya PR conflicts with 202511 branch

#22419

vmittal-msft pushed a commit that referenced this pull request Feb 18, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: Zhuohui Tan <zhuohui.tan@amd.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>
Signed-off-by: Abhishek <abhishek@nexthop.ai>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026
* ptf dataplane cleaners for in between test runs

Signed-off-by: Priyansh Tratiya <ptratiya@microsoft.com>

6 participants