Handle stale transient FEC errors in test_verify_fec_stats_counters #22733
kewei-arista wants to merge 2 commits into sonic-net:master
Signed-off-by: kewei <kewei@arista.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Hi @kewei-arista, thanks for the detailed fix! I noticed this PR overlaps with #22829 (longhuan-cisco); both address the same root issue: stale/historical FEC uncorrectable errors causing test failures.

A few issues I found in this PR:

Bug 1: NameError in the except branch

    try:
        fec_uncorr_int = int(fec_uncorr)
    except ValueError:
        pytest.fail("... fec_uncorr: {}".format(intf_name, fec_uncorr_int))  # NameError!

Should use fec_uncorr, the raw value, in the failure message, because fec_uncorr_int is never assigned when int() raises.

Bug 2: Global state never cleared, which breaks multi-HWSKU testbeds
This silently validates the wrong DUT's counters.

Design concern: Cross-test state sharing
The tests share global mutable state.

Comparison with #22829: PR #22829 takes a simpler approach (clear counters + wait 60s) that avoids these issues. However, this PR has the advantage of also optimizing the histogram test polling. Perhaps the histogram optimization could be a separate follow-up PR once the bugs above are addressed?
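One possible way to address the two bugs raised in the review is sketched below. This is not the code from either PR: `parse_fec_uncorr` and `FecCounterCache` are hypothetical names introduced only for illustration, and the sketch raises `ValueError` where the real test would presumably call `pytest.fail`. The first function formats the raw string in the error path, so no unassigned variable is referenced. The cache is keyed by DUT hostname (an assumption about how the shared state could be organized), so samples polled on one DUT cannot be read back for another.

```python
def parse_fec_uncorr(intf_name, fec_uncorr):
    """Return the counter as an int, or raise ValueError that quotes the raw value."""
    try:
        return int(fec_uncorr)
    except ValueError:
        # Format the raw string here: the int variable was never assigned
        # because int() raised before the assignment completed.
        raise ValueError(
            "FEC uncorrectable counter is not an integer for interface {}, "
            "fec_uncorr: {}".format(intf_name, fec_uncorr)
        )


class FecCounterCache:
    """Hypothetical per-DUT cache of polled samples (illustration only)."""

    def __init__(self):
        self._samples = {}

    def store(self, dut_hostname, samples):
        self._samples[dut_hostname] = samples

    def get(self, dut_hostname):
        # None means this DUT has not been polled yet; the caller should poll.
        return self._samples.get(dut_hostname)

    def clear(self):
        # Call between test sessions so no stale samples leak across runs.
        self._samples.clear()
```

Keying by hostname keeps the single-poll optimization while removing the risk that a second DUT is validated against the first DUT's counters.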
@kewei-arista can we use this PR? #22829 |
Description of PR
test_verify_fec_stats_counters has a similar issue to test_verify_fec_histogram, which was fixed in #21685: it may fail on a stable link that has stale transient FEC errors.

This change follows the same approach as #21685 to fix test_verify_fec_stats_counters: it polls the FEC counters of an interface over a longer period to determine whether the link has been stable, and does not fail the test for a stable link with stale transient FEC errors.

This change also optimizes the code path that polls FEC counters so that polling happens only once for both test_verify_fec_stats_counters and test_verify_fec_histogram, which saves polling time and speeds up the whole test.

This change also addresses #21685 (comment), so the polling time is now driven by a test attribute.
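The polling idea described above can be illustrated with a minimal sketch. It is not the code in this PR: `poll_counter`, `is_stale`, and `is_growing` are hypothetical helpers, and the reader callback, duration, and interval are assumptions. The sketch samples a counter over a window and treats a nonzero value that does not grow during the window as stale rather than as an active error.

```python
import time


def poll_counter(read_counter, duration_s, interval_s, sleep=time.sleep):
    """Sample read_counter() every interval_s seconds for about duration_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    samples = [read_counter()]
    elapsed = 0
    while elapsed < duration_s:
        sleep(interval_s)
        elapsed += interval_s
        samples.append(read_counter())
    return samples


def is_stale(samples):
    """True when the counter is nonzero but did not grow during the window."""
    return samples[0] > 0 and samples[-1] == samples[0]


def is_growing(samples):
    """True when the counter increased during the window (link still erroring)."""
    return samples[-1] > samples[0]
```

Because the samples are returned as a list, the same polled data could be handed to both tests, which matches the intent of polling only once. The `sleep` parameter is injectable so the logic can be exercised without real waiting.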
Type of change
Back port request
Approach
What is the motivation for this PR?
Improve the test pass rate by handling the corner case of a stable link that has stale transient FEC errors.
How did you do it?
Poll the FEC counters for long enough to confirm that the links are actually stable.
How did you verify/test it?
Confirmed that the test now passes on a stable link that has stale transient FEC symbol errors.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation