Fix test_advanced_reboot: configurable control_plane_down_timeout + SSH thread cleanup on failure #2
Draft
…e and fix SSH thread cleanup on failure Co-authored-by: ravaliyel <227423972+ravaliyel@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Run advanced reboot test case with pytest" to "Fix test_advanced_reboot: configurable control_plane_down_timeout + SSH thread cleanup on failure" on Mar 15, 2026
ravaliyel pushed a commit that referenced this pull request on Mar 27, 2026

…kets not received on collector interface (sonic-net#22186)

* [sonic-mgmt] Fix sflow/test_sflow.py failures with expected sflow packets not received on collector interface

Issue #1: In some cases (such as sflow config enabled for the first time, or a device reboot), the hsflowd daemon takes a little over 3 minutes to fully initialize and process the collector config. During this window, the hsflowd service won't send sflow packets ('CounterSample', 'FlowSample', etc.) to the collector interface, so the test can fail with i) "Packets are not received in active collector, collector\d+" and ii) "Expected Number of samples are not collected from Interface Ethernet\d+ in collector collector\d+ , Received \d+". The hsflowd service writes to "/etc/hsflowd.auto" once it has processed the collector configuration, so waiting for collector info to be present in "/etc/hsflowd.auto" seems to be a safe option before proceeding with sflow traffic verification.

Issue #2: If the test expects flow samples/packets on the collector interface but they aren't seen for some reason, we hit "KeyError: 'flow_port_count'". Because counter samples are seen on the collector interface, "data['total_samples']" will not be zero, but "data['total_flow_count']" will be 0, which leads to a KeyError on access to "data['flow_port_count']". The fix is to assert on "total_flow_count" and "total_counter_count" before calling the corresponding sample analyze functions.

Signed-off-by: Vinod <vkjammala@arista.com>

* Addressing review comments
1) Enhanced "wait_until_hsflowd_ready" to make it wait for all the collector IPs (instead of calling it sequentially for each IP)
2) Add docstring for "wait_until_hsflowd_ready" function
3) Updated "ast.literal_eval" usage to handle the case where "active_collectors" is passed as empty string ("" instead of "[]")

Signed-off-by: Vinod <vkjammala@arista.com>

* Fix pre-commit check failures

Signed-off-by: Vinod <vkjammala@arista.com>

* Revert PR#21674 partially to enable "sflow/test_sflow.py" test

Signed-off-by: Vinod <vkjammala@arista.com>

---------

Signed-off-by: Vinod <vkjammala@arista.com>
`test_fast_reboot` fails with `TimeoutError: DUT hasn't shutdown in 600 seconds` because `control_plane_down_timeout` was hardcoded. The resulting exception also skipped sending `quit` to the neighbor SSH threads, leaving them blocked on `queue.get()` until SIGTERM killed the PTF process.

Changes

ptftests/py3/advanced-reboot.py
- Replace `self.control_plane_down_timeout = 600` with a test parameter (`check_param('control_plane_down_timeout', 600)`), preserving default behavior.
- In the `except` block of `runTest()`, send `quit` via `put_nowait` to all SSH threads to unblock them when the DUT timeout path is taken. Previously these threads hung indefinitely, since `handle_post_reboot_health_check()` (where `quit` is normally sent) was skipped.

tests/common/platform/args/advanced_reboot_args.py
- Add a `--control_plane_down_timeout` CLI option (default `600`) so slow-rebooting hardware can set a longer threshold.

tests/common/fixtures/advanced_reboot.py
- Read `--control_plane_down_timeout` and forward it as `control_plane_down_timeout` in the PTF params dict.
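The failure mode and the cleanup described above can be illustrated with a minimal, standalone sketch. It is not the code from `advanced-reboot.py`: `get_param`, `ssh_worker`, `run_test`, and `wait_for_dut_down` are hypothetical stand-ins, and the sketch only assumes that each SSH thread blocks on its own `queue.Queue` until it receives `quit`.

```python
import queue
import threading

DEFAULT_CONTROL_PLANE_DOWN_TIMEOUT = 600  # seconds; the default named in this PR


def get_param(params, name, default):
    """Hypothetical stand-in for check_param: use params[name] or fall back to default."""
    return int(params.get(name, default))


def ssh_worker(cmd_queue):
    """Stand-in for a neighbor SSH thread: block until 'quit' arrives."""
    while True:
        cmd = cmd_queue.get()  # blocks forever if nobody ever sends 'quit'
        if cmd == "quit":
            return


def run_test(params, wait_for_dut_down):
    """Run the wait step and make sure worker threads are released on both paths."""
    timeout = get_param(
        params, "control_plane_down_timeout", DEFAULT_CONTROL_PLANE_DOWN_TIMEOUT
    )
    queues = [queue.Queue() for _ in range(2)]
    threads = [
        threading.Thread(target=ssh_worker, args=(q,), daemon=True) for q in queues
    ]
    for t in threads:
        t.start()

    error = None
    try:
        wait_for_dut_down(timeout)
        # Normal path: stands in for the post-reboot health check sending 'quit'.
        for q in queues:
            q.put("quit")
    except TimeoutError as exc:
        error = exc
        # Failure path: the health check is skipped, so release the workers here.
        for q in queues:
            q.put_nowait("quit")

    for t in threads:
        t.join(timeout=5)
    return timeout, error, [t.is_alive() for t in threads]


def never_goes_down(timeout):
    """Simulate a DUT that does not shut down within the timeout."""
    raise TimeoutError(f"DUT hasn't shutdown in {timeout} seconds")


if __name__ == "__main__":
    print(run_test({}, never_goes_down))
    print(run_test({"control_plane_down_timeout": "900"}, lambda t: None))
```

Without the `put_nowait("quit")` calls in the `except` branch, both workers in this sketch would still be alive after the timeout, which mirrors the hang described in the summary. Whether the real `runTest()` re-raises after cleanup is not stated in the description, so the sketch simply returns the captured error.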