Exit pytest with error code 16 if ptfhost is unreachable #20539
wangxin merged 3 commits into sonic-net:master
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
except BaseException as e:
    logger.error("Failed to copy files to ptfhost.")
    request.config.cache.set("ptfhost_unreachable", True)
    pt_assert(False, "!!! ptfhost unreachable !!! Exception: {}".format(repr(e)))
How do you know the Exception is definitely PTF unreachable?
@wangxin Most of the time, an unreachable PTF is what causes the file copy to fail, but you are right; I changed the wording to "exception".
Please review it again, thanks.
@wangxin Thank you for your suggestion.
It turns out pytest_ansible.errors.AnsibleConnectionFailure works, but ansible.errors.AnsibleConnectionFailure doesn't work.
Correct:
from pytest_ansible.errors import AnsibleConnectionFailure
Wrong:
from ansible.errors import AnsibleConnectionFailure
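One plausible reading of the observation above is that the two import paths name distinct, unrelated classes, so an `except` clause written against one never matches an exception raised as the other. The sketch below illustrates that mechanism only; both classes are stand-ins defined locally (not the real `pytest_ansible` or `ansible` classes), and `copy_to_unreachable_host` and `classify` are hypothetical helpers.

```python
class PluginConnectionFailure(Exception):
    """Stand-in for pytest_ansible.errors.AnsibleConnectionFailure."""


class CoreConnectionFailure(Exception):
    """Stand-in for ansible.errors.AnsibleConnectionFailure."""


def copy_to_unreachable_host():
    # Hypothetical helper: pretends the plugin layer reports an unreachable host.
    raise PluginConnectionFailure("Host unreachable")


def classify(expected_exc):
    """Return 'caught' if the targeted except clause matched, else 'missed'."""
    try:
        copy_to_unreachable_host()
    except expected_exc:
        return "caught"
    except Exception:
        # The exception was raised, but not as the class we were watching for.
        return "missed"
```

Calling `classify(PluginConnectionFailure)` takes the targeted branch, while `classify(CoreConnectionFailure)` falls through to the generic handler, which matches the "works" versus "doesn't work" behavior reported above under the stated assumption.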
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)
try:
    ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)
except BaseException as e:
Only AnsibleConnectionFailure means that the PTF is unreachable. It is better to catch AnsibleConnectionFailure specifically here and set "ptfhost_exception" to True. Other exceptions could indicate different issues and should not be treated as PTF unreachable.
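A minimal sketch of the narrower handling suggested in this comment, under several assumptions: `AnsibleConnectionFailure` is a local stand-in for the `pytest_ansible` class, `FakeCache` imitates only the `get`/`set` calls of pytest's `config.cache`, and `copy_and_flag`, `unreachable_copy`, and `broken_copy` are hypothetical names. The sketch re-raises instead of calling `pt_assert`.

```python
class AnsibleConnectionFailure(Exception):
    """Stand-in for pytest_ansible.errors.AnsibleConnectionFailure."""


class FakeCache:
    """Minimal stand-in for pytest's config.cache (get/set only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default):
        return self._data.get(key, default)


def copy_and_flag(copy_fn, cache):
    """Run copy_fn; flag the cache only when the host is unreachable."""
    try:
        copy_fn()
    except AnsibleConnectionFailure:
        # Only a connection failure is treated as "PTF unreachable".
        cache.set("ptfhost_exception", True)
        raise
    # Any other exception propagates without touching the flag.


def unreachable_copy():
    raise AnsibleConnectionFailure("Host unreachable")


def broken_copy():
    raise OSError("source file missing")
```

With this shape, `copy_and_flag(unreachable_copy, cache)` sets the flag before re-raising, while `copy_and_flag(broken_copy, cache)` surfaces the `OSError` as an ordinary failure and leaves the flag unset.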
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
…0539)

What is the motivation for this PR?
On a dualtor testbed, very early in setup, pytest runs the run_icmp_responder_session fixture. If the PTF is unreachable, the script does not know about it and still calls ptfhost.copy to copy a file from the local host to ptfhost. With this PR, the script catches this exception and exits pytest early. There is no need to run more cases on an unhealthy testbed, which wastes time and uploads many noisy failed test results.
In ElasticTest, if ptfhost is unreachable on one testbed, the case fails on that testbed and ElasticTest picks another testbed to run it, which generates many flaky results. It is better to exit pytest early so that the testbed is kicked out and no further flaky results are generated.
A similar PR was filed before: sonic-net#10243

How did you do it?
Catch the exception in run_icmp_responder_session, which is the first fixture to fail when the PTF becomes unreachable. Set session.exitstatus to 16 and make run_test.sh aware of this failure so that the pipeline exits early.

How did you verify/test it?
Used run_test.sh to test when the PTF is unreachable.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
|
Cherry-pick PR to 202505: #20601 |
What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before #10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Aharon Malkin <amalkin@nvidia.com>
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Yael Tzur <ytzur@nvidia.com>
When a DUT is unreachable, session-scoped fixtures like add_mgmt_test_mark fail with AnsibleConnectionFailure. But unlike the duthosts fixture (PR sonic-net#10243 which catches BaseException and exits with code 15), add_mgmt_test_mark has no error handling. The exception propagates as a generic test error (exit code 1) and the scheduler does not remove the bad testbed. This commit applies the same pattern as PR sonic-net#10243: 1. Wrap add_mgmt_test_mark in try/except BaseException, setting the duthosts_fixture_failed cache flag so pytest exits with code 15. This reuses the existing mechanism that run_tests.sh already checks. 2. Extend the ptfhost exception handling in run_icmp_responder_session (PR sonic-net#20539) to also cover the remaining ptfhost.copy() and ptfhost.shell() calls that were left unprotected. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
When a DUT is unreachable, session-scoped fixtures like add_mgmt_test_mark fail with AnsibleConnectionFailure. But unlike the duthosts fixture (PR sonic-net#10243 which catches BaseException and exits with code 15), add_mgmt_test_mark has no error handling. The exception propagates as a generic test error (exit code 1) and the scheduler does not remove the bad testbed. This commit applies the same pattern as PR sonic-net#10243: 1. Wrap add_mgmt_test_mark in try/except BaseException, setting the duthosts_fixture_failed cache flag so pytest exits with code 15. 2. Extend the ptfhost exception handling in run_icmp_responder_session (PR sonic-net#20539) to also cover the remaining ptfhost.copy() and ptfhost.shell() calls that were left unprotected. 3. Wrap do_checks() in the post-test sanity check phase with try/except so that if the DUT becomes unreachable during a test, the post-check crash still sets post_sanity_check_failed (exit code 11) instead of silently propagating as a generic error. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR sonic-net#20539 was silently ignored. Changes: 1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR sonic-net#10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC. 2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests. 3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
…15) (#23549)

* Fix DUT/PTF unreachable not causing early exit and testbed removal

When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR #20539 was silently ignored.

Changes:
1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR #10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC.
2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests.
3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>

* Merge pytest_sessionfinish hooks for DUT and PTF unreachable

Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish hooks checking different cache keys (duthosts_fixture_failed and ptfhost_exception). Since both now use the same exit code 15 (HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py and remove the duplicate from ptfhost_utils.py.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
…15) (sonic-net#23549) * Fix DUT/PTF unreachable not causing early exit and testbed removal When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR sonic-net#20539 was silently ignored. Changes: 1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR sonic-net#10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC. 2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests. 3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> * Merge pytest_sessionfinish hooks for DUT and PTF unreachable Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish hooks checking different cache keys (duthosts_fixture_failed and ptfhost_exception). Since both now use the same exit code 15 (HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py and remove the duplicate from ptfhost_utils.py. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: mssonicbld <sonicbld@microsoft.com>
…15) (sonic-net#23549) * Fix DUT/PTF unreachable not causing early exit and testbed removal When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR sonic-net#20539 was silently ignored. Changes: 1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR sonic-net#10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC. 2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests. 3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> * Merge pytest_sessionfinish hooks for DUT and PTF unreachable Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish hooks checking different cache keys (duthosts_fixture_failed and ptfhost_exception). Since both now use the same exit code 15 (HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py and remove the duplicate from ptfhost_utils.py. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: opcoder0 <110003254+opcoder0@users.noreply.github.com>
…al (unify exit code 15) (#23619)

### Description of PR

Summary: Fix DUT and PTF host unreachable errors not causing testbed removal from test plan.

**Root cause:** ElasticTest only kicks testbeds for exit codes `[10, 11, 12, 15]` (defined in `test_plan_constants.py`). The PTF unreachable exit code 16 from PR #20539 is NOT in this list, so it was silently ignored. And `add_mgmt_test_mark` had no error handling at all: when the DUT is unreachable but `duthosts` init succeeded from cached facts, the `AnsibleConnectionFailure` propagated as exit code 1.

**Real-world impact:**
- DUT unreachable: `testbed-bjw2-can-t0-7260-2` in testplan `69c736202878597161a7a636`, where 117 tests failed across 2 modules
- PTF unreachable: `vmsvc1-dual-t0-7050-1` in testplan `69ae61e01ba706f7e92aa4f5`, where the testbed kept getting assigned tests despite PTF being down

### Type of change
- [x] Bug fix

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [ ] 202511

### Approach

#### What is the motivation for this PR?
PR #10243 catches `BaseException` in `duthosts` fixture init → exit code 15 → testbed kicked out. But when `duthosts` succeeds from cached facts, the first SSH attempt in `add_mgmt_test_mark` fails without any handler.

PR #20539 catches PTF `AnsibleConnectionFailure` → exit code 16. But ElasticTest only checks `[10, 11, 12, 15]`, so exit code 16 is ignored and the testbed stays in the test plan.

#### How did you do it?
1. **`tests/conftest.py`**: Catch `AnsibleConnectionFailure` in `add_mgmt_test_mark`, set `duthosts_fixture_failed` cache flag → exit code 15 (same pattern as PR #10243). Rename `DUTHOSTS_FIXTURE_FAILED_RC` to `HOST_FIXTURE_FAILED_RC` to reflect it covers both DUT and PTF hosts.
2. **`tests/common/fixtures/ptfhost_utils.py`**: Change `PTFHOST_EXCEPTION_RC` from 16 to 15 (renamed to `HOST_FIXTURE_FAILED_RC`) so ElasticTest recognizes it and kicks the testbed out.
3. **`tests/common/plugins/sanity_check/__init__.py`**: Wrap `do_checks()` in post-test sanity check with try/except so DUT unreachable during a test sets `post_sanity_check_failed` (exit code 11) instead of propagating as exit code 1.

#### How did you verify/test it?
- Analyzed real failure logs from ElasticTest for both DUT and PTF unreachable scenarios
- Verified ElasticTest scheduler code confirms exit code 15 is in `DUTHOST_UNREACHABLE_RET_CODES`
- Confirmed `run_tests.sh` already handles exit code 15

#### Any platform specific information?
N/A

#### Supported testbed topology if it's a new test case?
N/A

### Documentation
N/A

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
Co-authored-by: Zhaohui Sun <94606222+ZhaohuiS@users.noreply.github.com>
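The root cause described above comes down to a membership check on the scheduler side. The sketch below is an illustration only: the list values and the name `DUTHOST_UNREACHABLE_RET_CODES` are taken from this description, while `should_kick_testbed` is a hypothetical function and not ElasticTest's actual code.

```python
# Values as quoted in the PR description; the real definition lives in
# ElasticTest's test_plan_constants.py, which is not reproduced here.
DUTHOST_UNREACHABLE_RET_CODES = [10, 11, 12, 15]


def should_kick_testbed(pytest_exit_code):
    """Hypothetical check: does this exit code remove the testbed?"""
    return pytest_exit_code in DUTHOST_UNREACHABLE_RET_CODES


for rc in (1, 15, 16):
    print(rc, should_kick_testbed(rc))
```

Exit code 16 fails the membership test just like the generic exit code 1, which is why the testbed kept receiving tests; switching the PTF case to 15 makes the same check succeed without any scheduler change.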
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…15) (sonic-net#23549) * Fix DUT/PTF unreachable not causing early exit and testbed removal When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR sonic-net#20539 was silently ignored. Changes: 1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR sonic-net#10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC. 2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests. 3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> * Merge pytest_sessionfinish hooks for DUT and PTF unreachable Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish hooks checking different cache keys (duthosts_fixture_failed and ptfhost_exception). Since both now use the same exit code 15 (HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py and remove the duplicate from ptfhost_utils.py. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
…0539) What is the motivation for this PR? On dualtor testbed, in very early setup, it will try to fixture run_icmp_responder_session, if ptf is unreachable, the script doesn't know about it and still use ptfhost.copy to copy file from local to pfthost. In this PR, the script will capture this exception and ensure to exit pytest early, no need to run any more cases on this unhealthy testbed, which wastes time and also avoids uploading many noise failed test results. In ElasticTest, if ptfhost unreachable on one testbed, case failed on this testbed, and will pick up another testbed to run, it will generate many flaky results. It's better to exit pytest early and this testbed will be kicked out and no more other flaky results generated. Similar PR was filed before sonic-net#10243 How did you do it? Capture exception in run_icmp_responder_session , when ptf becomes unreachable, this is the first failed fixture. set session.exitstatus to 16 and make run_test.sh aware of this failure and exit pipeline early. How did you verify/test it? use run_test.sh to test when ptf is unreachable. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Sudarshan Kumar (from Dev Box) <sudakumar@microsoft.com>
…15) (sonic-net#23549) * Fix DUT/PTF unreachable not causing early exit and testbed removal When a DUT or PTF host is unreachable, the test framework should exit with an error code that the scheduler (ElasticTest) recognizes so it can remove the bad testbed from the test plan. ElasticTest checks for exit codes [10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from PR sonic-net#20539 was silently ignored. Changes: 1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark fixture and set duthosts_fixture_failed flag (exit code 15), following the same pattern as PR sonic-net#10243 for duthosts fixture failures. Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC. 2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15 (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it and kicks the testbed out instead of assigning more tests. 3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity check with try/except so that if the DUT becomes unreachable during a test, the post-check crash sets post_sanity_check_failed (exit code 11) instead of propagating as exit code 1. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> * Merge pytest_sessionfinish hooks for DUT and PTF unreachable Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish hooks checking different cache keys (duthosts_fixture_failed and ptfhost_exception). Since both now use the same exit code 15 (HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py and remove the duplicate from ptfhost_utils.py. Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com> Signed-off-by: Sudarshan Kumar (from Dev Box) <sudakumar@microsoft.com>
Description of PR
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
On a dualtor testbed, very early in setup, pytest runs the run_icmp_responder_session fixture. If the PTF is unreachable, the script does not know about it and still calls ptfhost.copy to copy a file from the local host to ptfhost. With this PR, the script catches this exception and exits pytest early. There is no need to run more cases on an unhealthy testbed, which wastes time and uploads many noisy failed test results.
In ElasticTest, if ptfhost is unreachable on one testbed, the case fails on that testbed and ElasticTest picks another testbed to run it, which generates many flaky results. It is better to exit pytest early so that the testbed is kicked out and no further flaky results are generated.
A similar PR was filed before: #10243
Test log before:
Test log after:
How did you do it?
Catch the exception in run_icmp_responder_session, which is the first fixture to fail when the PTF becomes unreachable. Set session.exitstatus to 16 and make run_test.sh aware of this failure so that the pipeline exits early.
How did you verify/test it?
Used run_test.sh to test when the PTF is unreachable.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
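The exit-status step described under "How did you do it?" can be sketched as a `pytest_sessionfinish` hook that reads the flag set by the fixture. This is a sketch under stated assumptions, not the merged code: the cache key `ptfhost_exception` and the constant name `PTFHOST_EXCEPTION_RC` appear elsewhere on this page, but `FakeCache` and `make_session` are hypothetical stand-ins so the example runs without pytest, and clearing the flag after reading it is an assumption.

```python
from types import SimpleNamespace

PTFHOST_EXCEPTION_RC = 16


class FakeCache:
    """Minimal stand-in for pytest's config.cache (get/set only)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default):
        return self._data.get(key, default)


def pytest_sessionfinish(session, exitstatus):
    """Override the exit status when a fixture flagged the PTF host."""
    if session.config.cache.get("ptfhost_exception", None):
        # Assumption of this sketch: clear the flag so a stale value in the
        # on-disk cache cannot leak into the next run.
        session.config.cache.set("ptfhost_exception", None)
        session.exitstatus = PTFHOST_EXCEPTION_RC


def make_session(flagged):
    # Hypothetical helper that builds a session-like object for the demo.
    cache = FakeCache()
    if flagged:
        cache.set("ptfhost_exception", True)
    return SimpleNamespace(config=SimpleNamespace(cache=cache), exitstatus=1)
```

A wrapper script can then branch on the process exit code: a flagged session ends with 16, while an unflagged one keeps whatever status pytest computed.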