Exit pytest with error code 16 if ptfhost is unreachable #20539

Merged
wangxin merged 3 commits into sonic-net:master from ZhaohuiS:ZhaohuiS/ptf_unreachable
Sep 9, 2025

Conversation

@ZhaohuiS
Contributor

@ZhaohuiS ZhaohuiS commented Sep 5, 2025

Description of PR

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

On a dualtor testbed, the fixture run_icmp_responder_session runs very early in setup. If ptf is unreachable, the script does not know about it and still calls ptfhost.copy to copy a file from the local machine to ptfhost.
With this PR, the script catches the resulting exception and makes pytest exit early. There is no need to run further cases on an unhealthy testbed: doing so wastes time and uploads many noisy failed test results.
In ElasticTest, if ptfhost is unreachable on one testbed, the case fails there and another testbed is picked to rerun it, which generates many flaky results. It is better to exit pytest early so that the testbed is kicked out and no further flaky results are generated.

A similar PR was filed before: #10243

Test log before:

____________ ERROR at setup of test_ecn_during_encap_on_standby[6] _____________

duthosts = [<MultiAsicSonicHost str3-8101c1-05>, <MultiAsicSonicHost str3-8101c1-06>]
duthost = <MultiAsicSonicHost str3-8101c1-05>
ptfhost = <tests.common.devices.ptf.PTFHost object at 0x7f94d040e8b0>
tbinfo = {'auto_recover': 'True', 'comment': 'yawenni', 'conf-name': 'vms66-dual-t0-8101c1-03', 'duts': ['str3-8101c1-05', 'str3-8101c1-06'], ...}

    @pytest.fixture(scope="session", autouse=True)
    def run_icmp_responder_session(duthosts, duthost, ptfhost, tbinfo):
        """Run icmp_responder on ptfhost session-wise on dualtor testbeds with active-active ports."""
        # No vlan is available on non-t0 testbed, so skip this fixture
        if "dualtor-mixed" not in tbinfo["topo"]["name"] and "dualtor-aa" not in tbinfo["topo"]["name"]:
            logger.info("Skip running icmp_responder at session level, "
                        "it is only for dualtor testbed with active-active mux ports.")
            yield
            return
    
        global icmp_responder_session_started
    
        update_linkmgrd_probe_interval(duthosts, tbinfo, PROBER_INTERVAL_MS)
        duthosts.shell("config save -y")
    
        duthost = duthosts[0]
        logger.debug("Copy icmp_responder.py to ptfhost '{0}'".format(ptfhost.hostname))
>       ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)

duthost    = <MultiAsicSonicHost str3-8101c1-05>
duthosts   = [<MultiAsicSonicHost str3-8101c1-05>, <MultiAsicSonicHost str3-8101c1-06>]
ptfhost    = <tests.common.devices.ptf.PTFHost object at 0x7f94d040e8b0>
tbinfo     = {'auto_recover': 'True', 'comment': 'yawenni', 'conf-name': 'vms66-dual-t0-8101c1-03', 'duts': ['str3-8101c1-05', 'str3-8101c1-06'], ...}

common/fixtures/ptfhost_utils.py:322: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
common/devices/base.py:105: in _run
    res = self.module(*module_args, **complex_args)[self.hostname]
        complex_args = {'dest': '/opt', 'src': 'scripts/icmp_responder.py'}
        filename   = '/var/src/sonic-mgmt_vms66-dual-t0-8101c1-03/tests/common/fixtures/ptfhost_utils.py'
        function_name = 'run_icmp_responder_session'
        index      = 0
        line_number = 322
        lines      = ['    ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)\n']
        module_args = []
        module_async = False
        module_ignore_errors = False
        previous_frame = <frame at 0x11df64e0, file '/var/src/sonic-mgmt_vms66-dual-t0-8101c1-03/tests/common/fixtures/ptfhost_utils.py', line 322, code run_icmp_responder_session>
        self       = <tests.common.devices.ptf.PTFHost object at 0x7f94d040e8b0>
        verbose    = True
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pytest_ansible.module_dispatcher.v213.ModuleDispatcherV213 object at 0x7f94cb842ee0>
module_args = ()
complex_args = {'dest': '/opt', 'src': 'scripts/icmp_responder.py'}
hosts = [vms66-7], extra_hosts = [], no_hosts = False
args = ['pytest-ansible', 'vms66-7', '--connection=smart', '--become', '--become-method=sudo', '--become-user=root', ...]
verbosity = None, verbosity_syntax = '-vvvvv', argument = 'module-path'
arg_value = ['/var/src/sonic-mgmt_vms66-dual-t0-8101c1-03/ansible/library']
callback = <pytest_ansible.module_dispatcher.v213.ResultAccumulator object at 0x7f94cb997850>

    def _run(self, *module_args, **complex_args):
        """Execute an ansible adhoc command returning the result in a AdhocResult object."""
        # Assemble module argument string
        if module_args:
            complex_args.update({"_raw_params": " ".join(module_args)})
    
        # Assert hosts matching the provided pattern exist
        hosts = self.options["inventory_manager"].list_hosts()
        if "extra_inventory_manager" in self.options:
            extra_hosts = self.options["extra_inventory_manager"].list_hosts()
        else:
            extra_hosts = []
        no_hosts = False
        if len(hosts + extra_hosts) == 0:
            no_hosts = True
            warnings.warn("provided hosts list is empty, only localhost is available")
    
        self.options["inventory_manager"].subset(self.options.get("subset"))
        hosts = self.options["inventory_manager"].list_hosts(
            self.options["host_pattern"],
        )
        if "extra_inventory_manager" in self.options:
            self.options["extra_inventory_manager"].subset(self.options.get("subset"))
            extra_hosts = self.options["extra_inventory_manager"].list_hosts()
        else:
            extra_hosts = []
        if len(hosts + extra_hosts) == 0 and not no_hosts:
            raise ansible.errors.AnsibleError(
                "Specified hosts and/or --limit does not match any hosts.",
            )
    
        # Pass along cli options
        args = ["pytest-ansible"]
        verbosity = None
DEBUG:tests.conftest:[log_custom_msg] item: <Function test_ecn_during_encap_on_standby[6]>
INFO:root:Can not get Allure report URL. Please check logs
        for verbosity_syntax in ("-v", "-vv", "-vvv", "-vvvv", "-vvvvv"):
            if verbosity_syntax in sys.argv:
                verbosity = verbosity_syntax
                break
        if verbosity is not None:
            args.append(verbosity_syntax)
        args.extend([self.options["host_pattern"]])
        for argument in (
            "connection",
            "user",
            "become",
            "become_method",
            "become_user",
            "module_path",
        ):
            arg_value = self.options.get(argument)
            argument = argument.replace("_", "-")
    
            if arg_value in (None, False):
                continue
    
            if arg_value is True:
                args.append(f"--{argument}")
            else:
                args.append(f"--{argument}={arg_value}")
    
        # Use Ansible's own adhoc cli to parse the fake command line we created and then save it
        # into Ansible's global context
        adhoc = AdHocCLI(args)
        adhoc.parse()
    
        # And now we'll never speak of this again
        del adhoc
    
        # Initialize callbacks to capture module JSON responses
        callback = ResultAccumulator()
    
        kwargs = {
            "inventory": self.options["inventory_manager"],
            "variable_manager": self.options["variable_manager"],
            "loader": self.options["loader"],
            "stdout_callback": callback,
            "passwords": {"conn_pass": None, "become_pass": None},
        }
    
        kwargs_extra = {}
        # If we have an extra inventory, do the same that we did for the inventory
        if "extra_inventory_manager" in self.options:
            callback_extra = ResultAccumulator()
    
            kwargs_extra = {
                "inventory": self.options["extra_inventory_manager"],
                "variable_manager": self.options["extra_variable_manager"],
                "loader": self.options["extra_loader"],
                "stdout_callback": callback_extra,
                "passwords": {"conn_pass": None, "become_pass": None},
            }
    
        # create a pseudo-play to execute the specified module via a single task
        play_ds = {
            "name": "pytest-ansible",
            "hosts": self.options["host_pattern"],
            "become": self.options.get("become"),
            "become_user": self.options.get("become_user"),
            "gather_facts": "no",
            "tasks": [
                {
                    "action": {
                        "module": self.options["module_name"],
                        "args": complex_args,
                    },
                },
            ],
        }
    
        play = Play().load(
            play_ds,
            variable_manager=self.options["variable_manager"],
            loader=self.options["loader"],
        )
        play_extra = None
        if "extra_inventory_manager" in self.options:
            play_extra = Play().load(
                play_ds,
                variable_manager=self.options["extra_variable_manager"],
                loader=self.options["extra_loader"],
            )
    
        if HAS_CUSTOM_LOADER_SUPPORT:
            # Load the collection finder, unsupported, may change in future
            init_plugin_loader(COLLECTIONS_PATHS)
    
        # now create a task queue manager to execute the play
        tqm = None
        try:
            tqm = TaskQueueManager(**kwargs)
            tqm.run(play)
        finally:
            if tqm:
                tqm.cleanup()
    
        if "extra_inventory_manager" in self.options:
            tqm_extra = None
            try:
                tqm_extra = TaskQueueManager(**kwargs_extra)
                tqm_extra.run(play_extra)
            finally:
                if tqm_extra:
                    tqm_extra.cleanup()
    
        # Raise exception if host(s) unreachable
        # FIXME - if multiple hosts were involved, should an exception be raised?
        if callback.unreachable:
>           raise AnsibleConnectionFailure(
                "Host unreachable in the inventory",
                dark=callback.unreachable,
                contacted=callback.contacted,
            )
E           pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory

arg_value  = ['/var/src/sonic-mgmt_vms66-dual-t0-8101c1-03/ansible/library']
args       = ['pytest-ansible', 'vms66-7', '--connection=smart', '--become', '--become-method=sudo', '--become-user=root', ...]
argument   = 'module-path'
callback   = <pytest_ansible.module_dispatcher.v213.ResultAccumulator object at 0x7f94cb997850>
complex_args = {'dest': '/opt', 'src': 'scripts/icmp_responder.py'}
extra_hosts = []
hosts      = [vms66-7]
kwargs     = {'inventory': <ansible.inventory.manager.InventoryManager object at 0x7f94d040ef70>, 'loader': <ansible.parsing.datalo...ss': None}, 'stdout_callback': <pytest_ansible.module_dispatcher.v213.ResultAccumulator object at 0x7f94cb997850>, ...}
kwargs_extra = {}
module_args = ()
no_hosts   = False
play       = pytest-ansible
play_ds    = {'become': True, 'become_user': 'root', 'gather_facts': 'no', 'hosts': 'vms66-7', ...}
play_extra = None
self       = <pytest_ansible.module_dispatcher.v213.ModuleDispatcherV213 object at 0x7f94cb842ee0>
tqm        = <ansible.executor.task_queue_manager.TaskQueueManager object at 0x7f94d44868e0>
verbosity  = None
verbosity_syntax = '-vvvvv'

Test log after:

        if callback.unreachable:
>           raise AnsibleConnectionFailure(
                "Host unreachable in the inventory",
                dark=callback.unreachable,
                contacted=callback.contacted,
            )
E           pytest_ansible.errors.AnsibleConnectionFailure: Host unreachable in the inventory

/usr/local/lib/python3.8/dist-packages/pytest_ansible/module_dispatcher/v213.py:232: AnsibleConnectionFailure

During handling of the above exception, another exception occurred:

duthosts = [<MultiAsicSonicHost str2-8101c1-01>, <MultiAsicSonicHost str2-8101c1-02>], duthost = <MultiAsicSonicHost str2-8101c1-01>, ptfhost = <tests.common.devices.ptf.PTFHost object at 0x7fc316a756a0>
tbinfo = {'auto_recover': 'True', 'comment': 'yawenni', 'conf-name': 'vms18-dual-t0-8101c1-01', 'duts': ['str2-8101c1-01', 'str2-8101c1-02'], ...}
request = <SubRequest 'run_icmp_responder_session' for <Function test_lldp[str2-8101c1-01-None]>>

    @pytest.fixture(scope="session", autouse=True)
    def run_icmp_responder_session(duthosts, duthost, ptfhost, tbinfo, request):
        """Run icmp_responder on ptfhost session-wise on dualtor testbeds with active-active ports."""
        # No vlan is available on non-t0 testbed, so skip this fixture
        if "dualtor-mixed" not in tbinfo["topo"]["name"] and "dualtor-aa" not in tbinfo["topo"]["name"]:
            logger.info("Skip running icmp_responder at session level, "
                        "it is only for dualtor testbed with active-active mux ports.")
            yield
            return
    
        global icmp_responder_session_started
    
        update_linkmgrd_probe_interval(duthosts, tbinfo, PROBER_INTERVAL_MS)
        duthosts.shell("config save -y")
    
        duthost = duthosts[0]
        logger.debug("Copy icmp_responder.py to ptfhost '{0}'".format(ptfhost.hostname))
        try:
            ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)
        except AnsibleConnectionFailure as e:
            logger.error("Failed to copy files to ptfhost.")
            request.config.cache.set("ptfhost_unreachable", True)
>           pt_assert(False, "!!! ptfhost unreachable !!! Exception: {}".format(repr(e)))
E           Failed: !!! ptfhost unreachable !!! Exception: Host unreachable in the inventory

common/fixtures/ptfhost_utils.py:334: Failed

How did you do it?

Catch the exception in run_icmp_responder_session; when ptf becomes unreachable, this is the first fixture to fail. Set session.exitstatus to 16 so that run_test.sh is aware of the failure and exits the pipeline early.
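
For readers unfamiliar with the mechanism, here is a minimal sketch of how a cache flag set in a fixture could be turned into a process exit code at the end of the session. It is not the code from this PR: the constant names and the exact hook body are assumptions, and only the cache key "ptfhost_unreachable" and the value 16 come from the description and log above. It relies on pytest's `pytest_sessionfinish` hook and the `config.cache` `get`/`set` API.

```python
# Hypothetical conftest.py fragment; names below are illustrative assumptions.
PTFHOST_UNREACHABLE_RC = 16                      # exit code named in the PR title
PTFHOST_UNREACHABLE_KEY = "ptfhost_unreachable"  # cache key seen in the test log


def pytest_sessionfinish(session, exitstatus):
    """Override the exit status if a fixture flagged ptfhost as unreachable."""
    cache = session.config.cache
    if cache.get(PTFHOST_UNREACHABLE_KEY, None):
        # Clear the flag so a later run on a healthy testbed is not affected.
        cache.set(PTFHOST_UNREACHABLE_KEY, None)
        session.exitstatus = PTFHOST_UNREACHABLE_RC
```

A wrapper script such as run_test.sh could then compare the pytest return code against 16 and stop instead of continuing with the remaining test scripts.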

How did you verify/test it?

Used run_test.sh to test with ptf unreachable.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
@ZhaohuiS ZhaohuiS requested review from a team and wangxin as code owners September 5, 2025 10:08
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ZhaohuiS ZhaohuiS requested a review from lolyu September 5, 2025 10:09
Collaborator

@lolyu lolyu left a comment

LGTM, thanks Zhaohui

Comment thread tests/common/fixtures/ptfhost_utils.py Outdated
    except BaseException as e:
        logger.error("Failed to copy files to ptfhost.")
        request.config.cache.set("ptfhost_unreachable", True)
        pt_assert(False, "!!! ptfhost unreachable !!! Exception: {}".format(repr(e)))
Collaborator

How do you know the Exception is definitely PTF unreachable?

Contributor Author

@wangxin Most of the time, an unreachable PTF is what causes the file copy to fail, but you are right. I changed the wording to "exception".
Please review it again, thanks.

Contributor Author

@wangxin Thank you for your suggestion.
It turns out pytest_ansible.errors.AnsibleConnectionFailure works, but ansible.errors.AnsibleConnectionFailure doesn't work.

Correct:
from pytest_ansible.errors import AnsibleConnectionFailure

Wrong:
from ansible.errors import AnsibleConnectionFailure
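
A likely explanation, sketched below with stand-in classes rather than the real libraries: an `except` clause matches on class identity and inheritance, not on the class name. If the two `AnsibleConnectionFailure` classes are separate classes that merely share a name (an assumption here, consistent with the behavior reported above), a handler for one will not catch the other. All class and function names in this sketch are hypothetical.

```python
class AnsibleError(Exception):
    """Stand-in for a shared base class (assumption for illustration)."""


def make_connection_failure(module_name):
    """Build a distinct class named AnsibleConnectionFailure for module_name."""
    return type("AnsibleConnectionFailure", (AnsibleError,),
                {"__module__": module_name})


# Two unrelated classes with the same name, labelled with different modules.
PytestAnsibleFailure = make_connection_failure("pytest_ansible.errors")
AnsibleCoreFailure = make_connection_failure("ansible.errors")


def caught_by(handler_cls, raised_cls):
    """Return True if `except handler_cls` catches an instance of raised_cls."""
    try:
        raise raised_cls("Host unreachable in the inventory")
    except handler_cls:
        return True
    except AnsibleError:
        # Fell through the specific handler; only the shared base matched.
        return False
```

Calling `caught_by(AnsibleCoreFailure, PytestAnsibleFailure)` returns False even though both classes print with the same short name, which is why the import has to come from the package that actually raises the exception (pytest_ansible in the traceback above).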

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread tests/common/fixtures/ptfhost_utils.py Outdated
    ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)
    try:
        ptfhost.copy(src=os.path.join(SCRIPTS_SRC_DIR, ICMP_RESPONDER_PY), dest=OPT_DIR)
    except BaseException as e:
Collaborator

Only the AnsibleConnectionFailure exception means that the PTF is unreachable. It is better to catch AnsibleConnectionFailure here and set "ptfhost_exception" to True. Other exceptions could indicate different issues and should not be treated as PTF unreachable.
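
To illustrate the point, here is a minimal sketch of the narrow-catch pattern, using a stand-in exception class and a plain dict in place of the pytest cache. `ConnectionFailure`, `copy_with_flag`, and `flags` are hypothetical names, not part of sonic-mgmt or pytest-ansible.

```python
class ConnectionFailure(Exception):
    """Stand-in for pytest_ansible.errors.AnsibleConnectionFailure (hypothetical)."""


def copy_with_flag(copy, flags):
    """Run copy(); flag the host as unreachable only on a connection failure."""
    try:
        copy()
    except ConnectionFailure as e:
        # Only this exception type is treated as "ptfhost unreachable".
        flags["ptfhost_unreachable"] = True
        raise RuntimeError("ptfhost unreachable: {!r}".format(e)) from e
    # Any other exception propagates unchanged and leaves the flag unset.
```

With `except BaseException`, an unrelated error such as a bad source path (or even a KeyboardInterrupt) would also set the flag and take the whole testbed out of rotation.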

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wangxin wangxin merged commit e81625f into sonic-net:master Sep 9, 2025
20 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Sep 10, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
@mssonicbld
Collaborator

Cherry-pick PR to 202505: #20601

mssonicbld pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
xixuej pushed a commit to xixuej/sonic-mgmt that referenced this pull request Sep 17, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
vidyac86 pushed a commit to vidyac86/sonic-mgmt that referenced this pull request Oct 23, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Aharon Malkin <amalkin@nvidia.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Guy Shemesh <gshemesh@nvidia.com>
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Jan 28, 2026
Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Lakshmi Yarramaneni <lakshmi@nexthop.ai>
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
…0539)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Yael Tzur <ytzur@nvidia.com>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026
…0539)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
ZhaohuiS added a commit to ZhaohuiS/sonic-mgmt that referenced this pull request Apr 2, 2026
When a DUT is unreachable, session-scoped fixtures like add_mgmt_test_mark
fail with AnsibleConnectionFailure. But unlike the duthosts fixture (PR sonic-net#10243
which catches BaseException and exits with code 15), add_mgmt_test_mark has no
error handling. The exception propagates as a generic test error (exit code 1)
and the scheduler does not remove the bad testbed.

This commit applies the same pattern as PR sonic-net#10243:

1. Wrap add_mgmt_test_mark in try/except BaseException, setting the
   duthosts_fixture_failed cache flag so pytest exits with code 15.
   This reuses the existing mechanism that run_tests.sh already checks.

2. Extend the ptfhost exception handling in run_icmp_responder_session
   (PR sonic-net#20539) to also cover the remaining ptfhost.copy() and
   ptfhost.shell() calls that were left unprotected.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
ZhaohuiS added a commit to ZhaohuiS/sonic-mgmt that referenced this pull request Apr 2, 2026
When a DUT is unreachable, session-scoped fixtures like add_mgmt_test_mark
fail with AnsibleConnectionFailure. But unlike the duthosts fixture (PR sonic-net#10243
which catches BaseException and exits with code 15), add_mgmt_test_mark has no
error handling. The exception propagates as a generic test error (exit code 1)
and the scheduler does not remove the bad testbed.

This commit applies the same pattern as PR sonic-net#10243:

1. Wrap add_mgmt_test_mark in try/except BaseException, setting the
   duthosts_fixture_failed cache flag so pytest exits with code 15.

2. Extend the ptfhost exception handling in run_icmp_responder_session
   (PR sonic-net#20539) to also cover the remaining ptfhost.copy() and
   ptfhost.shell() calls that were left unprotected.

3. Wrap do_checks() in the post-test sanity check phase with try/except
   so that if the DUT becomes unreachable during a test, the post-check
   crash still sets post_sanity_check_failed (exit code 11) instead of
   silently propagating as a generic error.
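
Item 3 might look roughly like the sketch below. `run_post_checks` and the dictionary used as a cache are hypothetical; only `do_checks` and the `post_sanity_check_failed` flag come from the message above, and the real `do_checks` signature is not shown here.

```python
def run_post_checks(do_checks, cache):
    """Run post-test checks; treat a crash the same as a failed check."""
    try:
        results = do_checks()
    except Exception as exc:
        # The DUT may have become unreachable mid-test. Record the flag
        # instead of letting the error surface as a generic failure.
        cache["post_sanity_check_failed"] = True
        return [{"failed": True, "error": repr(exc)}]
    if any(r.get("failed") for r in results):
        cache["post_sanity_check_failed"] = True
    return results
```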

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
ZhaohuiS added a commit to ZhaohuiS/sonic-mgmt that referenced this pull request Apr 2, 2026
When a DUT or PTF host is unreachable, the test framework should exit with
an error code that the scheduler (ElasticTest) recognizes so it can remove
the bad testbed from the test plan. ElasticTest checks for exit codes
[10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from
PR sonic-net#20539 was silently ignored.

Changes:

1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark
   fixture and set duthosts_fixture_failed flag (exit code 15), following
   the same pattern as PR sonic-net#10243 for duthosts fixture failures.
   Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC.

2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15
   (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it
   and kicks the testbed out instead of assigning more tests.

3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity
   check with try/except so that if the DUT becomes unreachable during
   a test, the post-check crash sets post_sanity_check_failed (exit
   code 11) instead of propagating as exit code 1.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
wangxin pushed a commit that referenced this pull request Apr 3, 2026
…15) (#23549)

* Fix DUT/PTF unreachable not causing early exit and testbed removal

When a DUT or PTF host is unreachable, the test framework should exit with
an error code that the scheduler (ElasticTest) recognizes so it can remove
the bad testbed from the test plan. ElasticTest checks for exit codes
[10, 11, 12, 15] but NOT 16, so the PTF unreachable exit code 16 from
PR #20539 was silently ignored.

Changes:

1. conftest.py: Catch AnsibleConnectionFailure in add_mgmt_test_mark
   fixture and set duthosts_fixture_failed flag (exit code 15), following
   the same pattern as PR #10243 for duthosts fixture failures.
   Rename DUTHOSTS_FIXTURE_FAILED_RC to HOST_FIXTURE_FAILED_RC.

2. ptfhost_utils.py: Change PTFHOST_EXCEPTION_RC from 16 to 15
   (renamed to HOST_FIXTURE_FAILED_RC) so ElasticTest recognizes it
   and kicks the testbed out instead of assigning more tests.

3. sanity_check/__init__.py: Wrap do_checks() in the post-test sanity
   check with try/except so that if the DUT becomes unreachable during
   a test, the post-check crash sets post_sanity_check_failed (exit
   code 11) instead of propagating as exit code 1.

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>

* Merge pytest_sessionfinish hooks for DUT and PTF unreachable

Both conftest.py and ptfhost_utils.py had separate pytest_sessionfinish
hooks checking different cache keys (duthosts_fixture_failed and
ptfhost_exception). Since both now use the same exit code 15
(HOST_FIXTURE_FAILED_RC), merge them into a single hook in conftest.py
and remove the duplicate from ptfhost_utils.py.
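
A merged hook along these lines is one way to read the change. It is a sketch under assumptions: `HOST_FAILURE_KEYS` is a hypothetical name, and the actual hook in conftest.py may do more than this.

```python
HOST_FIXTURE_FAILED_RC = 15

# Cache keys named in the commit message above.
HOST_FAILURE_KEYS = ("duthosts_fixture_failed", "ptfhost_exception")


def pytest_sessionfinish(session, exitstatus):
    """Single hook: any host-unreachable flag maps to the same exit code."""
    cache = session.config.cache
    if any(cache.get(key, False) for key in HOST_FAILURE_KEYS):
        session.exitstatus = HOST_FIXTURE_FAILED_RC
```

With one hook and one exit code, the scheduler only needs to recognize a single value for both DUT and PTF failures.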

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Apr 3, 2026
…15) (sonic-net#23549)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Apr 13, 2026
…15) (sonic-net#23549)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: opcoder0 <110003254+opcoder0@users.noreply.github.com>
mssonicbld added a commit that referenced this pull request Apr 15, 2026
…al (unify exit code 15) (#23619)

### Description of PR

Summary:
Fix DUT and PTF host unreachable errors not causing testbed removal from
test plan.

**Root cause:** ElasticTest only kicks testbeds for exit codes `[10, 11,
12, 15]` (defined in `test_plan_constants.py`). The PTF unreachable exit
code 16 from PR #20539 is NOT in this list, so it was silently ignored.
And `add_mgmt_test_mark` had no error handling at all — when DUT is
unreachable but `duthosts` init succeeded from cached facts, the
`AnsibleConnectionFailure` propagated as exit code 1.

**Real-world impact:**
- DUT unreachable: `testbed-bjw2-can-t0-7260-2` in testplan
`69c736202878597161a7a636` — 117 tests failed across 2 modules
- PTF unreachable: `vmsvc1-dual-t0-7050-1` in testplan
`69ae61e01ba706f7e92aa4f5` — testbed kept getting assigned tests despite
PTF being down

### Type of change

- [x] Bug fix

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [ ] 202511

### Approach
#### What is the motivation for this PR?

PR #10243 catches `BaseException` in `duthosts` fixture init → exit code
15 → testbed kicked out. But when `duthosts` succeeds from cached facts,
the first SSH attempt in `add_mgmt_test_mark` fails without any handler.

PR #20539 catches PTF `AnsibleConnectionFailure` → exit code 16. But
ElasticTest only checks `[10, 11, 12, 15]`, so exit code 16 is ignored
and the testbed stays in the test plan.
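
The mismatch can be illustrated with a few lines of Python. This is only a sketch: `KICK_TESTBED_RET_CODES` and `should_kick_testbed` are hypothetical names, and the list of codes is copied from this description rather than from the ElasticTest source.

```python
# Hypothetical name; the values are the ones quoted in the description.
KICK_TESTBED_RET_CODES = (10, 11, 12, 15)


def should_kick_testbed(return_code):
    """Return True if the scheduler would remove the testbed for this code."""
    return return_code in KICK_TESTBED_RET_CODES
```

Under this check, `should_kick_testbed(16)` is false, so a PTF failure reported as 16 leaves the testbed in the plan, while the same failure reported as 15 removes it.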

#### How did you do it?

1. **`tests/conftest.py`**: Catch `AnsibleConnectionFailure` in
`add_mgmt_test_mark`, set `duthosts_fixture_failed` cache flag → exit
code 15 (same pattern as PR #10243). Rename `DUTHOSTS_FIXTURE_FAILED_RC`
to `HOST_FIXTURE_FAILED_RC` to reflect it covers both DUT and PTF hosts.

2. **`tests/common/fixtures/ptfhost_utils.py`**: Change
`PTFHOST_EXCEPTION_RC` from 16 to 15 (renamed to
`HOST_FIXTURE_FAILED_RC`) so ElasticTest recognizes it and kicks the
testbed out.

3. **`tests/common/plugins/sanity_check/__init__.py`**: Wrap
`do_checks()` in post-test sanity check with try/except so DUT
unreachable during a test sets `post_sanity_check_failed` (exit code 11)
instead of propagating as exit code 1.

#### How did you verify/test it?

- Analyzed real failure logs from ElasticTest for both DUT and PTF
unreachable scenarios
- Verified ElasticTest scheduler code confirms exit code 15 is in
`DUTHOST_UNREACHABLE_RET_CODES`
- Confirmed `run_tests.sh` already handles exit code 15

#### Any platform specific information?

N/A

#### Supported testbed topology if it's a new test case?

N/A

### Documentation

N/A

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: mssonicbld <sonicbld@microsoft.com>
Co-authored-by: Zhaohui Sun <94606222+ZhaohuiS@users.noreply.github.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Apr 20, 2026
…0539)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Apr 20, 2026
…15) (sonic-net#23549)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>
sudarshankumar4893 pushed a commit to sudarshankumar4893/pervnetbgp-tests that referenced this pull request Apr 20, 2026
…0539)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Sudarshan Kumar (from Dev Box) <sudakumar@microsoft.com>
sudarshankumar4893 pushed a commit to sudarshankumar4893/pervnetbgp-tests that referenced this pull request Apr 20, 2026
…15) (sonic-net#23549)

Signed-off-by: Zhaohui Sun <zhaohuisun@microsoft.com>
Signed-off-by: Sudarshan Kumar (from Dev Box) <sudakumar@microsoft.com>