[pytest] Test the autorestart feature of containers#1769
[pytest] Test the autorestart feature of containers#1769yxieca merged 44 commits intosonic-net:masterfrom
Conversation
process portsyncd and a non-critical process rsyslogd. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
and use 'show container feature autorestart' to get the containers which were implemented with the autorestart feature. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
critical process and a non-critical process. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
a critical group. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…critical program. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
need set the autorestart features of containers to 'enabled' if they are in the 'disabled' state and restore this 'disabled' state after testing. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
get_program_state(...) Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
of command "docker exec <command_name> supervisorctl" not "docker exec <container_name> ps -ax" since the name in critical_processes file is the program name defined in supervisord.file. At the same time, I also add functionality to reset failed flag once the container reaches the limitation of restart times which was set in service file. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
to improve the readability. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
is killed in the specified container. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
since the exit code is the execution result of 'kill' command not the output. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
'is_container_running(...)'. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
|
Looks good to me. @yxieca to review again, please. |
| else: | ||
| pytest.fail("Syntax of the line '{}' in critical_processes file is incorrect.".format(line)) | ||
|
|
||
| return critical_group_list, critical_process_list |
There was a problem hiding this comment.
What is the difference between this function and https://github.com/Azure/sonic-mgmt/blob/695a9a53b75c90878c8b8de022479a53af1240a0/tests/common/devices.py#L371? can you merge this into a common place?
This can come as another PR.
There was a problem hiding this comment.
The difference between these two functions is all_critical_process_status(...) will call the function critical_process_status(...) which first get all the critical process list and then retrieve their running status for each containers in SONiC, get_critical_group_and_process_list(...) will only get all critical process list for a specified container.
@jleveque Can we break the function critical_process_status(...) in common/devices.py into two sub-functions? One function is used to get all critical process list of a container and another function can be created to retrieve the running status of these critical processes.
There was a problem hiding this comment.
@yozhao101: Yes. I think splitting the function into two sub-functions should be fine.
There was a problem hiding this comment.
I will do this by creating a separate PR.
| if container_state == "disabled": | ||
| disabled_containers.append(container_name) | ||
|
|
||
| return disabled_containers |
There was a problem hiding this comment.
There was a problem hiding this comment.
FYI, @lguohan removed the CONTAINER_FEATURE table recently and merged that info into the FEATURE table. Thus, this common function needs to be updated to get the new data.
| if container_state in ["enabled", "disabled"]: | ||
| container_autorestart_states[container_name] = container_state | ||
|
|
||
| return container_autorestart_states |
There was a problem hiding this comment.
This function can be moved to common/devices.py as a shared function if there is no such member function yet.
There was a problem hiding this comment.
Will moved this function to common/devices.py by creating a separate PR.
| container_autorestart_states = get_container_autorestart_states(duthost) | ||
| disabled_containers = get_disabled_container_list(duthost) | ||
|
|
||
| for container_name in container_autorestart_states.keys(): |
There was a problem hiding this comment.
It would better to make container name a parameter, there should be a way to make container list a parameter fixture. So that each container would have a unique test case to mark success or failure.
This change can come as a follow up PR.
| program_status, program_pid) | ||
|
|
||
| critical_group_list, critical_process_list = get_critical_group_and_process_list(duthost, container_name) | ||
| for critical_process in critical_process_list: |
There was a problem hiding this comment.
Same as containers, this could be parameterized.
There was a problem hiding this comment.
One thing this test misses, is that you should have a clean up method in the end to check if all critical processes are running, if not, then you need to recover the DUT so that next test case will have a clean run. I think you should address this issue, the other issues can come back as follow up PR.
'get_critical_group_and_process_list(...)' into common/devices.py. Signed-off-by: Yong Zhao <yozhao@microsoft.com>
"get_critical_group_and_process_lists(...)" Signed-off-by: Yong Zhao <yozhao@microsoft.com>
#### Why I did it Update submodule pointer for swss to include recent changes 4f1d726 [portsorch] fix errors when moving port from one lag to another. (sonic-net#1797) ae44701 [orchagent] Put port configuration to APPL_DB according to autoneg mode (sonic-net#1769) 5295f91 Add failure handling for SAI get operations (sonic-net#1768) 7c7c451 Revert recirc port change (sonic-net#1813) 5528ebf Cleanup code (sonic-net#1814)
Description of PR
We want to test the autorestart feature of five containers swss, syncd, teamd, dhcp_relay and bgp. Specifically we want to test whether the container can be automatically restarted if one of critical process is killed. Currently I tested two test cases manually for swss container. One test case is to kill a critical process
portsyncdand another is to kill a non-critical processrsyslogd.Summary:
Fixes # (issue)
Type of change
Approach
What is the motivation for this PR?
How did you do it?
How did you verify/test it?
I tested it on a virtual testbed which is installed SONiC image built from public master branch.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation