Skip to content

[pytest] Test the autorestart feature of containers#1769

Merged
yxieca merged 44 commits intosonic-net:masterfrom
yozhao101:test_autorestart_feature_containers
Sep 14, 2020
Merged

[pytest] Test the autorestart feature of containers#1769
yxieca merged 44 commits intosonic-net:masterfrom
yozhao101:test_autorestart_feature_containers

Conversation

@yozhao101
Copy link
Copy Markdown
Contributor

@yozhao101 yozhao101 commented Jun 12, 2020

Description of PR

We want to test the autorestart feature of five containers swss, syncd, teamd, dhcp_relay and bgp. Specifically we want to test whether the container can be automatically restarted if one of critical process is killed. Currently I tested two test cases manually for swss container. One test case is to kill a critical process portsyncd and another is to kill a non-critical process rsyslogd.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

What is the motivation for this PR?

How did you do it?

How did you verify/test it?

I tested it on a virtual testbed which is installed SONiC image built from public master branch.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

process portsyncd and a non-critical process rsyslogd.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yozhao101 yozhao101 requested review from jleveque, lguohan and yxieca June 12, 2020 17:55
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
and use 'show container feature autorestart' to get the containers which
were implemented with the autorestart feature.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
critical process and a non-critical process.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
a critical group.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
…critical program.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
need set the autorestart features of containers to 'enabled' if they
are in the 'disabled' state and restore this 'disabled' state after
testing.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
get_program_state(...)

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
yozhao101 added 2 commits July 2, 2020 15:59
of command "docker exec <command_name> supervisorctl" not "docker exec
<container_name> ps -ax" since the name in critical_processes file is
the program name defined in supervisord.file. At the same time, I
also add functionality to reset failed flag
once the container reaches the limitation of restart times which was set in
service file.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
to improve the readability.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
is killed in the specified container.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
since the exit code is the execution result of 'kill' command not the output.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
'is_container_running(...)'.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@jleveque
Copy link
Copy Markdown
Contributor

jleveque commented Aug 4, 2020

Looks good to me. @yxieca to review again, please.

else:
pytest.fail("Syntax of the line '{}' in critical_processes file is incorrect.".format(line))

return critical_group_list, critical_process_list
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between this function and https://github.com/Azure/sonic-mgmt/blob/695a9a53b75c90878c8b8de022479a53af1240a0/tests/common/devices.py#L371? can you merge this into a common place?

This can come as another PR.

Copy link
Copy Markdown
Contributor Author

@yozhao101 yozhao101 Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between these two functions is all_critical_process_status(...) will call the function critical_process_status(...) which first get all the critical process list and then retrieve their running status for each containers in SONiC, get_critical_group_and_process_list(...) will only get all critical process list for a specified container.

@jleveque Can we break the function critical_process_status(...) in common/devices.py into two sub-functions? One function is used to get all critical process list of a container and another function can be created to retrieve the running status of these critical processes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yozhao101: Yes. I think splitting the function into two sub-functions should be fine.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this by creating a separate PR.

if container_state == "disabled":
disabled_containers.append(container_name)

return disabled_containers
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, @lguohan removed the CONTAINER_FEATURE table recently and merged that info into the FEATURE table. Thus, this common function needs to be updated to get the new data.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if container_state in ["enabled", "disabled"]:
container_autorestart_states[container_name] = container_state

return container_autorestart_states
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can be moved to common/devices.py as a shared function if there is no such member function yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will moved this function to common/devices.py by creating a separate PR.

container_autorestart_states = get_container_autorestart_states(duthost)
disabled_containers = get_disabled_container_list(duthost)

for container_name in container_autorestart_states.keys():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would better to make container name a parameter, there should be a way to make container list a parameter fixture. So that each container would have a unique test case to mark success or failure.

This change can come as a follow up PR.

program_status, program_pid)

critical_group_list, critical_process_list = get_critical_group_and_process_list(duthost, container_name)
for critical_process in critical_process_list:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as containers, this could be parameterized.

Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing this test misses, is that you should have a clean up method in the end to check if all critical processes are running, if not, then you need to recover the DUT so that next test case will have a clean run. I think you should address this issue, the other issues can come back as follow up PR.

'get_critical_group_and_process_list(...)' into common/devices.py.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
"get_critical_group_and_process_lists(...)"

Signed-off-by: Yong Zhao <yozhao@microsoft.com>
@yxieca yxieca merged commit 5d40492 into sonic-net:master Sep 14, 2020
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
#### Why I did it
Update submodule pointer for swss to include recent changes

4f1d726 [portsorch] fix errors when moving port from one lag to another. (sonic-net#1797)
ae44701 [orchagent] Put port configuration to APPL_DB according to autoneg mode (sonic-net#1769)
5295f91 Add failure handling for SAI get operations (sonic-net#1768)
7c7c451 Revert recirc port change (sonic-net#1813)
5528ebf Cleanup code (sonic-net#1814)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants