[Monit] Monitoring the running status of containers. #6251

Merged
lguohan merged 18 commits into sonic-net:master from yozhao101:monitoring_containers on Jan 8, 2021

@yozhao101

- Why I did it
This PR aims to monitor the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failure flag is cleared manually with the command sudo systemctl reset-failed <container_name>.

- How I did it
We use Monit to monitor a script. The script generates the list of containers that are expected to be running and compares it with the containers that are currently running. If any container that was expected to run is not running, an alert message is written to syslog.
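The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function names, the `log` callable, and the wording of the alert message are all assumptions made for the example, and the two container lists are passed in rather than discovered from the system.

```python
def find_missing_containers(expected, running):
    """Return the expected containers that are not running, sorted by name."""
    return sorted(set(expected) - set(running))


def report_missing(expected, running, log):
    """Log one alert naming every missing container and return the list.

    `log` is any callable that accepts a single message string
    (hypothetical interface chosen for this sketch).
    """
    missing = find_missing_containers(expected, running)
    if missing:
        # The message format here is an assumption, not the real script's text.
        log("Expected containers not running: " + ", ".join(missing))
    return missing
```

On a real device, `log` could be a thin wrapper such as `lambda msg: syslog.syslog(syslog.LOG_ERR, msg)` using Python's standard `syslog` module, so that the alert ends up in syslog as described.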

- How to verify it
I tested this feature on two lab devices: str-a7050-acs-3, which has a single ASIC, and str2-n3164-acs-3, which is a multi-ASIC device. I first stopped a container manually by running the command sudo systemctl stop <container_name>, then checked whether an alert message appeared in syslog.

- Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [x] 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

Write an alert message to syslog if a container that was expected
to run is not running.

Signed-off-by: Yong Zhao <yozhao@microsoft.com>

@jleveque jleveque left a comment

See my inline comments.

Also, I suggest renaming the monitoring_containers script to container_checker.

@yozhao101 yozhao101 marked this pull request as ready for review December 25, 2020 21:04
@yozhao101 yozhao101 requested review from abdosi and lguohan December 29, 2020 05:24

@abdosi abdosi left a comment

LGTM

@lguohan lguohan merged commit 04cd1d6 into sonic-net:master Jan 8, 2021
lguohan pushed a commit that referenced this pull request Jan 9, 2021
yozhao101 added a commit to sonic-net/sonic-mgmt that referenced this pull request Jan 27, 2021
What is the motivation for this PR?
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit to monitor the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failure flag is cleared manually with the command sudo systemctl reset-failed <container_name>.

How did you do it?
This pytest script tests container_checker in the following steps:

1. Stop the containers explicitly.
2. Check whether the names of the stopped containers appear in the Monit alert message.
3. Restart the stopped containers.
4. Post-check that all critical processes are running and that the BGP sessions are established.

How did you verify/test it?
I tested this pytest script on a virtual testbed.

Any platform specific information?
N/A

Supported testbed topology if it's a new test case?
N/A
yozhao101 added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 19, 2021
Signed-off-by: Yong Zhao <yozhao@microsoft.com>

Description of PR
Summary:
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

Fixes # (issue)

Type of change
[ ] Bug fix
[ ] Testbed and Framework (new/improvement)
[x] Test case (new/improvement)

Approach

What is the motivation for this PR?
This PR tests the container checker feature, which was introduced in sonic-net/sonic-buildimage#6251.

The container_checker script is run periodically by Monit to monitor the running status of each container. The auto-restart feature is currently enabled: if a critical process exits unexpectedly, its container is restarted. If a container is restarted 3 times within 20 minutes, it will not run again until the failure flag is cleared manually with the command sudo systemctl reset-failed <container_name>.

How did you do it?
This pytest script tests container_checker in the following steps:

1. Stop the containers explicitly.
2. Check whether the names of the stopped containers appear in the Monit alert message.
3. Restart the containers by calling config_reload(...).
4. Post-check that all critical processes are running and that the BGP sessions are established.
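The check in the second step could look roughly like the helper below. It is only a sketch: the function name is hypothetical, the alert text is assumed to list container names as plain words, and the real test reads the alert from the device rather than from a string literal.

```python
import re


def containers_missing_from_alert(alert_message, stopped_containers):
    """Return the stopped containers whose names are absent from the alert.

    Names are matched as whole tokens, so "bgp" is not considered present
    just because "bgp0" appears in the message.
    """
    mentioned = set(re.findall(r"[A-Za-z0-9_-]+", alert_message))
    return [name for name in stopped_containers if name not in mentioned]
```

A test would then assert that the returned list is empty, which means every stopped container was named in the alert.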

How did you verify/test it?
I tested this PR on the physical testbed str-dx010-acs-1, which was running an image built from the public master branch.
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021