
test: Create a tedge-watchdog test suite#3138

Merged
Bravo555 merged 4 commits into thin-edge:main from Bravo555:fix/watchdog-inspect-health-status
Oct 24, 2024

Conversation

@Bravo555
Member

Proposed changes

Created a test suite checking if tedge-watchdog correctly prevents services from being restarted if they're healthy and makes systemd restart them when they're unhealthy.

We do have tests for other services that utilise tedge-watchdog, which check that these services react to the watchdog's health check requests and otherwise integrate with it, but they test those other services, not the watchdog itself. As a separate component, tedge-watchdog very much needs a test suite of its own validating its behaviour, particularly because it doesn't get much attention and isn't extensively documented.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

This test suite was motivated by the issue that was found in tedge-watchdog where it didn't look at status before notifying systemd that service is healthy. See: #3132 (comment)

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or because of some other issue.

Now I'll try to establish why this happens; the implementation might end up being correct, but I still think it would be good to have a dedicated test suite for tedge-watchdog.

@Bravo555 Bravo555 added theme:monitoring Theme: Service monitoring and watchdogs theme:testing Theme: Testing labels Sep 25, 2024
@Bravo555 Bravo555 requested review from a team and gligorisaev as code owners September 25, 2024 14:09
@Bravo555 Bravo555 temporarily deployed to Test Pull Request September 25, 2024 14:09 — with GitHub Actions Inactive
@github-actions
Contributor

github-actions bot commented Sep 25, 2024

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 513 | 0 | 2 | 513 | 100 | 1h26m53.498628999s |

@albinsuresh
Contributor

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or of some other issue.

I finally remember why it was written without an explicit status value check. If a service responds to the health check request, the watchdog implicitly assumes that the service is healthy, because it received a response at all. If the service were genuinely down, it wouldn't have sent any response, not even a down response: the down status message is published only once, by the MQTT broker, when the service disconnects. But tedge-watchdog does not rely on those retained messages; it actively sends a health check and awaits its response. That's why the whole scheme works even without any explicit status field check. However, that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown. So the logic needs to be updated to account for that.
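The scheme described above (trust the arrival of a response rather than retained status messages) can be sketched very roughly with a channel standing in for the MQTT request/response pair. Names here are illustrative, not the actual tedge-watchdog code:

```rust
use std::sync::mpsc;
use std::time::Duration;

// A dead service sends nothing, so a timeout, not a retained "down"
// message, is what signals genuine unavailability to the watchdog.
fn service_is_responsive(rx: &mpsc::Receiver<String>) -> bool {
    rx.recv_timeout(Duration::from_millis(100)).is_ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Healthy service: it replies to the health check.
    tx.send(r#"{"status":"up"}"#.to_string()).unwrap();
    assert!(service_is_responsive(&rx));

    // Dead service: no reply ever arrives, the wait fails.
    drop(tx);
    assert!(!service_is_responsive(&rx));
    println!("ok");
}
```

This is also why the missing status check went unnoticed for so long: any arriving response counted as proof of life.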

@didier-wenzek
Contributor

But that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown.

What's the purpose of this unknown status?

@albinsuresh
Contributor

But that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown.

What's the purpose of this unknown status?

To cover buggy clients that report "some" health status messages on those specific topics, but not in the desired format. Interpreting such erroneous messages as down would be wrong, as something is still there responding to health requests, meaning it's not really "down". So we want to differentiate such components from those that are genuinely down. This might help the user fix the erroneous health status reporting of their component.

@Bravo555
Member Author

Bravo555 commented Sep 26, 2024

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or of some other issue.

Found the problem that caused tedge-watchdog to disconnect. It wasn't due to "status": "down", but due to the "time" field. This is the payload sent in the test:

Execute Command
...    tedge mqtt pub 'te/device/main/service/${SERVICE_NAME}/status/health' '{"status": "down", "pid": ${pid}, "time": "${before_restart_timestamp}"}'

time is a string, so we expect it to be in ISO format, but it's actually a UNIX timestamp instead. This returns an error from the get_latest_health_status_message function and exits the watchdog for the service. We should probably ignore invalid messages instead of exiting the watchdog for the service.

Changing time from a string to a float in the test (i.e. removing the quotes) makes the test fail as expected.


If a service responds to the health check request, the watchdog implicitly assumes that the service is healthy, as it received a response. If the service was genuinely down, it wouldn't have gotten any response, not even the down response

But, that logic was written with only up and down statuses considered, and not account for newly added ones like unknown. So, the logic needs to be updated to account for that.

So it seems to me there are 2 TODOs at this moment:

  • change deserialization logic so watchdog doesn't exit if it fails to deserialize time
  • update the logic so that we notify systemd only on "status": "up"; down/unknown are treated as unhealthy and we don't send notifications for them
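As a rough sketch of what those two TODOs could look like (hypothetical names; the real implementation deserializes with serde and would parse ISO-8601 times):

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Up,
    Down,
    Unknown,
}

// Malformed or unrecognized status values map to Unknown instead of
// being treated as fatal parse errors.
fn parse_status(raw: &str) -> Status {
    match raw {
        "up" => Status::Up,
        "down" => Status::Down,
        _ => Status::Unknown,
    }
}

// TODO 2: notify systemd only for an explicit "up"; down/unknown (and
// no response at all) are unhealthy, so systemd restarts the service.
fn should_notify_systemd(response: Option<Status>) -> bool {
    matches!(response, Some(Status::Up))
}

// TODO 1: a bad `time` field should invalidate one message, not exit
// the per-service watchdog task. Returning None lets the caller skip
// the message and keep waiting for a valid response.
fn parse_time(raw: &str) -> Option<f64> {
    // Accept a plain unix timestamp; an ISO-8601 parser would go here too.
    raw.parse::<f64>().ok()
}

fn main() {
    assert!(should_notify_systemd(Some(parse_status("up"))));
    assert!(!should_notify_systemd(Some(parse_status("down"))));
    assert!(!should_notify_systemd(Some(parse_status("not-a-status"))));
    assert!(!should_notify_systemd(None)); // no response: unhealthy
    assert_eq!(parse_time("1729519200"), Some(1729519200.0));
    assert_eq!(parse_time("2024-10-21T15:04:00Z"), None); // skipped, not fatal
    println!("ok");
}
```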

Comment on lines +38 to +39
Execute Command
... tedge mqtt pub 'te/device/main/service/${SERVICE_NAME}/status/health' '{"status": "down", "pid": ${pid}, "time": "${before_restart_timestamp}"}'
Contributor


This may not be fully deterministic as the watchdog would be waiting for a response after it sends the health check (with a later timestamp) and just ignores other random health messages that it receives with a lower timestamp. So, we need to make sure that this response reaches the watchdog after the health check request is received, with a higher timestamp, and also ensure that it reaches before the real running tedge-agent responds with its up status.

Contributor


A way to disconnect tedge-watchdog from the MQTT topics used by the agent and the mapper (for testing purposes) is to configure tedge-watchdog with a fake topic root (say test-te).

Contributor

@gligorisaev gligorisaev left a comment


Finding

  1. Non-deterministic Behavior:

• The watchdog sends a health check request and expects a response with a higher timestamp (indicating a more recent status). However, since the watchdog may receive multiple health messages (both valid and outdated), it must be able to discern between these messages.

• The risk is that the watchdog might ignore random health messages with lower timestamps that arrive after its health check, potentially leading to inconsistencies in how it processes the health state of the service.

  2. Ensuring the Right Order of Events:

• To maintain determinism, the test should ensure that the watchdog receives the response to its health check request in the correct sequence:

• First, the watchdog sends a health check.
• Second, it receives a response with a timestamp that is newer than the health check request.
• The watchdog should not accept outdated health messages with lower timestamps that may incorrectly reflect the state of the service.

  3. Race Condition Risk:

• There’s a race condition where the real, healthy tedge-agent might send an “up” status before the watchdog processes the simulated unhealthy response. If the healthy message arrives first, the test might falsely pass, even though it hasn’t correctly handled the simulated failure scenario.

• To avoid this, the test needs to ensure that the simulated unhealthy message is processed before the actual healthy status message from the real tedge-agent.

Possible Issues in Testing:

• Flaky Tests: Current test setup may produce inconsistent results, as the timing of messages isn’t guaranteed to always happen in the correct order.

Solution Suggestions:

• The test might need to introduce delays or synchronization mechanisms to ensure the correct sequence of events. For example, introducing a slight delay before the real tedge-agent sends its healthy status could help.

• Alternatively, adding more robust checks or timestamps in the test would allow it to explicitly confirm the order of events and prevent accepting out-of-order messages.

Impact on Test Reliability:

• Race Conditions: If the race condition isn’t handled properly, the test could pass incorrectly because it might receive the healthy message too soon.

• Deterministic Timing: For the test to be reliable, it must ensure that the simulated failure is properly processed and reflected in the results before the actual healthy state is acknowledged.

Recommendation:

To improve the reliability of the test:

• Ensure that the simulated unhealthy message (with a higher timestamp) is processed by the watchdog before the real service has a chance to send its healthy status.

• Introduce specific synchronization points in the test where it waits for confirmation that the health check has been processed before allowing further messages to be sent.

• Consider adding explicit logging or timestamps to better track when each message is sent and processed, allowing the test to fail more meaningfully when the sequence is incorrect.

@Bravo555
Member Author

  1. Race Condition Risk:
  • There’s a race condition where the real, healthy tedge-agent might send an “up” status before the watchdog processes the simulated unhealthy response. If the healthy message arrives first, the test might falsely pass, even though it hasn’t correctly handled the simulated failure scenario.
  • To avoid this, the test needs to ensure that the simulated unhealthy message is processed before the actual healthy status message from the real tedge-agent.

Possible Issues in Testing:

  • Flaky Tests: Current test setup may produce inconsistent results, as the timing of messages isn’t guaranteed to always happen in the correct order.

Indeed, the current approach of letting a real service stay alive and send messages while the test tries to slip in some of its own messages is not very reliable. To fix this, the solution would be to change the service unit definition so that it doesn't start a real service but a mock that the script can control. I've started to work on this, but got distracted by some other tickets.
I'll use this approach and implement the recommendations from the quoted comment once I've completed other more pressing work.

@Bravo555 Bravo555 marked this pull request as draft October 18, 2024 09:53
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 67b2a49 to 958a3cc Compare October 18, 2024 14:40
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 18, 2024 14:40 — with GitHub Actions Inactive
@Bravo555 Bravo555 marked this pull request as ready for review October 18, 2024 14:47
@reubenmiller
Contributor

FYI: I've created a PR to track the decision about the tedge-watchdog feature

@codecov

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/core/tedge_watchdog/src/systemd_watchdog.rs | 50.00% | 2 Missing ⚠️ |


If a unix timestamp returned by the service is an integer (i.e. it does
not have a fractional, sub-second part), then the comparison checking
whether the response is newer than the health check request could fail.

Thus the comparison was changed to use 1-second precision. The case
where the timestamp is an integer now works, and it shouldn't produce
false positives because we don't send health check requests more often
than once per second.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
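The 1-second-precision comparison described in this commit message could be sketched as follows (illustrative, not the exact tedge-watchdog code):

```rust
// Compare request and response timestamps at whole-second precision:
// an integer response timestamp (no sub-second part) must not be
// rejected merely because the request carried a fractional part
// within the same second.
fn response_is_fresh(request_ts: f64, response_ts: f64) -> bool {
    response_ts.trunc() >= request_ts.trunc()
}

fn main() {
    // Request sent at 100.6s; service replies with integer timestamp 100:
    // same second, so the response is accepted.
    assert!(response_is_fresh(100.6, 100.0));
    assert!(response_is_fresh(100.6, 100.9));
    // A genuinely stale response from the previous second is rejected.
    assert!(!response_is_fresh(100.6, 99.0));
    println!("ok");
}
```

Since health check requests are sent no more than once per second, collapsing to whole seconds cannot confuse a response to the previous request with a fresh one.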
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from c76252b to e1a91e5 Compare October 21, 2024 15:04
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 21, 2024 15:04 — with GitHub Actions Inactive
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 11:49 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 80b61eb to 456e3e0 Compare October 22, 2024 13:07
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 13:07 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 456e3e0 to 53ccae9 Compare October 22, 2024 13:45
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 13:45 — with GitHub Actions Inactive

Transfer To Device ${CURDIR}/health_check_respond.sh /setup/

# Without this line mqtt-logger can't connect to listener at 1883, but with it it successfully connects to listener
Contributor


This is a side effect of skipping the bootstrap first (which probably results in the mqtt-logger connecting to the non-secure broker) and then bootstrapping the device with --no-bootstrap --no-connect, which configures the broker securely while mqtt-logger is not reconfigured/restarted to connect to it. If that's the case, I'd update the comment accordingly and even move this statement right after the bootstrap step. We probably also need to update the bootstrap logic to fix this, but not in this PR.

Member Author


Fixed cdca912

HEALTH_CHECK_TOPIC="te/device/main/service/$SERVICE_NAME/cmd/health/check"
HEALTH_STATUS_TOPIC="te/device/main/service/$SERVICE_NAME/status/health"

echo "Response: $RESPONSE"
Contributor


Suggested change
echo "Response: $RESPONSE"
# The $RESPONSE flag is used to control whether this service responds to healthcheck requests or not
echo "Response: $RESPONSE"

Just adding some docs. I'd have called it RESPOND though.

Member Author


fixed cdca912

Should Not Be Equal ${pid} ${pid1}

Watchdog doesn't fail on unexpected time format
Set Service health check response response=1
Contributor


Suggested change
Set Service health check response response=1
Set Service health check response response=0

When you're simulating the health responses on behalf of the service from this test, the service should be configured not to respond to the health checks by itself, right?

Member Author


You're right, as we're sending messages from inside the test, the service should not respond.


# Verify that tedge-watchdog is still running
${pid1} = Service Should Be Running tedge-watchdog
Should Be Equal ${pid} ${pid1}
Contributor


Shouldn't we also validate that the actual service itself is also restarted, when it keeps sending bad responses?

Member Author


Indeed, apart from checking that tedge-watchdog doesn't crash, we can also check that it still functions correctly.
Also, the test was buggy: the tedge-watchdog PID was used in a message from a service; that was fixed.
fixed cdca912

Contributor

@albinsuresh albinsuresh left a comment


LGTM

@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from cdca912 to 83a28c6 Compare October 24, 2024 09:02
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 24, 2024 09:03 — with GitHub Actions Inactive
@Bravo555 Bravo555 added this pull request to the merge queue Oct 24, 2024
Merged via the queue into thin-edge:main with commit ba97db0 Oct 24, 2024