
test: Create a tedge-watchdog test suite#3138

Merged
Bravo555 merged 4 commits into thin-edge:main from Bravo555:fix/watchdog-inspect-health-status
Oct 24, 2024

Conversation

@Bravo555
Member

Proposed changes

Created a test suite checking if tedge-watchdog correctly prevents services from being restarted if they're healthy and makes systemd restart them when they're unhealthy.

We do have tests for other services that utilise tedge-watchdog, which check that these services react to the watchdog's health check requests and otherwise integrate with it, but they test those other services, not the watchdog itself. As a separate component, tedge-watchdog very much needs a test suite of its own validating its behaviour, particularly because it doesn't get much attention and isn't extensively documented.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue


Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

This test suite was motivated by the issue that was found in tedge-watchdog where it didn't look at status before notifying systemd that service is healthy. See: #3132 (comment)

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or because of some other issue.

Now I'll try to establish why this happens; the implementation might end up being correct, but I still think it would be good to have a dedicated test suite for tedge-watchdog.

@Bravo555 Bravo555 added theme:monitoring Theme: Service monitoring and watchdogs theme:testing Theme: Testing labels Sep 25, 2024
@Bravo555 Bravo555 requested review from a team and gligorisaev as code owners September 25, 2024 14:09
@Bravo555 Bravo555 temporarily deployed to Test Pull Request September 25, 2024 14:09 — with GitHub Actions Inactive
@github-actions
Contributor

github-actions bot commented Sep 25, 2024

Robot Results

| ✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration |
| --- | --- | --- | --- | --- | --- |
| 513 | 0 | 2 | 513 | 100 | 1h26m53.498628999s |

@albinsuresh
Contributor

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or of some other issue.

I finally remember why it was written without an explicit status value check. If a service responds to the health check request, the watchdog implicitly assumes that the service is healthy, because it received a response at all. If the service were genuinely down, it wouldn't have sent any response, not even a down response: the down status message is published only once, by the MQTT broker, when the service disconnects. But tedge-watchdog does not rely on those retained messages; it actively sends a health check and awaits its response. That's why the whole scheme works even without any explicit status field check. However, that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown. So the logic needs to be updated to account for that.
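The scheme described above (trust the arrival of a response rather than retained status messages) can be sketched very roughly with a channel standing in for the MQTT request/response pair. Names here are illustrative, not the actual tedge-watchdog code:

```rust
use std::sync::mpsc;
use std::time::Duration;

// A dead service sends nothing, so a timeout, not a retained "down"
// message, is what signals genuine unavailability to the watchdog.
fn service_is_responsive(rx: &mpsc::Receiver<String>) -> bool {
    rx.recv_timeout(Duration::from_millis(100)).is_ok()
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Healthy service: it replies to the health check.
    tx.send(r#"{"status":"up"}"#.to_string()).unwrap();
    assert!(service_is_responsive(&rx));

    // Dead service: no reply ever arrives, the wait fails.
    drop(tx);
    assert!(!service_is_responsive(&rx));
    println!("ok");
}
```

This is also why the missing status check went unnoticed for so long: any arriving response counted as proof of life.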

@didier-wenzek
Contributor

But that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown.

What's the purpose of this unknown status?

@albinsuresh
Contributor

But that logic was written with only the up and down statuses in mind and does not account for newly added ones like unknown.

What's the purpose of this unknown status?

To cover buggy clients that report "some" health status messages on those specific topics, but not in the desired format. Interpreting such erroneous messages as down would be wrong, as something is still there responding to health requests, meaning it's not really "down". So we want to differentiate such components from those that are genuinely down. This might help the user fix the erroneous health status reporting of their component.

@Bravo555
Member Author

Bravo555 commented Sep 26, 2024

But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or of some other issue.

Found the problem that caused tedge-watchdog to disconnect. It wasn't due to "status": "down", but due to the "time" field. This is the payload sent in the test:

Execute Command
...    tedge mqtt pub 'te/device/main/service/${SERVICE_NAME}/status/health' '{"status": "down", "pid": ${pid}, "time": "${before_restart_timestamp}"}'

time is a string, so we expect it to be in ISO format, but it's actually a UNIX timestamp instead. This returns an error from the get_latest_health_status_message function and exits the watchdog for the service. We should probably ignore invalid messages instead of exiting the watchdog for the service.

Changing time from a string to a float in the test (i.e. removing the quotes) makes the test fail as expected.


If a service responds to the health check request, the watchdog implicitly assumes that the service is healthy, as it received a response. If the service was genuinely down, it wouldn't have gotten any response, not even the down response

But, that logic was written with only up and down statuses considered, and not account for newly added ones like unknown. So, the logic needs to be updated to account for that.

So it seems to me there are 2 TODOs at this moment:

  • change deserialization logic so watchdog doesn't exit if it fails to deserialize time
  • update the logic so that we notify systemd only on "status": "up"; down/unknown are treated as unhealthy and we don't send notifications for them
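As a rough sketch of what those two TODOs could look like (hypothetical names; the real implementation deserializes with serde and would parse ISO-8601 times):

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Up,
    Down,
    Unknown,
}

// Malformed or unrecognized status values map to Unknown instead of
// being treated as fatal parse errors.
fn parse_status(raw: &str) -> Status {
    match raw {
        "up" => Status::Up,
        "down" => Status::Down,
        _ => Status::Unknown,
    }
}

// TODO 2: notify systemd only for an explicit "up"; down/unknown (and
// no response at all) are unhealthy, so systemd restarts the service.
fn should_notify_systemd(response: Option<Status>) -> bool {
    matches!(response, Some(Status::Up))
}

// TODO 1: a bad `time` field should invalidate one message, not exit
// the per-service watchdog task. Returning None lets the caller skip
// the message and keep waiting for a valid response.
fn parse_time(raw: &str) -> Option<f64> {
    // Accept a plain unix timestamp; an ISO-8601 parser would go here too.
    raw.parse::<f64>().ok()
}

fn main() {
    assert!(should_notify_systemd(Some(parse_status("up"))));
    assert!(!should_notify_systemd(Some(parse_status("down"))));
    assert!(!should_notify_systemd(Some(parse_status("not-a-status"))));
    assert!(!should_notify_systemd(None)); // no response: unhealthy
    assert_eq!(parse_time("1729519200"), Some(1729519200.0));
    assert_eq!(parse_time("2024-10-21T15:04:00Z"), None); // skipped, not fatal
    println!("ok");
}
```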

Comment on lines +38 to +39
Execute Command
... tedge mqtt pub 'te/device/main/service/${SERVICE_NAME}/status/health' '{"status": "down", "pid": ${pid}, "time": "${before_restart_timestamp}"}'
Contributor


This may not be fully deterministic as the watchdog would be waiting for a response after it sends the health check (with a later timestamp) and just ignores other random health messages that it receives with a lower timestamp. So, we need to make sure that this response reaches the watchdog after the health check request is received, with a higher timestamp, and also ensure that it reaches before the real running tedge-agent responds with its up status.

Contributor


A way to disconnect tedge-watchdog from the MQTT topics used by the agent and the mapper (for testing purposes) is to configure tedge-watchdog with a fake topic root (say test-te).

Contributor

@gligorisaev gligorisaev left a comment


Finding

  1. Non-deterministic Behavior:

• The watchdog sends a health check request and expects a response with a higher timestamp (indicating a more recent status). However, since the watchdog may receive multiple health messages (both valid and outdated), it must be able to discern between these messages.

• The risk is that the watchdog might ignore random health messages with lower timestamps that arrive after its health check, potentially leading to inconsistencies in how it processes the health state of the service.

  2. Ensuring the Right Order of Events:

• To maintain determinism, the test should ensure that the watchdog receives the response to its health check request in the correct sequence:

• First, the watchdog sends a health check.
• Second, it receives a response with a timestamp that is newer than the health check request.
• The watchdog should not accept outdated health messages with lower timestamps that may incorrectly reflect the state of the service.

  3. Race Condition Risk:

• There’s a race condition where the real, healthy tedge-agent might send an “up” status before the watchdog processes the simulated unhealthy response. If the healthy message arrives first, the test might falsely pass, even though it hasn’t correctly handled the simulated failure scenario.

• To avoid this, the test needs to ensure that the simulated unhealthy message is processed before the actual healthy status message from the real tedge-agent.

Possible Issues in Testing:

• Flaky Tests: Current test setup may produce inconsistent results, as the timing of messages isn’t guaranteed to always happen in the correct order.

Solution Suggestions:

• The test might need to introduce delays or synchronization mechanisms to ensure the correct sequence of events. For example, introducing a slight delay before the real tedge-agent sends its healthy status could help.

• Alternatively, adding more robust checks or timestamps in the test would allow it to explicitly confirm the order of events and prevent accepting out-of-order messages.

Impact on Test Reliability:

• Race Conditions: If the race condition isn’t handled properly, the test could pass incorrectly because it might receive the healthy message too soon.

• Deterministic Timing: For the test to be reliable, it must ensure that the simulated failure is properly processed and reflected in the results before the actual healthy state is acknowledged.

Recommendation:

To improve the reliability of the test:

• Ensure that the simulated unhealthy message (with a higher timestamp) is processed by the watchdog before the real service has a chance to send its healthy status.

• Introduce specific synchronization points in the test where it waits for confirmation that the health check has been processed before allowing further messages to be sent.

• Consider adding explicit logging or timestamps to better track when each message is sent and processed, allowing the test to fail more meaningfully when the sequence is incorrect.

@Bravo555
Member Author

  1. Race Condition Risk:
  • There’s a race condition where the real, healthy tedge-agent might send an “up” status before the watchdog processes the simulated unhealthy response. If the healthy message arrives first, the test might falsely pass, even though it hasn’t correctly handled the simulated failure scenario.
  • To avoid this, the test needs to ensure that the simulated unhealthy message is processed before the actual healthy status message from the real tedge-agent.

Possible Issues in Testing:

  • Flaky Tests: Current test setup may produce inconsistent results, as the timing of messages isn’t guaranteed to always happen in the correct order.

Indeed, the current approach of letting a real service stay alive and send messages while the test tries to slip in some of its own messages is not very reliable. To fix this, the solution would be to change the service unit definition so that it doesn't start a real service but a mock that the script can control. I've started to work on this, but got distracted by some other tickets.
I'll use this approach and implement the recommendations from the quoted comment once I've completed other more pressing work.

@Bravo555 Bravo555 marked this pull request as draft October 18, 2024 09:53
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 67b2a49 to 958a3cc Compare October 18, 2024 14:40
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 18, 2024 14:40 — with GitHub Actions Inactive
@Bravo555 Bravo555 marked this pull request as ready for review October 18, 2024 14:47
@reubenmiller
Contributor

FYI: I've created a PR to track the decision about the tedge-watchdog feature

@codecov

codecov bot commented Oct 21, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/core/tedge_watchdog/src/systemd_watchdog.rs | 50.00% | 2 Missing ⚠️ |


If a unix timestamp returned by the service is an integer (i.e. it does
not have a fractional, sub-second part), then the comparison checking
whether the response is newer than the health check request could fail.

Thus the comparison was changed to use 1-second precision. The case
where the timestamp is an integer now works, and it shouldn't produce
false positives because we don't send health check requests more often
than once per second.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
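The 1-second-precision comparison described in this commit message could be sketched as follows (illustrative, not the exact tedge-watchdog code):

```rust
// Compare request and response timestamps at whole-second precision:
// an integer response timestamp (no sub-second part) must not be
// rejected merely because the request carried a fractional part
// within the same second.
fn response_is_fresh(request_ts: f64, response_ts: f64) -> bool {
    response_ts.trunc() >= request_ts.trunc()
}

fn main() {
    // Request sent at 100.6s; service replies with integer timestamp 100:
    // same second, so the response is accepted.
    assert!(response_is_fresh(100.6, 100.0));
    assert!(response_is_fresh(100.6, 100.9));
    // A genuinely stale response from the previous second is rejected.
    assert!(!response_is_fresh(100.6, 99.0));
    println!("ok");
}
```

Since health check requests are sent no more than once per second, collapsing to whole seconds cannot confuse a response to the previous request with a fresh one.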
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from c76252b to e1a91e5 Compare October 21, 2024 15:04
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 21, 2024 15:04 — with GitHub Actions Inactive
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 11:49 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 80b61eb to 456e3e0 Compare October 22, 2024 13:07
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 13:07 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from 456e3e0 to 53ccae9 Compare October 22, 2024 13:45
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 22, 2024 13:45 — with GitHub Actions Inactive

Transfer To Device ${CURDIR}/health_check_respond.sh /setup/

# Without this line mqtt-logger can't connect to listener at 1883, but with it it successfully connects to listener
Contributor


This is a side effect of skipping the bootstrap first (which probably results in the mqtt-logger connecting to the non-secure broker) and then bootstrapping the device with --no-bootstrap --no-connect, which configures the broker securely while mqtt-logger is not reconfigured/restarted to connect to it. If that's the case, I'd update the comment accordingly and even move this statement right after the bootstrap step. We probably also need to update the bootstrap logic to fix this, but not in this PR.

Member Author


Fixed cdca912

HEALTH_CHECK_TOPIC="te/device/main/service/$SERVICE_NAME/cmd/health/check"
HEALTH_STATUS_TOPIC="te/device/main/service/$SERVICE_NAME/status/health"

echo "Response: $RESPONSE"
Contributor


Suggested change
echo "Response: $RESPONSE"
# The $RESPONSE flag is used to control whether this service responds to healthcheck requests or not
echo "Response: $RESPONSE"

Just adding some docs. I'd have called it RESPOND though.

Member Author


fixed cdca912

Should Not Be Equal ${pid} ${pid1}

Watchdog doesn't fail on unexpected time format
Set Service health check response response=1
Contributor


Suggested change
Set Service health check response response=1
Set Service health check response response=0

When you're simulating the health responses on behalf of the service from this test, the service should be configured not to respond to the health checks by itself, right?

Member Author


You're right, as we're sending messages from inside the test, the service should not respond.


# Verify that tedge-watchdog is still running
${pid1} = Service Should Be Running tedge-watchdog
Should Be Equal ${pid} ${pid1}
Contributor


Shouldn't we also validate that the actual service itself is also restarted, when it keeps sending bad responses?

Member Author


Indeed, apart from checking that tedge-watchdog doesn't crash, we can also check that it still functions correctly.
Also, the test was buggy: the tedge-watchdog PID was used in a message from a service; that was fixed.
fixed cdca912

Contributor

@albinsuresh albinsuresh left a comment


LGTM

@Bravo555 Bravo555 force-pushed the fix/watchdog-inspect-health-status branch from cdca912 to 83a28c6 Compare October 24, 2024 09:02
@Bravo555 Bravo555 temporarily deployed to Test Pull Request October 24, 2024 09:03 — with GitHub Actions Inactive
@Bravo555 Bravo555 added this pull request to the merge queue Oct 24, 2024
Merged via the queue into thin-edge:main with commit ba97db0 Oct 24, 2024