test: Create a tedge-watchdog test suite #3138
Conversation
I finally remember why it was written without an explicit
What's the purpose of this
To cover buggy clients that are reporting "some" health status messages on those specific topics, but not in the desired format. Interpreting such erroneous messages as
Found the problem that caused
Changing
So it seems to me there are 2 TODOs at this moment:

    Execute Command
    ...    tedge mqtt pub 'te/device/main/service/${SERVICE_NAME}/status/health' '{"status": "down", "pid": ${pid}, "time": "${before_restart_timestamp}"}'
This may not be fully deterministic as the watchdog would be waiting for a response after it sends the health check (with a later timestamp) and just ignores other random health messages that it receives with a lower timestamp. So, we need to make sure that this response reaches the watchdog after the health check request is received, with a higher timestamp, and also ensure that it reaches before the real running tedge-agent responds with its up status.
A way to disconnect tedge-watchdog from the MQTT topics used by the agent and the mapper (for testing purposes) is to configure tedge-watchdog with a fake topic root (say test-te).
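For instance, assuming the standard thin-edge `mqtt.topic_root` setting is what controls this prefix (an assumption; the watchdog may need its own setting), the isolation could look like:

```shell
# Assumption: mqtt.topic_root is the tedge config key controlling the MQTT
# topic prefix (default "te"). Pointing the watchdog under test at a fake
# root keeps it from seeing the real agent/mapper health topics.
tedge config set mqtt.topic_root test-te
```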
Findings:
- Non-deterministic Behavior:
• The watchdog sends a health check request and expects a response with a higher timestamp (indicating a more recent status). However, since the watchdog may receive multiple health messages (both valid and outdated), it must be able to discern between these messages.
• The risk is that the watchdog might ignore random health messages with lower timestamps that arrive after its health check, potentially leading to inconsistencies in how it processes the health state of the service.
- Ensuring the Right Order of Events:
• To maintain determinism, the test should ensure that the watchdog receives the response to its health check request in the correct sequence:
• First, the watchdog sends a health check.
• Second, it receives a response with a timestamp that is newer than the health check request.
• The watchdog should not accept outdated health messages with lower timestamps that may incorrectly reflect the state of the service.
- Race Condition Risk:
• There’s a race condition where the real, healthy tedge-agent might send an “up” status before the watchdog processes the simulated unhealthy response. If the healthy message arrives first, the test might falsely pass, even though it hasn’t correctly handled the simulated failure scenario.
• To avoid this, the test needs to ensure that the simulated unhealthy message is processed before the actual healthy status message from the real tedge-agent.
Possible Issues in Testing:
• Flaky Tests: Current test setup may produce inconsistent results, as the timing of messages isn’t guaranteed to always happen in the correct order.
Solution Suggestions:
• The test might need to introduce delays or synchronization mechanisms to ensure the correct sequence of events. For example, introducing a slight delay before the real tedge-agent sends its healthy status could help.
• Alternatively, adding more robust checks or timestamps in the test would allow it to explicitly confirm the order of events and prevent accepting out-of-order messages.
Impact on Test Reliability:
• Race Conditions: If the race condition isn’t handled properly, the test could pass incorrectly because it might receive the healthy message too soon.
• Deterministic Timing: For the test to be reliable, it must ensure that the simulated failure is properly processed and reflected in the results before the actual healthy state is acknowledged.
Recommendation:
To improve the reliability of the test:
• It should be ensured that the simulated unhealthy message (with a higher timestamp) is processed by the watchdog before the real service has a chance to send its healthy status.
• Introduce specific synchronization points in the test, where it waits for confirmation that the health check has been processed before allowing further messages to be sent.
• Consider adding explicit logging or timestamps to better track when each message is sent and processed, allowing the test to fail more meaningfully when the sequence is incorrect.
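The acceptance rule discussed above can be sketched as a tiny helper (a hypothetical sketch, not tedge-watchdog's actual code): a health response is accepted only if its timestamp is not older than the health check request's, so stale messages published before the check are ignored.

```shell
#!/bin/sh
# Hypothetical sketch of the freshness rule; not tedge-watchdog's real code.
# A response is accepted only if its timestamp is >= the request timestamp,
# so stale health messages that predate the check are ignored.
is_fresh() {
    request_ts="$1"
    response_ts="$2"
    # awk handles fractional unix timestamps; exit status 0 means "accept"
    awk -v req="$request_ts" -v resp="$response_ts" \
        'BEGIN { exit !(resp + 0 >= req + 0) }'
}
```

Under this rule, a simulated "down" response must carry a timestamp newer than the watchdog's request, and must reach the watchdog before the real agent's "up" response, for the test to be deterministic.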
Indeed, the current approach of letting a real service stay alive and send messages while the test tries to slip in some of its own messages is not very reliable. To fix this, the solution would be to change the service unit definition so that it doesn't start a real service, but a mock that the script can control. I've started to work on this, but got distracted by some other tickets.
Force-pushed from 67b2a49 to 958a3cc
FYI: I've created a PR to track the decision about the tedge-watchdog feature
Created a test suite checking if tedge-watchdog correctly prevents services from being restarted if they're healthy and makes systemd restart them when they're unhealthy. We do have tests for other services that utilise tedge-watchdog which check if these services react to watchdog's health check requests and otherwise integrate with it, but they test these other services, not the watchdog itself. As a separate component, tedge-watchdog very much needs a test suite to itself validating its behaviour, particularly because it doesn't get much attention and isn't extensively documented.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
Force-pushed from 958a3cc to 7a04f63
If a unix timestamp returned by the service is an integer (i.e. it does not have a fractional, sub-second part), then the comparison checking whether the response is newer than the health check request could fail. Thus the comparison was changed to use 1-second precision. The case when the timestamp is an integer now works, and it shouldn't return false positives because we don't send health check requests more often than once per second.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
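The 1-second-precision comparison this commit describes can be sketched roughly like this (hypothetical helper names; the actual implementation lives in tedge-watchdog's Rust code):

```shell
#!/bin/sh
# Hypothetical sketch of the 1-second-precision comparison; the real
# implementation is in tedge-watchdog's Rust code.
# Truncate the fractional part so integer timestamps compare correctly.
to_seconds() {
    printf '%s' "${1%.*}"
}

# The response counts as newer-or-equal if its whole-second part is >=
# the request's whole-second part.
is_newer_or_equal() {
    [ "$(to_seconds "$1")" -ge "$(to_seconds "$2")" ]
}
```

With this, an integer timestamp like 1700000000 compares equal to 1700000000.0 instead of failing, and because health checks are sent no more often than once per second, the coarser precision shouldn't produce false positives.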
Force-pushed from c76252b to e1a91e5
Force-pushed from 80b61eb to 456e3e0
Force-pushed from 456e3e0 to 53ccae9
    Transfer To Device    ${CURDIR}/health_check_respond.sh    /setup/

    # Without this line mqtt-logger can't connect to listener at 1883, but with it it successfully connects to listener
This is a side effect of skipping the bootstrap first (which probably results in mqtt-logger connecting to the non-secure broker) and then bootstrapping the device with --no-bootstrap --no-connect, which configures the broker securely while mqtt-logger is not reconfigured/restarted to match. If that's the case, I'd update the comment accordingly and even move this statement right after the bootstrap step. We probably need to update the bootstrap logic as well to fix this, but not in this PR.
    HEALTH_CHECK_TOPIC="te/device/main/service/$SERVICE_NAME/cmd/health/check"
    HEALTH_STATUS_TOPIC="te/device/main/service/$SERVICE_NAME/status/health"

    echo "Response: $RESPONSE"
Suggested change:

    # The $RESPONSE flag is used to control whether this service responds to healthcheck requests or not
    echo "Response: $RESPONSE"
Just adding some docs. I'd have called it RESPOND though.
    Should Not Be Equal    ${pid}    ${pid1}

    Watchdog doesn't fail on unexpected time format
        Set Service health check response    response=1
Suggested change:

    - Set Service health check response    response=1
    + Set Service health check response    response=0
When you're simulating the health responses on behalf of the service from this test, the service should be configured not to respond to the health checks by itself, right?
You're right, as we're sending messages from inside the test, the service should not respond.
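A minimal sketch of how such a mock service could gate its responses on a flag (hypothetical; the PR's actual health_check_respond.sh may differ):

```shell
#!/bin/sh
# Hypothetical mock-service sketch; the PR's health_check_respond.sh may differ.
# RESPONSE=1 -> the mock answers health checks itself;
# RESPONSE=0 -> it stays silent so the test can publish on its behalf.
RESPONSE="${RESPONSE:-1}"

# Build the JSON health payload the watchdog expects: status, pid, time.
build_health_payload() {
    printf '{"status":"%s","pid":%s,"time":%s}' "$1" "$2" "$3"
}

# Called whenever a health check request arrives on the health-check topic.
handle_health_check() {
    if [ "$RESPONSE" = "1" ]; then
        # in the real script this payload would be published to the
        # service's health status topic
        build_health_payload up "$$" "$(date +%s.%N)"
    fi
    # with RESPONSE=0, nothing is emitted and the test stays in control
}
```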
    # Verify that tedge-watchdog is still running
    ${pid1} =    Service Should Be Running    tedge-watchdog
    Should Be Equal    ${pid}    ${pid1}
Shouldn't we also validate that the actual service itself is also restarted, when it keeps sending bad responses?
Indeed, apart from checking that tedge-watchdog doesn't crash, we can also check if it still functions correctly.
Also, the test was buggy: the tedge-watchdog PID was used in a message from a service; this was fixed.
Fixed in cdca912
Force-pushed from cdca912 to 83a28c6
Proposed changes
Created a test suite checking if tedge-watchdog correctly prevents services from being restarted if they're healthy and makes systemd restart them when they're unhealthy.
We do have tests for other services that utilise tedge-watchdog which check if these services react to watchdog's health check requests and otherwise integrate with it, but they test these other services, not the watchdog itself. As a separate component, tedge-watchdog very much needs a test suite to itself validating its behaviour, particularly because it doesn't get much attention and isn't extensively documented.
Types of changes

Paste Link to the issue

Checklist
- cargo fmt as mentioned in CODING_GUIDELINES
- cargo clippy as mentioned in CODING_GUIDELINES

Further comments
This test suite was motivated by the issue that was found in tedge-watchdog where it didn't look at the status before notifying systemd that the service is healthy. See: #3132 (comment)
But as explained in the comment in the first test, despite this issue, the services for which "status":"down" is published are being restarted as we'd expect, either because of some intended trickery in the implementation or because of some other issue. Now I'll try to establish why it fails, so the implementation might end up being correct, but still I think it would be good to have a dedicated test suite for tedge-watchdog.