fix: agent leaves interrupted operations in non-final state by didier-wenzek · Pull Request #3210 · thin-edge/thin-edge.io

didier-wenzek · 2024-10-28T17:42:10Z

Proposed changes

Reproduce tedge-agent can leave an operation in a non-final state if the service is restarted #3149
Correctly emit over MQTT the state of commands interrupted by a restart.
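
To illustrate the second point, here is a minimal sketch of republishing the latest persisted state of each pending command at startup. It is not the code of this PR: `CommandState`, `StatePublisher`, `InMemoryBroker`, `republish_pending` and the topic names are hypothetical stand-ins, and the in-memory broker only mimics retained-message behaviour (last write per topic wins).

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a persisted command state (not the real thin-edge type).
#[derive(Clone, Debug, PartialEq)]
pub struct CommandState {
    pub topic: String,
    pub status: String,
}

/// Hypothetical publisher abstraction, so the sketch does not depend on an MQTT client.
pub trait StatePublisher {
    fn publish_retained(&mut self, topic: &str, payload: &str);
}

/// In-memory stand-in for a broker's retained-message store: last write per topic wins.
#[derive(Default)]
pub struct InMemoryBroker {
    pub retained: HashMap<String, String>,
    pub publish_count: usize,
}

impl StatePublisher for InMemoryBroker {
    fn publish_retained(&mut self, topic: &str, payload: &str) {
        self.retained.insert(topic.to_string(), payload.to_string());
        self.publish_count += 1;
    }
}

/// On startup, republish the latest persisted state of every pending command,
/// whether that state is intermediate or terminal.
pub fn republish_pending<P: StatePublisher>(publisher: &mut P, pending: &[CommandState]) {
    for command in pending {
        let payload = format!(r#"{{"status":"{}"}}"#, command.status);
        publisher.publish_retained(&command.topic, &payload);
    }
}
```

If the broker still retains an `executing` status while the persisted state is already `successful`, calling `republish_pending` overwrites the stale retained message, so subscribers see the state the agent actually reached before it was restarted.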

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
Documentation Update (if none of the other choices apply)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Reproduce #3149

Checklist

I have read the CONTRIBUTING doc
I have signed the CLA (in all commits with git commit -s)
I ran cargo fmt as mentioned in CODING_GUIDELINES
I used cargo clippy as mentioned in CODING_GUIDELINES
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)

Further comments

github-actions · 2024-10-28T18:06:35Z

Robot Results

✅ Passed	❌ Failed	⏭️ Skipped	Total	Pass %	⏱️ Duration
529	0	2	529	100	1h34m39.170237s

codecov · 2024-10-29T10:06:46Z

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../core/tedge_agent/src/operation_workflows/actor.rs	0.00%	3 Missing ⚠️

Additional details and impacted files

📢 Thoughts on this report? Let us know!

didier-wenzek · 2024-10-29T13:22:05Z

The test "Update tedge version from base to current using Cumulocity" failed for unrelated reasons. Not obviously flaky though:

$ invoke flake-finder --test-name "Update tedge version from base to current using Cumulocity" --iterations 10 --outputdir output_ff --clean
------------------------------
Overall: PASSED
Results: 10 iterations, 10 passed, 0 failed
Elapsed time: 0:10:42.849008

albinsuresh · 2024-10-30T06:26:04Z

crates/core/tedge_agent/src/operation_workflows/actor.rs

                    .load_pending_commands(pending_commands)
                {
+                    // Make sure the latest state is visible over MQTT
+                    self.mqtt_publisher


Should we limit this republishing only for commands in their terminal state? If not, there's the risk of even the intermediate command states like scheduled or executing getting resent, risking duplicate execution of the same logic by any external component listening on them.

reubenmiller · Oct 30, 2024

There isn't any official contract that says these messages will be published exactly once, though. Generally such listeners are not doing control (as control logic should be in the workflow itself), so triggering twice on a status change should not be a problem.

reubenmiller · Oct 30, 2024

Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

didier-wenzek · Oct 30, 2024

> Should we limit this republishing only for commands in their terminal state?

This would introduce the same issue for intermediate states (e.g. the "executing" state never being published).

> If not, there's the risk of even the intermediate command states like scheduled or executing getting resent, risking duplicate execution of the same logic by any external component listening on them.

The agent protects itself from reacting to messages that it published itself.

Other components, such as the mappers, should be prepared to receive a status twice.

> Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

A restart is also enough, since the messages are retained. If the c8y mapper stops after sending an EXECUTING 501 message and restarts before the command is successful, a second EXECUTING 501 will be sent.

albinsuresh · Oct 31, 2024

> Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

I have always assumed that users would send critical messages like commands only with QoS 2, as they might expect exactly-once execution for such things. But if users are already expected to be resilient to duplicate delivery, especially since most cloud platforms don't support QoS 2 either, I guess this is okay.
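
The conclusion of this thread, that listeners should tolerate receiving the same status more than once, can be sketched as follows. This is an illustrative example only, not thin-edge code: `StatusListener`, `on_status` and the topic names are hypothetical, and the deduplication state is kept in memory, so a listener that itself restarts would need a durable store to get the same guarantee.

```rust
use std::collections::HashMap;

/// Remembers the last status seen per command topic, so that a repeated
/// delivery (QoS 1 redelivery, retained message, republish) is ignored.
#[derive(Default)]
pub struct StatusListener {
    last_seen: HashMap<String, String>,
}

impl StatusListener {
    /// Returns true when the status is new for this command and should be acted upon,
    /// false when it repeats the last status seen on that topic.
    pub fn on_status(&mut self, topic: &str, status: &str) -> bool {
        let is_repeat = self.last_seen.get(topic).map(String::as_str) == Some(status);
        if !is_repeat {
            self.last_seen.insert(topic.to_string(), status.to_string());
        }
        !is_repeat
    }
}
```

A caller would run its side effect only when `on_status` returns true, which makes a second `executing` message for the same command a no-op while still reacting to the later `successful` or `failed` status.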

albinsuresh

LGTM

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek temporarily deployed to Test Pull Request October 28, 2024 17:42 — with GitHub Actions Inactive

didier-wenzek had a problem deploying to Test Auto October 28, 2024 17:48 — with GitHub Actions Failure

didier-wenzek added the bug (Something isn't working) and theme:workflows (Theme: Workflow engine topics) labels Oct 28, 2024

didier-wenzek changed the title ~~fix:~~ fix: agent leaves interrupted operations in non-final state Oct 28, 2024

didier-wenzek temporarily deployed to Test Pull Request October 29, 2024 09:55 — with GitHub Actions Inactive

didier-wenzek marked this pull request as ready for review October 29, 2024 09:56

didier-wenzek requested review from a team, albinsuresh, gligorisaev, jarhodes314 and rina23q as code owners October 29, 2024 09:56

didier-wenzek had a problem deploying to Test Auto October 29, 2024 10:00 — with GitHub Actions Failure

didier-wenzek temporarily deployed to Test Auto October 29, 2024 13:22 — with GitHub Actions Inactive

albinsuresh reviewed Oct 30, 2024

albinsuresh approved these changes Oct 31, 2024

didier-wenzek added 2 commits November 4, 2024 10:00

Reproduce thin-edge#3149

8f6f960

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

Make sure the latest command state is visible over MQT

24f870f

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>

didier-wenzek force-pushed the fix/agent-leaves-operation-in-non-final-state branch from ebe374d to 24f870f November 4, 2024 09:03

didier-wenzek temporarily deployed to Test Pull Request November 4, 2024 09:03 — with GitHub Actions Inactive

didier-wenzek temporarily deployed to Test Auto November 4, 2024 09:08 — with GitHub Actions Inactive

didier-wenzek added this pull request to the merge queue Nov 4, 2024

Merged via the queue into thin-edge:main with commit 80e0abd Nov 4, 2024

didier-wenzek deleted the fix/agent-leaves-operation-in-non-final-state branch November 4, 2024 09:57

reubenmiller added this to the 1.4.0 milestone Dec 5, 2024
