Skip to content

fix: agent leaves interrupted operations in non-final state #3210

Merged
didier-wenzek merged 2 commits intothin-edge:mainfrom
didier-wenzek:fix/agent-leaves-operation-in-non-final-state
Nov 4, 2024
Merged

fix: agent leaves interrupted operations in non-final state #3210
didier-wenzek merged 2 commits intothin-edge:mainfrom
didier-wenzek:fix/agent-leaves-operation-in-non-final-state

Conversation

@didier-wenzek
Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek commented Oct 28, 2024

Proposed changes

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Reproduce #3149

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@didier-wenzek didier-wenzek added bug Something isn't working theme:workflows Theme: Workflow engine topics labels Oct 28, 2024
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 28, 2024

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
529 0 2 529 100 1h34m39.170237s

@didier-wenzek didier-wenzek changed the title fix: fix: agent leaves interrupted operations in non-final state Oct 28, 2024
@didier-wenzek didier-wenzek marked this pull request as ready for review October 29, 2024 09:56
@codecov
Copy link
Copy Markdown

codecov bot commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../core/tedge_agent/src/operation_workflows/actor.rs 0.00% 3 Missing ⚠️
Additional details and impacted files

📢 Thoughts on this report? Let us know!

@didier-wenzek
Copy link
Copy Markdown
Contributor Author

The test "Update tedge version from base to current using Cumulocity" failed for unrelated reasons. Not obviously flaky though:

$ invoke flake-finder --test-name "Update tedge version from base to current using Cumulocity" --iterations 10 --outputdir output_ff --clean
------------------------------
Overall: PASSED
Results: 10 iterations, 10 passed, 0 failed
Elapsed time: 0:10:42.849008

.load_pending_commands(pending_commands)
{
// Make sure the latest state is visible over MQTT
self.mqtt_publisher
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we limit this republishing only for commands in their terminal state? If not, there's the risk of the even the intermediate command states like scheduled or executing getting resent, risking duplicate execution of the same logic by any external component listening on them.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though there isn't any official contract that says these messages will only be published exactly once. But generally such listeners are not doing control (as control logic should be in the workflow itself), so triggering twice on a status change should not be a problem.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we limit this republishing only for commands in their terminal state?

This would introduce the same issue on intermediate states (.e.g the "executing" state being never published).

If not, there's the risk of the even the intermediate command states like scheduled or executing getting resent, risking duplicate execution of the same logic by any external component listening on them.

The agent protects itself from reacting to messages of which it is the publisher.

Other components, such as the mappers, should be prepared to receive a status twice.

Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

It's also enough to restart, the messages being retained. If the c8y mapper stop after sending an EXECUTING 501 message and restart before the command is successful, a second EXECUTING 501 will be sent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just using qos 1 is enough to receive a duplicate message (without publishing the status change twice).

I have always assumed that users would send critical messages like commands only with QoS 2, as they might expect once-and-only-once execution for such things. But yeah, if the users are already expected to be resilient to duplicate delivery, esp since most cloud platforms don't support QoS 2 either, I guess this is okay.

Copy link
Copy Markdown
Contributor

@albinsuresh albinsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
Signed-off-by: Didier Wenzek <didier.wenzek@free.fr>
@didier-wenzek didier-wenzek force-pushed the fix/agent-leaves-operation-in-non-final-state branch from ebe374d to 24f870f Compare November 4, 2024 09:03
@didier-wenzek didier-wenzek added this pull request to the merge queue Nov 4, 2024
Merged via the queue into thin-edge:main with commit 80e0abd Nov 4, 2024
@didier-wenzek didier-wenzek deleted the fix/agent-leaves-operation-in-non-final-state branch November 4, 2024 09:57
@reubenmiller reubenmiller added this to the 1.4.0 milestone Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working theme:workflows Theme: Workflow engine topics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants