fix: de-duplicate operations received in quick succession from c8y/devicecontrol/notifications#3454
Conversation
albinsuresh left a comment:
One issue with this solution is an (admittedly very slow) memory leak caused by the mapper storing the operation IDs indefinitely. Ideally, we should delete these at some point where we know the operation is no longer marked as pending, though I'm not sure how to do this.
Although clearing this entry when the operation transitions to the successful/failed states is good enough for the majority of cases, I understand that it is not a fool-proof solution, as there is still the possibility of getting a duplicate message while these state-transition messages are still in transit to the cloud (or buffered for processing, either locally or on the cloud). But since the terminal state transitions usually happen after the executing transition has already happened (which changes the PENDING status of the op in the cloud), the risk is reduced even further, although not fully eliminated.
I'm in favour of risking a duplicate operation execution in such rare cases (where the duplicate is delivered even after the terminal state transition is published), compared to the risk of that slow memory leak.
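The trade-off discussed above can be sketched in a few lines. This is a hypothetical illustration (the `Deduplicator` type and method names are invented for this sketch, not the actual mapper code): operation IDs are remembered on first sight and evicted once the operation reaches a terminal state, so the set does not grow forever at the cost of the rare late-duplicate window described above.

```rust
use std::collections::HashSet;

/// Hypothetical sketch of ID-based de-duplication with eviction on
/// terminal states; names are illustrative, not the real mapper code.
struct Deduplicator {
    processed_ids: HashSet<String>,
}

impl Deduplicator {
    fn new() -> Self {
        Self { processed_ids: HashSet::new() }
    }

    /// Returns true if this operation ID has not been seen before,
    /// remembering it as a side effect.
    fn should_process(&mut self, op_id: &str) -> bool {
        self.processed_ids.insert(op_id.to_string())
    }

    /// Called when the operation transitions to successful/failed:
    /// drop the entry so the set cannot leak memory indefinitely.
    fn on_terminal_state(&mut self, op_id: &str) {
        self.processed_ids.remove(op_id);
    }
}
```

Note that a duplicate delivered after `on_terminal_state` would be processed again, which is exactly the residual risk accepted above.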
supported_operations: SupportedOperations,
pub operation_handler: OperationHandler,

processed_ids: HashSet<String>,
Could we not re-use the active_commands set? Or you avoided that because the entries in that set are cleared on the terminal state transition of those operations and you didn't want these entries cleared so soon?
Looking at the doc comment for active_commands, this indeed sounds like the correct place for that fix.
However, this raises another point: why do we have this issue with duplicated commands when there is already a mechanism supposed to handle it?
I think active_commands is a broken solution to the problem. I think there we are de-duplicating c8y/devicecontrol/notifications, but we are only tracking an active command once we receive the relevant tedge-topic command message, which will happen a short while later. I think this is what leaves open the window for an operation to be duplicated.
Assuming I've understood correctly, that would indicate that the problem could simply be solved by moving the active_commands insertion to where I'm currently inserting to processed_ids, and deleting the processed_ids stuff?
> Assuming I've understood correctly, that would indicate that the problem could simply be solved by moving the active_commands insertion to where I'm currently inserting to processed_ids, and deleting the processed_ids stuff?
One point is sure: one should keep only a single de-duplication mechanism. What you propose makes sense: it's better to remove duplicates before any processing.
> why do we have this issue with duplicated commands while there is already a mechanism supposed to handle that?
For the cases where the duplicate messages are delivered after a restart, it is the lack of persistence of this set. But for duplicate messages delivered while the mapper is still live, this should have been sufficient.
I've now modified active_commands to insert immediately post-conversion, rather than when we receive our outgoing message. As a result, I've deleted processed_ids.
> I don't see how this can happen, as a device is not expected to receive an operation that it hasn't declared as a supported operation (this is my expectation from C8Y) via the operation capability registration.

Operations can be created via the API, which doesn't first check whether the device supports them. So in automation scenarios it is not uncommon for operations to be sent to devices regardless, on the assumption that an unsupported operation will be rejected by the agent (this is deemed cheaper than the backend service first checking whether each device supports the intended operation). This is fairly common in large device fleets (> 200K devices).
let original = MqttMessage::new(&Topic::new_unchecked("c8y/devicecontrol/notifications"), json!(
    {"delivery":{"log":[],"time":"2025-03-05T08:49:24.986Z","status":"PENDING"},"agentId":"1916574062","creationTime":"2025-03-05T08:49:24.967Z","deviceId":"1916574062","id":"16574089","status":"PENDING","c8y_Restart":{},"description":"do something","externalSource":{"externalId":"test-device","type":"c8y_Serial"}}
).to_string());
let redelivery = MqttMessage::new(&Topic::new_unchecked("c8y/devicecontrol/notifications"), json!(
    {"delivery":{"log":[{"time":"2025-03-05T08:49:24.986Z","status":"PENDING"},{"time":"2025-03-05T08:49:25.000Z","status":"SEND"},{"time":"2025-03-05T08:49:25.162Z","status":"DELIVERED"}],"time":"2025-03-05T08:49:25.707Z","status":"PENDING"},"agentId":"1916574062","creationTime":"2025-03-05T08:49:24.967Z","deviceId":"1916574062","id":"16574089","status":"PENDING","c8y_Restart":{},"description":"do something","externalSource":{"externalId":"test-device","type":"c8y_Serial"}}
).to_string());
It would be good to make it obvious that original and redelivery only differ in the delivery field.
I've now removed some of the extraneous fields and changed things so we send the original message twice, since the converter doesn't care about the delivery field.
}

#[tokio::test]
async fn custom_operations_are_not_deduplicated_before_registration() {
I struggle to understand what is checked by this test and how.
CumulocityConverter.supported_operations field is patched before the first request then restored before the second. Okay, but why?
This test checks what happens if we receive a custom operation that is unrecognised and are later redelivered the custom operation after it is registered with the mapper. If the converter naïvely assumes that the operation is active after we first receive it, the de-duplication mechanism will ignore the redelivery. But since we haven't yet processed this message, we should process such a redelivery. This obviously depends on something sending a 500 message once the operation is registered, but that could be the case for some custom operation handling service.
The patching of supported_operations was intended as an easy way of making the registered operations clear, since I'm not trying to test how we update supported_operations in this case.
This is clearer. Might be good to add this response as a comment to the test.
Technically this is fine. But seeing the detailed discussion above, it felt like this case could have been better represented as an integration test in tests.rs, where we can better simulate the dynamic custom operation registration during the execution of the test.
I think I want the opposite of that, @albinsuresh. The point of this being a unit test is I don't also want to test the operation registration logic at the same time. I've added a comment to explain what it is I'm trying to test.
albinsuresh left a comment:
Changes look fine. Some minor suggestions on the tests.
assert_ne!(
    converter
        .parse_json_custom_operation_topic(&original)
        .await
        .unwrap(),
    vec![],
    "Initial operation delivery produces outgoing message"
);
A slightly stricter check that validates at least the cmd topic would have been better than this non-empty output check. To avoid false positives like a converted error message sent to te/errors instead of the expected cmd beating this assertion.
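The stricter check suggested here could look roughly like the following. This is a simplified sketch, not the real test code: the `OutMessage` struct and the `assert_converted_to_cmd` helper are invented for illustration, standing in for the mapper's actual MQTT message type, and the `te/device/main///cmd/` prefix is one plausible expected topic.

```rust
/// Simplified stand-in for an outgoing MQTT message (illustrative only).
struct OutMessage {
    topic: String,
    payload: String,
}

/// Hypothetical stricter assertion: instead of only checking that the
/// converter output is non-empty, also check the first message landed on
/// a cmd topic, so an error converted to te/errors cannot pass the test.
fn assert_converted_to_cmd(messages: &[OutMessage]) {
    let first = messages
        .first()
        .expect("expected at least one outgoing message");
    assert!(
        first.topic.starts_with("te/device/main///cmd/"),
        "expected a cmd message, got one on `{}` (e.g. a te/errors conversion error)",
        first.topic
    );
}
```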
converter.supported_operations = after_registration;

assert_ne!(
Same comment as above, regarding a stricter check.
didier-wenzek left a comment:
Two forgotten dbg! to be removed and some questions.
&mut self,
message: &MqttMessage,
) -> Result<Vec<MqttMessage>, ConversionError> {
    if dbg!(self.active_commands_last_cleared.elapsed(&*self.clock)) > Duration::from_secs(3600)
Just a matter of taste: I would prefer an elapsed or elapsed_since method on the clock rather than the instant:
- if dbg!(self.active_commands_last_cleared.elapsed(&*self.clock)) > Duration::from_secs(3600)
+ if self.clock.elapsed_since(&self.active_commands_last_cleared) > Duration::from_secs(3600)
PS: dbg! to be removed
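The suggested `elapsed_since` helper could live as a default method on the clock trait, so call sites query the clock rather than the instant. A minimal sketch, assuming a simplified `Clock` trait (the trait shape and `WallClock` type here are illustrative, not the mapper's actual clock abstraction):

```rust
use std::time::{Duration, Instant};

/// Illustrative clock trait with the reviewer's suggested helper as a
/// default method, so mocks only need to implement `now`.
trait Clock {
    fn now(&self) -> Instant;

    /// How much time has passed on this clock since `earlier`.
    fn elapsed_since(&self, earlier: &Instant) -> Duration {
        self.now().saturating_duration_since(*earlier)
    }
}

/// Real-time implementation backed by the system monotonic clock.
struct WallClock;

impl Clock for WallClock {
    fn now(&self) -> Instant {
        Instant::now()
    }
}
```

A test clock would implement `now` to return a controllable instant, which is what makes `self.clock.elapsed_since(...)` nicer to mock than calling `elapsed` on the stored instant.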
didier-wenzek left a comment:
Approved. Thank you.
albinsuresh left a comment:
The changes look much simpler now. I've got one concern though, regarding premature eviction from the cache before the operation really completes.
&mut self,
message: &MqttMessage,
) -> Result<Vec<MqttMessage>, ConversionError> {
    if self.active_commands_last_cleared.elapsed() > Duration::from_secs(3600) {
I was actually thinking that you'd use the timer actor from the mapper to get timeout notification message at desired intervals. But, it was smart to place the eviction logic here.
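The lazy-eviction approach being discussed (clearing in the message handler rather than via a timer actor) can be sketched as follows. This is a hand-wavy illustration under assumptions: the `Converter` struct and `convert` signature are simplified stand-ins for the real converter, and the one-hour window mirrors the `Duration::from_secs(3600)` in the snippet above.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

/// Illustrative converter fragment: active command IDs are cleared lazily
/// whenever a message arrives and the last clearing is older than the
/// expiry window, instead of scheduling a timer actor.
struct Converter {
    active_commands: HashSet<String>,
    active_commands_last_cleared: Instant,
}

impl Converter {
    const EXPIRY: Duration = Duration::from_secs(3600);

    /// Returns true if the operation is new and should be processed.
    fn convert(&mut self, op_id: &str) -> bool {
        // Lazy eviction, piggy-backing on the message handler.
        if self.active_commands_last_cleared.elapsed() > Self::EXPIRY {
            self.active_commands.clear();
            self.active_commands_last_cleared = Instant::now();
        }
        // `insert` returns false for duplicates, de-duplicating the op.
        self.active_commands.insert(op_id.to_string())
    }
}
```

This also makes the concern above concrete: an operation still running when the whole set is cleared loses its de-duplication entry, so a late duplicate of it would be processed again.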
Force-pushed from e50c720 to 288ab5e
albinsuresh left a comment:
I re-confirm my approval.
Redid time-based expiry logic
Signed-off-by: James Rhodes <jarhodes314@gmail.com>
…for active_commands
Signed-off-by: James Rhodes <jarhodes314@gmail.com>
Force-pushed from 288ab5e to 0f2d597
Proposed changes
Fixes the handling of in-progress operations by the Cumulocity converter so that the de-duplication mechanism is applied immediately. Since a small amount of time passes between triggering the operation and the converter receiving the `te/...` topic message (which was when the operation was marked as "active" in the converter), there was a race that could lead to duplicate messages from Cumulocity both being processed by the converter.

Specifically, this PR changes the converter to mark the operation as "active" as it is initially handled by the converter. This has two advantages: firstly, it fixes the aforementioned race condition, and additionally it means that legacy custom operations based on SmartREST are also de-duplicated. Since such an operation doesn't have an associated `te/...` topic, these "active" operations expire after 12 hours. For workflow-based/built-in operations, the "active" operation is discarded by the mapper only when it is marked as complete on the relevant `te/...` topic. The de-duplication works across mapper restarts (assuming MQTT broker persistence is appropriately configured), as it did before.

Types of changes
Paste Link to the issue

Checklist
- I have run `just prepare-dev` once
- I have run `cargo fmt` as mentioned in CODING_GUIDELINES
- I have run `cargo clippy` as mentioned in CODING_GUIDELINES

Further comments