fix: Don't fail if only some operations could not be loaded #3236

Merged
Bravo555 merged 3 commits into thin-edge:main from Bravo555:fix/3234/tedge-mapper-c8y-panics-on-startup
Nov 19, 2024

Conversation

@Bravo555
Member

@Bravo555 Bravo555 commented Nov 8, 2024

TODO

Proposed changes

When loading operations from $CONFIG_DIR/operations/c8y, log an error for operations we were unable to load but return other successfully loaded operations.

Because of #3160, this change on its own caused many duplicate log messages, so the PR also reworks operation registration so that it no longer loads and re-registers every operation from disk when only a single operation is registered. Invalid operation files are still logged twice on startup, because subscriptions cannot be changed at runtime; see #3236 (comment).
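
A minimal sketch of the loading behaviour described above, not the actual thin-edge code. `Operation`, `parse_operation`, `load_operations` and the `exec=` file format are hypothetical stand-ins, and the operation files are modelled as an in-memory map instead of being read from $CONFIG_DIR/operations/c8y:

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a parsed operation file.
#[derive(Debug, Clone, PartialEq)]
struct Operation {
    name: String,
    command: String,
}

/// Hypothetical parser: a file counts as valid when it reads `exec=<command>`.
fn parse_operation(name: &str, content: &str) -> Result<Operation, String> {
    match content.trim().strip_prefix("exec=") {
        Some(command) if !command.is_empty() => Ok(Operation {
            name: name.to_string(),
            command: command.to_string(),
        }),
        _ => Err(format!("invalid operation file `{}`", name)),
    }
}

/// Returns every operation that parsed successfully. Failures are logged
/// and collected instead of aborting the whole load.
fn load_operations(files: &BTreeMap<&str, &str>) -> (Vec<Operation>, Vec<String>) {
    let mut loaded = Vec::new();
    let mut errors = Vec::new();
    for (name, content) in files {
        match parse_operation(name, content) {
            Ok(operation) => loaded.push(operation),
            Err(err) => {
                // Stand-in for logging at error level.
                eprintln!("ERROR: {}", err);
                errors.push(err);
            }
        }
    }
    (loaded, errors)
}
```

The point of the sketch is the return shape: one invalid file produces a log entry but does not prevent the valid operations from being returned.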

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@codecov

codecov bot commented Nov 8, 2024

Codecov Report

Attention: Patch coverage is 84.54106% with 32 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
crates/extensions/c8y_mapper_ext/src/converter.rs | 82.69% | 14 Missing and 13 partials ⚠️
crates/core/c8y_api/src/smartrest/operations.rs | 90.19% | 3 Missing and 2 partials ⚠️

@Bravo555 Bravo555 changed the title Don't fail if only some operations could not be loaded fix: Don't fail if only some operations could not be loaded Nov 8, 2024
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 8, 2024 18:25 — with GitHub Actions Inactive
@github-actions
Contributor

github-actions bot commented Nov 10, 2024

Robot Results

✅ Passed | ❌ Failed | ⏭️ Skipped | Total | Pass % | ⏱️ Duration
524 | 0 | 2 | 524 | 100 | 1h26m23.606578999s

@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 12, 2024 14:13 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from 7c40908 to 5dd7588 Compare November 13, 2024 12:15
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from 5dd7588 to c024494 Compare November 13, 2024 12:17
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 13, 2024 12:17 — with GitHub Actions Inactive
@Bravo555
Member Author

This PR changes the logic for loading c8y operations in the CumulocityConverter so that it no longer panics; invalid operation files are instead logged at error level. However, due to #3160, the output gets spammed by multiple errors for the same file.

To ensure that only a single log is emitted for a given operation file, this PR originally aimed to address #3160 as well, by refactoring the operation registration flow so that the operation directory would be read only once. However, this optimal solution is not possible because an MQTT subscription cannot be added after the actor has been spawned, so we're forced to scan the directory when creating C8yConverterConfig to read all MQTT topics that may be defined by custom c8y operations. There are also some tests that assert implementation details, which makes it impossible to completely remove duplicate traversals of the c8y operation directory.

As such, this PR removes duplicate traversals where feasible but does not completely address #3160, so the log output will still contain some duplicate errors, which will be annoying for users. Solving this may require bigger changes to our MQTT actor, so it will be left for another PR.

@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from c024494 to 95dc831 Compare November 13, 2024 12:54
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 13, 2024 12:54 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from 95dc831 to ea2da72 Compare November 13, 2024 21:58
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 13, 2024 21:58 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from ea2da72 to 51f6823 Compare November 14, 2024 12:15
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 14, 2024 12:15 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from 61c89ab to bf9f88a Compare November 14, 2024 13:54
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 14, 2024 13:54 — with GitHub Actions Inactive
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from bf9f88a to ac65be8 Compare November 14, 2024 14:34
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 14, 2024 14:34 — with GitHub Actions Inactive
Comment on lines +40 to +44
self.operations.dedup();
let pos = self.operations.iter().position(|op| op.name == name);
pos.map(|pos| self.operations.remove(pos))
Contributor

If the switch to a BTreeSet is very minimal, I'm even okay with you making that switch in this PR itself, to avoid complications like this. Or if you prefer a standalone PR, that's also fine.

Contributor

Yes, better to switch to a BTreeSet right now.

Member Author

I would really prefer to avoid big changes to the Operations struct in this PR because of the parallel PR #3225: once one of them merges we expect conflicts, which I'd like to minimise.
I'd also like to reorganize the module a bit in the refactor, so I'd rather do it as a separate PR.
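
For illustration only, the two approaches discussed in this thread could look roughly as follows. The `Operation` type is a hypothetical stand-in, and comparing operations by name only is an assumption made here so that a `BTreeSet` can be queried with a plain `&str`; the real struct may define equality differently.

```rust
use std::borrow::Borrow;
use std::cmp::Ordering;
use std::collections::BTreeSet;

/// Hypothetical operation type; identity and ordering use the name only.
#[derive(Debug, Clone)]
struct Operation {
    name: String,
    command: String,
}

impl PartialEq for Operation {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
    }
}
impl Eq for Operation {}
impl PartialOrd for Operation {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Operation {
    fn cmp(&self, other: &Self) -> Ordering {
        self.name.cmp(&other.name)
    }
}
// Lets the set be searched by name; consistent with `Ord` above.
impl Borrow<str> for Operation {
    fn borrow(&self) -> &str {
        &self.name
    }
}

/// Vec-based removal, mirroring the reviewed lines: drop adjacent
/// duplicates, then search linearly for the name.
fn remove_from_vec(operations: &mut Vec<Operation>, name: &str) -> Option<Operation> {
    operations.dedup();
    let pos = operations.iter().position(|op| op.name == name);
    pos.map(|pos| operations.remove(pos))
}

/// Set-based removal: duplicates cannot exist, so no `dedup` is needed.
fn remove_from_set(operations: &mut BTreeSet<Operation>, name: &str) -> Option<Operation> {
    operations.take(name)
}
```

With the set, uniqueness is enforced on insertion, which is the complication the reviewers wanted to avoid handling at removal time.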

Comment on lines +1577 to +1588
let operation = c8y_api::smartrest::operations::get_operation(
ops_file.as_std_path(),
&self.config.bridge_config,
)?;
let operations = self
.operations_for_device_mut(target)
.expect("entity should've been checked before that it's not a service");

let need_cloud_update = self.update_operations(ops_dir.as_std_path())?;
let prev_operation = operations.remove_operation(&operation.name);
// even if the body of the operation is different, as long as it has the same name, supported operations message
// will be the same, so we don't need to resend
let need_cloud_update = prev_operation.is_none();
Contributor

Why not move this logic into the existing update_operations function, as that function already has the contract to return the boolean flag whether the cloud update is required or not?

Member Author

Yeah, the logic of checking if operations changed is a bit duplicated, but also slightly different:

  • update_operations reads all operations from the c8y operations dir and returns true if there were any updates in there. It compares two different instances of Operations.
  • In register_operation, since we register only a single operation, we know the update is required when prev_operation is None. We only mutate a single Operations value.

So update_operations has the side effect of creating an entirely new Operations value, and it checks more than is necessary, so I can't really reuse it as-is. I agree, though, that having the logic that decides whether we publish a 114 message split in two is suboptimal.

Member Author

The function changed a bit in a313601, and I ended up with two branches: in the case of a child device, I just call update_operations outright. Let me know if this addresses it, or if you had something more particular in mind.
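
A rough sketch of the two ways of deciding whether a cloud update is needed, as contrasted in this thread. It is not the code from the PR: `Operation`, `Operations`, `register` and `replace_all` are hypothetical names, and the full-rescan variant assumes that only a change in the set of operation names matters.

```rust
use std::collections::BTreeSet;

/// Hypothetical operation type.
#[derive(Debug, Clone, PartialEq)]
struct Operation {
    name: String,
    command: String,
}

/// Hypothetical container for the operations of one device.
#[derive(Debug, Default)]
struct Operations {
    operations: Vec<Operation>,
}

impl Operations {
    fn names(&self) -> BTreeSet<&str> {
        self.operations.iter().map(|op| op.name.as_str()).collect()
    }

    /// Single-operation path: replace any operation with the same name.
    /// A cloud update is needed only if the name was not known before.
    fn register(&mut self, operation: Operation) -> bool {
        let pos = self.operations.iter().position(|op| op.name == operation.name);
        let prev_operation = pos.map(|pos| self.operations.remove(pos));
        self.operations.push(operation);
        prev_operation.is_none()
    }
}

/// Full-rescan path: swap in a freshly loaded value and compare the two
/// instances (here by their sets of names, which is an assumption).
fn replace_all(current: &mut Operations, reloaded: Operations) -> bool {
    let changed = current.names() != reloaded.names();
    *current = reloaded;
    changed
}
```

The first path mutates a single value and inspects one entry; the second builds a whole new value and compares it with the old one, which is the extra work the author describes.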


#[tokio::test]
// TODO: fix or remove test
#[ignore = "asserts that publishing a single operation capability message causes full rescan of c8y operations directory, which is undesirable behaviour"]
Contributor

I understand that the rescan is not optimal. But, we shouldn't be skipping the whole test for that reason. The dynamic ops update feature is still relevant and must be tested.

Member Author

Yeah, this will need to be addressed before the merge.
Just that from my perspective, it looks like the primary point of this test is to assert the exact behaviour I'm removing (registering any operation via MQTT causes full rescan of c8y operations directory), so I'm not really sure what behaviour we could be checking instead.

Member Author

Okay, I discovered the reason why the test checked that behaviour (#2614) and found that my implementation was incorrect: we do in fact still need to maintain the old behaviour for child devices.

Fixed in a313601

@Bravo555 Bravo555 self-assigned this Nov 15, 2024
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 15, 2024 10:49 — with GitHub Actions Inactive
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 15, 2024 14:42 — with GitHub Actions Inactive
@Bravo555 Bravo555 removed their assignment Nov 15, 2024
Member

@rina23q rina23q left a comment

Since I confirmed that it no longer panics, I approve it. There are still many small things to improve in the Operations struct; however, I believe it is time to move on.

As such, this PR will remove duplicates traversals where it's feasible, but will not completely address #3160, and as such, the log output will still be a bit spammed by duplicate errors, which will be annoying for users. Solving this may require bigger changes to our MQTT actor, so it will be left for another PR.

I saw that there are now far fewer error messages (now 2, versus around 7 before, the number of operation files at tedge-mapper-c8y's startup). Thanks for improving this.

Contributor

@albinsuresh albinsuresh left a comment

Despite the shortcomings of the Operations structure, this is sufficient for the bug fix.

let operations = self
.operations_for_device_mut(target)
.expect("entity should've been checked before that it's not a service");
let need_cloud_update = match is_child_operation_path(ops_dir) {
Contributor

Not a request for change, but I'm just curious why you chose a match statement here instead of a simple if-else.

} else {
C8yTopic::upstream_topic(&c8y_prefix)
};
let prev_operation = current_operations.remove_operation(&operation.name);
Contributor

In your refactoring PR, consider updating the signature of the Operations::add_operation function to return the previous operation, so that we can avoid removing the operation first and adding it back one step later.
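
One possible shape for that suggestion, sketched under assumptions: the `Operation` and `Operations` types below are hypothetical stand-ins backed by a `Vec`, and this `add_operation` is not the signature that exists in the codebase.

```rust
/// Hypothetical operation type.
#[derive(Debug, Clone, PartialEq)]
struct Operation {
    name: String,
    command: String,
}

/// Hypothetical container for the operations of one device.
#[derive(Debug, Default)]
struct Operations {
    operations: Vec<Operation>,
}

impl Operations {
    /// Adds `operation`, replacing any existing operation with the same
    /// name, and returns the operation that was replaced (if any).
    fn add_operation(&mut self, operation: Operation) -> Option<Operation> {
        let pos = self.operations.iter().position(|op| op.name == operation.name);
        match pos {
            // Swap the new value in and hand the old one back to the caller.
            Some(pos) => Some(std::mem::replace(&mut self.operations[pos], operation)),
            None => {
                self.operations.push(operation);
                None
            }
        }
    }
}
```

A caller could then derive the cloud-update flag directly from the return value, for example `let need_cloud_update = operations.add_operation(operation).is_none();`.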

@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from a313601 to e4e1d22 Compare November 18, 2024 08:50
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 18, 2024 08:50 — with GitHub Actions Inactive
When loading operations from $CONFIG_DIR/operations/c8y, log an error
for operations we were unable to load but return other successfully
loaded operations.

Currently this spams the log output a bit, because we statically
register some operations by name, but it causes full directory scan and
traversal of all operation files anyway. Will be fixed in next commit.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
… operation

[1] disabled dynamic operation reload for child devices because it
needed to support nested child devices as well. For this reason, when
receiving an MQTT command metadata message and registering that
operation, we still need to read the operation directory and register
and send all the new operations.

[1]: thin-edge#2614

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
@Bravo555 Bravo555 force-pushed the fix/3234/tedge-mapper-c8y-panics-on-startup branch from e4e1d22 to d66f2ca Compare November 19, 2024 09:49
@Bravo555 Bravo555 temporarily deployed to Test Pull Request November 19, 2024 09:49 — with GitHub Actions Inactive
@Bravo555 Bravo555 added this pull request to the merge queue Nov 19, 2024
Merged via the queue into thin-edge:main with commit 944dce0 Nov 19, 2024
@Bravo555 Bravo555 deleted the fix/3234/tedge-mapper-c8y-panics-on-startup branch November 19, 2024 11:48

4 participants