Skip to content

fix: Ignore extra HealthStatus fields in service monitor#3132

Merged
Bravo555 merged 2 commits intothin-edge:mainfrom
Bravo555:fix/3128/health-ignore-pid-in-mapper
Sep 25, 2024
Merged

fix: Ignore extra HealthStatus fields in service monitor#3132
Bravo555 merged 2 commits intothin-edge:mainfrom
Bravo555:fix/3128/health-ignore-pid-in-mapper

Conversation

@Bravo555
Copy link
Copy Markdown
Member

Proposed changes

tedge-mapper will now not only try to parse status field and will ignore other extra fields.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

Features were added because without them it was impossible to build and
check `tedge-watchdog` by itself using a `-p` flag, i.e: `cargo build -p
tedge-watchdog` (`hide_env_values` requires `env` and `default_value`
variant that uses `OsString` required `string`)

Note that adding these features in workspace Cargo.toml should not add
more code to the binary as these features were already used because
other crate required them, so the only clap instance we're pulling have
these enabled. Only now we're making these features available to all the
workspace crates (including tedge-watchdog).

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files

📢 Thoughts on this report? Let us know!

@Bravo555 Bravo555 marked this pull request as draft September 24, 2024 09:01
Comment on lines +44 to +45
#[derive(Debug, Serialize, Deserialize)]
pub struct HealthStatusExt {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider to move this struct in tedge_api.

Also, I'm not really convinced by having two public structs with similar purpose. Wouldn't be better to keep the HealthStatus struct unchanged with a deserialize method that fallbacks on a minimal private struct in case some fields cannot be parsed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did talk about this with @albinsuresh because I started out with the same approach but concluded that having separate structs with the one in tedge_watchdog extending the one in tedge_api would be better.

Also, I'm not really convinced by having two public structs with similar purpose.

My bad, the struct wasn't supposed to be public and the aim was for it to be very much private to the tedge_watchdog crate. pub was a leftover from when I had it in a nested module, which is now removed.

Wouldn't be better to keep the HealthStatus struct unchanged with a deserialize method that fallbacks on a minimal private struct in case some fields cannot be parsed?

I understand that pros of the "one struct" approach to be the following:

  • simpler interface (only one type of HealthStatus) at the cost of a more complex implementation (custom Deserialize impl that tries to deserialize fields to an expected format but falls back to some other representation when it's not possible)
  • all possible fields in one place so one has to only look at tedge_api to see that HealthStatus can have these fields only

But IMHO it would make sense to keep HealthStatus in tedge_api as small as possible and not to put in fields that are not really required and erect more structure than necessary:

My view is that the payload under te/device/device001/service/service001/status/health is from an open set: it needs to have status property for its status to be synced with c8y, but it may also have pid and time when it's using tedge_watchdog. Perhaps some other component would require other fields to be present, but not care about status, pid, or time.

So we can't really deserialize to a struct that has all these fields defined because then if there's a problem with one of the fields that we're not using, we can't use other fields that we need and that are fine.

The easiest would be to just deserialize to HashMap<String, JsonValue>, but then the callers have to look up the fields they need and validate their types.

So we can do one additional step - make a struct for these properties so that serde will fetch the fields and validate their types and values for us. Because serde by default omits unknown fields, the struct doesn't say "Health status payload has only these fields", it says "it has at least these fields"

deserialize method that fallbacks on a minimal private struct in case some fields cannot be parsed

This is the somewhat problematic part. Deserialization either succeeds or fails, so it's not clear to me how to handle failures on individual fields while keeping other fields in a good way. If we want to deserialize e.g. pid: Option<u32>, how to handle "pid": "invalid" case? If we put None, we lose the error. If we deserialize to Result<u32, DeserializeError> that's just clumsy. Maybe we can put it into extra: HashMap<String, JsonValue>, but the outcome is not much different from just skipping the complex deserialize impl and using the extra hashmap directly, or even better defining another struct that has fields we require so that we don't have to extract from the hashmap manually.

IMO the default serde behaviour of "I want these fields to be present and valid and I want to ignore other fields" is conducive to the approach of defining multiple structs, even for things that are only slightly different.

TL;DR: IMO health status payload is a dynamic type, we shouldn't try to enumerate all possible fields for a deserialize impl, and coping with different components having different failure modes when some of the fields are missing. Instead it makes sense to only require fields we require in each caller, but be upfront in the documentation that the JSON payload may also contain other fields.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps some other component would require other fields to be present, but not care about status, pid, or time.

Here, you are going a bit too far. If there no relationship with an HealthStatus, what's the point to use that struct?

The easiest would be to just deserialize to HashMap<String, JsonValue>, but then the callers have to look up the fields they need and validate their types.

This definitely not what I'm asking for.

Deserialization either succeeds or fails, so it's not clear to me how to handle failures on individual fields while keeping other fields in a good way. If we want to deserialize e.g. pid: Option<u32>, how to handle "pid": "invalid" case? If we put None, we lose the error. If we deserialize to Result<u32, DeserializeError> that's just clumsy.

I would put None in that case (I agree this is not always appropriate), meaning there is no pid information that can be used.

IMO the default serde behaviour of "I want these fields to be present and valid and I want to ignore other fields" is conducive to the approach of defining multiple structs, even for things that are only slightly different.

It can be indeed appropriate to define multiple independent structs. But here you propose something a bit complicated for nothing: why an extension of tedge_api::HealthStatus and not a fully independent tedge_watchdog::HealthStatus.

This is even a surprise to me that tedge_watchdog doesn't use the status field! Did I miss something?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be indeed appropriate to define multiple independent structs. But here you propose something a bit complicated for nothing: why an extension of tedge_api::HealthStatus and not a fully independent tedge_watchdog::HealthStatus.

Right, if there were more fields in base HealthStatus that tedge_watchdog would be using then it would make sense to extend, right now they can be made independent with not much difference.

This is even a surprise to me that tedge_watchdog doesn't use the status field! Did I miss something?

You're correct, tedge_watchdog seems to not inspect status field before sending notifications to systemd, but it probably should, right?

In any case, I'll provide an alternative implementation more in line with your requirements - I had it working at one point but decided the current approach to be simpler, so I'll just find it in my git reflog and we can compare.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I'll provide an alternative implementation more in line with your requirements - I had it working at one point but decided the current approach to be simpler, so I'll just find it in my git reflog and we can compare.

As tedge_watchdog is not using the status field, the simpler is to go your way but with two fully independent structs.

Copy link
Copy Markdown
Member Author

@Bravo555 Bravo555 Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@albinsuresh albinsuresh Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As tedge_watchdog is not using the status field, the simpler is to go your way but with two fully independent structs.

The watchdog should have been using that field, which is why I proposed the watchdog to use a HealthStatusExt struct, which is an extension of the HealthStatus struct, but with those additional fields relevant only to it. I believe this extended setup would make more sense when the watchdog is made to care about the status, which it should always have. I'm not quite sure if I missed it in the initial impl itself or got lost during some of those earlier refactoring exercises.

Looking at the alternate impl, I am also in favour of the simplified impl in this PR.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Sep 24, 2024

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
505 0 2 505 100 1h27m37.81966s

@Bravo555 Bravo555 force-pushed the fix/3128/health-ignore-pid-in-mapper branch from 74f6ac5 to 33ca589 Compare September 24, 2024 10:33
@Bravo555 Bravo555 temporarily deployed to Test Pull Request September 24, 2024 11:32 — with GitHub Actions Inactive
@Bravo555 Bravo555 marked this pull request as ready for review September 24, 2024 12:33
@Bravo555 Bravo555 changed the title Ignore extra HealthStatus fields in service monitor and allow PID to be a string Ignore extra HealthStatus fields in service monitor Sep 24, 2024
Copy link
Copy Markdown
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question

if let Ok(message) = message.payload_str() {
debug!("Health response received: {message}");
if let Ok(health_status) = serde_json::from_str::<HealthStatus>(message) {
if let Ok(health_status) = serde_json::from_str::<HealthStatusExt>(message) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel a bit odd that the watchdog doesn't seem caring if the status is up or not. As long as pid and time are included in the health status message, the health status is considered "up"?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if it contains time and pid it sends systemd-notify notification without checking status, and this definitely seems like an overlook to me. I would expect notifications to be sent only when status is up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so that's why you didn't include the HealthStatus struct as a field in the HealthStatusExt struct, as it was ignored. That needs to be fixed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if it contains time and pid it sends systemd-notify notification without checking status, and this definitely seems like an overlook to me. I would expect notifications to be sent only when status is up.

I also expect in this way. However, it's out of scope to fix it in this PR?

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved.

///
/// https://thin-edge.github.io/thin-edge.io/operate/troubleshooting/monitoring-service-health/
#[derive(Debug, Serialize, Deserialize)]
struct HealthStatusExt {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can keep the former name.

Suggested change
struct HealthStatusExt {
struct HealthStatus {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this struct to wrap the HealthStatus struct as a flattened field, hence the name.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this struct to wrap the HealthStatus struct as a flattened field, hence the name.

Yeah, but the wrapped value is not used.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but the wrapped value is not used.

I think the value should have used. It's the implementation oversight that we are not using the status in watchdog. See this thread.

Signed-off-by: Marcel Guzik <marcel.guzik@inetum.com>
@Bravo555 Bravo555 force-pushed the fix/3128/health-ignore-pid-in-mapper branch from 1753f46 to de6e408 Compare September 25, 2024 08:54
@Bravo555 Bravo555 temporarily deployed to Test Pull Request September 25, 2024 08:54 — with GitHub Actions Inactive
@Bravo555 Bravo555 changed the title Ignore extra HealthStatus fields in service monitor fix: Ignore extra HealthStatus fields in service monitor Sep 25, 2024
@Bravo555 Bravo555 added the theme:monitoring Theme: Service monitoring and watchdogs label Sep 25, 2024
Copy link
Copy Markdown
Member

@rina23q rina23q left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved the code as of now. I'm looking forward the follow-up PR to fix the watchdog issue! 👍

@Bravo555 Bravo555 added this pull request to the merge queue Sep 25, 2024
Merged via the queue into thin-edge:main with commit ab258fc Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme:monitoring Theme: Service monitoring and watchdogs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants