Skip to content

feat: wait for mosquitto to start rather than blindly sleeping in tedge connect#3246

Merged
jarhodes314 merged 9 commits intothin-edge:mainfrom
jarhodes314:bug/tedge-connect-timeout
Nov 18, 2024
Merged

feat: wait for mosquitto to start rather than blindly sleeping in tedge connect#3246
jarhodes314 merged 9 commits intothin-edge:mainfrom
jarhodes314:bug/tedge-connect-timeout

Conversation

@jarhodes314
Copy link
Copy Markdown
Contributor

@jarhodes314 jarhodes314 commented Nov 13, 2024

Proposed changes

Change the behaviour of tedge connect to wait until mosquitto is listening on the configured port, rather than blindly sleeping for 5 seconds and hoping that's long enough. This drastically speeds up tedge connect c8y in most cases, and allows us to increase the timeout (which I've set here to 20 seconds) to allow for cases where mosquitto starts more slowly.

Additionally, I've now added some logic to republish the check connection message when the built-in bridge publishes a health up message. This fixes connection check errors caused when we send the message too soon and the bridge misses the message entirely. We can't rely on message persistence here since the bridge may have never previously connected.

Still TODO:

  • Finish migrating log messages, error messages and warnings to new format
  • Update tedge connect c8y --test to do URL check too

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s)
  • I ran cargo fmt as mentioned in CODING_GUIDELINES
  • I used cargo clippy as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@jarhodes314 jarhodes314 added the theme:mqtt Theme: mqtt and mosquitto related topics label Nov 13, 2024
@codecov
Copy link
Copy Markdown

codecov bot commented Nov 13, 2024

@jarhodes314 jarhodes314 force-pushed the bug/tedge-connect-timeout branch from 6ce4bfb to 230e264 Compare November 13, 2024 13:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Nov 13, 2024

Robot Results

✅ Passed ❌ Failed ⏭️ Skipped Total Pass % ⏱️ Duration
524 0 2 524 100 1h29m15.769199s

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved

Comment on lines +802 to +803
println!("Couldn't resolve configured mosquitto address, waiting {sleep_seconds} seconds for mosquitto to start");
std::thread::sleep(Duration::from_secs(sleep_seconds));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this sleep, when the address can't even be resolved? Why not fail fast?

Comment on lines +611 to +665
} else if is_bridge_health_up_message(&response, &built_in_bridge_health) {
// Built in bridge is now up, republish the message in case it was never received by the bridge
client.publish(
&aws_topic_pub_check_connection,
AtLeastOnce,
false,
REGISTRATION_PAYLOAD,
)?;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this logic valid for the mosquitto bridge as well? Also wondering why don't we move the is_bridge_health_up_message check before entering this event loop or even before the test client connection is established.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this logic valid for the mosquitto bridge as well?

Yes, why not. This would even help to diagnose connection issues.

Also wondering why don't we move the is_bridge_health_up_message check before entering this event loop or even before the test client connection is established.

The event loop is required to get these messages.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the mosquitto bridge, I don't think this is true. We don't need the cloud connection to exist; we just need the bridge to have a subscription to the message. I haven't read into this to verify it, but my understanding is that the mosquitto bridge is subscribed to the relevant topics before mosquitto is up for others to listen to.

In the same vein, we arguably don't need to check that the built-in bridge health is up, a down message, even a retained one, proves that the subscription exists (even if it's a persisted retained message, as it's not a clean session). That said, using that as a check would be both confusing and not gain us much since we need the cloud connection anyway to make progress.

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The added connection checks are correct. And I'm okay to merge this PR.

However, can this PR be an opportunity to improve the output of tedge connect? The current output is verbose and adds little value. Something along the following lines:

- Connecting
   - device = demo-device-008
   - mqtt = didier.latest.stage.c8y.io
   - certificate = /etc/tedge/device-certs/demo-device-008.pem 
   - bridge = builtin 
- config: ok
- mosquitto: restarted
- bridge: up
- c8y: connected 

clap = { workspace = true }
doku = { workspace = true }
hyper = { workspace = true, default-features = false }
mqtt_channel = { workspace = true }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR. We could improve overall dependency management with appropriate pub use. Here, using tedge_api forces an explicit dependency on mqtt_channel just to use Topic.

Comment on lines +611 to +665
} else if is_bridge_health_up_message(&response, &built_in_bridge_health) {
// Built in bridge is now up, republish the message in case it was never received by the bridge
client.publish(
&aws_topic_pub_check_connection,
AtLeastOnce,
false,
REGISTRATION_PAYLOAD,
)?;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this logic valid for the mosquitto bridge as well?

Yes, why not. This would even help to diagnose connection issues.

Also wondering why don't we move the is_bridge_health_up_message check before entering this event loop or even before the test client connection is established.

The event loop is required to get these messages.

println!("Received expected response message.");
exists = true;
break;
} else if is_bridge_health_up_message(&response, &built_in_bridge_health) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having to add the same test, for c8y, azure and aws, is okayish. However, it's maybe not the right time to fix that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, particularly with refactoring the output, the overlap has become obvious. I haven't yet attempted to tackle this, but there is a lot of commonality that we should be able to extract.

@jarhodes314
Copy link
Copy Markdown
Contributor Author

The added connection checks are correct. And I'm okay to merge this PR.

However, can this PR be an opportunity to improve the output of tedge connect? The current output is verbose and adds little value. Something along the following lines:

- Connecting
   - device = demo-device-008
   - mqtt = didier.latest.stage.c8y.io
   - certificate = /etc/tedge/device-certs/demo-device-008.pem 
   - bridge = builtin 
- config: ok
- mosquitto: restarted
- bridge: up
- c8y: connected 

I've had a bit of a hack at this locally. I've gone for something slightly more verbose than this suggestion in order to make it easy to see what is currently being waited on:
asciicast

image

The changes for this aren't finished yet, hence they aren't pushed, but they are working for tedge connect c8y

Signed-off-by: James Rhodes <jarhodes314@gmail.com>
Signed-off-by: James Rhodes <jarhodes314@gmail.com>
@jarhodes314
Copy link
Copy Markdown
Contributor Author

Really nice improvement. I will be happy to approve once the tests green.

@didier-wenzek I think the tests are now as green as they are going to get

Copy link
Copy Markdown
Contributor

@didier-wenzek didier-wenzek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirm my approval. Connection progress and errors are now nicely reported.

There are still 3 failing system tests - but all related to remote access, a feature currently broken on the CI c8y instance.

@jarhodes314 jarhodes314 added this pull request to the merge queue Nov 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 18, 2024
@jarhodes314 jarhodes314 added this pull request to the merge queue Nov 18, 2024
Merged via the queue into thin-edge:main with commit a51dc6a Nov 18, 2024
@reubenmiller reubenmiller added this to the 1.4.0 milestone Dec 5, 2024
@reubenmiller reubenmiller changed the title bug: wait for mosquitto to start rather than blindly sleeping in tedge connect feat: wait for mosquitto to start rather than blindly sleeping in tedge connect Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

theme:mqtt Theme: mqtt and mosquitto related topics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants