feat: wait for mosquitto to start rather than blindly sleeping in `tedge connect` (#3246)
Conversation
force-pushed from 6ce4bfb to 230e264
```rust
println!("Couldn't resolve configured mosquitto address, waiting {sleep_seconds} seconds for mosquitto to start");
std::thread::sleep(Duration::from_secs(sleep_seconds));
```
What's the point of this sleep, when the address can't even be resolved? Why not fail fast?
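A minimal sketch of the fail-fast alternative discussed here (function name and error handling are illustrative, not the actual tedge code): resolve the configured address once, report a resolution failure immediately, and only poll while the address is known to be valid.

```rust
use std::net::{SocketAddr, TcpStream, ToSocketAddrs};
use std::time::{Duration, Instant};

/// Illustrative sketch, not the actual tedge implementation: fail fast on a
/// resolution error instead of sleeping, then poll the resolved address until
/// mosquitto accepts a connection or the timeout expires.
fn wait_for_mosquitto(host: &str, port: u16, timeout: Duration) -> Result<(), String> {
    // A DNS/config error is permanent for our purposes: report it immediately.
    let addr: SocketAddr = (host, port)
        .to_socket_addrs()
        .map_err(|e| format!("couldn't resolve {host}:{port}: {e}"))?
        .next()
        .ok_or_else(|| format!("no address found for {host}:{port}"))?;

    let deadline = Instant::now() + timeout;
    loop {
        match TcpStream::connect_timeout(&addr, Duration::from_millis(250)) {
            // Something accepted the connection, so mosquitto is listening
            Ok(_) => return Ok(()),
            Err(e) if Instant::now() >= deadline => {
                return Err(format!("mosquitto not reachable within {timeout:?}: {e}"))
            }
            // Not up yet: back off briefly and retry until the deadline
            Err(_) => std::thread::sleep(Duration::from_millis(100)),
        }
    }
}
```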
```rust
} else if is_bridge_health_up_message(&response, &built_in_bridge_health) {
    // Built-in bridge is now up, republish the message in case it was never received by the bridge
    client.publish(
        &aws_topic_pub_check_connection,
        AtLeastOnce,
        false,
        REGISTRATION_PAYLOAD,
    )?;
}
```
Isn't this logic valid for the mosquitto bridge as well? Also wondering why we don't move the `is_bridge_health_up_message` check before entering this event loop, or even before the test client connection is established.
> Isn't this logic valid for the mosquitto bridge as well?

Yes, why not. This would even help to diagnose connection issues.

> Also wondering why we don't move the `is_bridge_health_up_message` check before entering this event loop, or even before the test client connection is established.

The event loop is required to get these messages.
For the mosquitto bridge, I don't think this is true. We don't need the cloud connection to exist; we just need the bridge to have a subscription to the message. I haven't read into this to verify it, but my understanding is that the mosquitto bridge is subscribed to the relevant topics before mosquitto is up for others to listen to.
In the same vein, we arguably don't need to check that the built-in bridge health is up, a down message, even a retained one, proves that the subscription exists (even if it's a persisted retained message, as it's not a clean session). That said, using that as a check would be both confusing and not gain us much since we need the cloud connection anyway to make progress.
didier-wenzek
left a comment
The added connection checks are correct. And I'm okay to merge this PR.
However, can this PR be an opportunity to improve the output of `tedge connect`? The current output is verbose and adds little value. Something along the following lines:
- Connecting
- device = demo-device-008
- mqtt = didier.latest.stage.c8y.io
- certificate = /etc/tedge/device-certs/demo-device-008.pem
- bridge = builtin
- config: ok
- mosquitto: restarted
- bridge: up
- c8y: connected
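A minimal sketch of rendering that compact progress output (helper name and shape are made up for illustration, not the actual implementation):

```rust
/// Illustrative sketch only: render the compact `tedge connect` progress
/// output suggested above, one "name: status" line per completed step.
fn render_progress(steps: &[(&str, &str)]) -> String {
    steps
        .iter()
        .map(|(name, status)| format!("{name}: {status}"))
        .collect::<Vec<_>>()
        .join("\n")
}
```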
```toml
clap = { workspace = true }
doku = { workspace = true }
hyper = { workspace = true, default-features = false }
mqtt_channel = { workspace = true }
```
Unrelated to this PR. We could improve overall dependency management with appropriate `pub use`. Here, using `tedge_api` forces an explicit dependency on `mqtt_channel` just to use `Topic`.
```rust
println!("Received expected response message.");
exists = true;
break;
} else if is_bridge_health_up_message(&response, &built_in_bridge_health) {
```
Having to add the same test for c8y, azure and aws is okay-ish. However, it's maybe not the right time to fix that.
Yeah, particularly with refactoring the output, the overlap has become obvious. I haven't yet attempted to tackle this, but there is a lot of commonality that we should be able to extract.
I've had a bit of a hack at this locally. I've gone for something slightly more verbose than this suggestion, in order to make it easy to see what is currently being waited on. The changes for this aren't finished yet, hence they aren't pushed, but they are working for `tedge connect`.

Signed-off-by: James Rhodes <jarhodes314@gmail.com>
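One way the duplicated per-cloud check loops could be factored, sketched with plain closures standing in for the MQTT client and event loop (all names and topics here are made up for illustration, not the actual tedge code):

```rust
/// Illustrative sketch only: the c8y/azure/aws check loops differ mainly in
/// their topics and payload, so the wait-and-republish logic could be shared.
/// `next_message` stands in for the MQTT event loop yielding (topic, payload)
/// pairs; `republish` stands in for publishing the cloud-specific
/// check-connection message again.
fn wait_for_connection_response(
    expected_topic: &str,
    bridge_health_topic: &str,
    mut next_message: impl FnMut() -> Option<(String, String)>,
    mut republish: impl FnMut(),
) -> bool {
    while let Some((topic, payload)) = next_message() {
        if topic == expected_topic {
            // Received the expected response message
            return true;
        }
        // Bridge just came up: resend the check message it may have missed
        if topic == bridge_health_topic && payload.contains(r#""status":"up""#) {
            republish();
        }
    }
    false
}
```

A per-cloud wrapper would then only need to supply its own topics and payload, rather than repeating the loop.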
force-pushed from ab36d12 to 0aaf9a1
force-pushed from 0aaf9a1 to ab188c8
@didier-wenzek I think the tests are now as green as they are going to get
didier-wenzek
left a comment
I confirm my approval. Connection progress and errors are now nicely reported.
There are still 3 failing system tests - but all related to remote access, a feature currently broken on the CI c8y instance.

Proposed changes
Change the behaviour of `tedge connect` to wait until mosquitto is listening on the configured port, rather than blindly sleeping for 5 seconds and hoping that's long enough. This drastically speeds up `tedge connect c8y` in most cases, and allows us to increase the timeout (which I've set here to 20 seconds) to allow for cases where mosquitto starts more slowly.

Additionally, I've now added some logic to republish the check connection message when the built-in bridge publishes a health up message. This fixes connection check errors caused when we send the message too soon and the bridge misses the message entirely. We can't rely on message persistence here since the bridge may have never previously connected.
Still TODO:

- `tedge connect c8y --test` to do URL check too

Types of changes
Paste Link to the issue
Checklist
- `cargo fmt` as mentioned in CODING_GUIDELINES
- `cargo clippy` as mentioned in CODING_GUIDELINES

Further comments