fix: ensure that tedge mqtt pub fully publishes messages sent with QoS 2 #3603

Merged
jarhodes314 merged 1 commit into thin-edge:main from jarhodes314:bug/mqtt-channel-disconnect-before-pubcomp
May 12, 2025
Conversation

@jarhodes314
Contributor

Proposed changes

Ensure that all outstanding publishes are sent when running tedge mqtt pub -q 2 ....

The root cause was the queue in rumqttc's event loop. When we attempt to publish a message, this is sent to the event loop via a channel. In the case of this bug, the event loop didn't receive this publish notification before we detected the sender loop had shut down, and thus we started the disconnection process before publishing the message. The publish would still be processed before the disconnect request, but the disconnection caused us not to respond to the broker's PubRec acknowledgement, thus the QoS 2 message wasn't fully processed.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Improvement (general improvements like code refactoring that doesn't explicitly fix a bug or add any new functionality)
  • Documentation Update (if none of the other choices apply)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Paste Link to the issue

Checklist

  • I have read the CONTRIBUTING doc
  • I have signed the CLA (in all commits with git commit -s. You can activate automatic signing by running just prepare-dev once)
  • I ran just format as mentioned in CODING_GUIDELINES
  • I used just check as mentioned in CODING_GUIDELINES
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Further comments

@codecov

codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/common/mqtt_channel/src/connection.rs 87.50% 1 Missing and 1 partial ⚠️

@github-actions
Contributor

github-actions bot commented May 9, 2025

Robot Results

✅ Passed: 631  ❌ Failed: 0  ⏭️ Skipped: 3  Total: 631  Pass %: 100  ⏱️ Duration: 1h50m24.729813999s

@jarhodes314 jarhodes314 force-pushed the bug/mqtt-channel-disconnect-before-pubcomp branch from 7690f66 to 0e58116 Compare May 9, 2025 12:31
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request May 9, 2025 12:31 — with GitHub Actions Inactive
Contributor

@albinsuresh albinsuresh left a comment

Unrelated to the issue we're fixing, but seeing the additional handling required for Outgoing QoS 2 packets published by this client, I'm wondering whether we should do something similar for Incoming QoS 2 packets as well, to make sure that this client properly acknowledges those packets received from the broker.

Comment on lines +229 to +230
let remaining_events_empty =
event_loop.state.inflight() == 0 && pub_count.load(Ordering::SeqCst) == 0;
Contributor

Couldn't that be just the latter?

Suggested change
let remaining_events_empty =
event_loop.state.inflight() == 0 && pub_count.load(Ordering::SeqCst) == 0;
let remaining_events_empty = pub_count.load(Ordering::SeqCst) == 0;

Contributor Author

No, because pub_count == 0 tells us that everything we intended to publish has been published, and inflight == 0 tells us that all the messages are acknowledged.

event = event_loop.poll() => {
// If we receive more events, we want to reset the disconnect permit
// so the event loop doesn't block disconnection
disconnect_permit.take();
Contributor

Does that really help? After the permit is acquired in line no 254 during an iteration, we start the next iteration and would have triggered the disconnection in line no. 235. Since this event_loop.poll() happens after that, setting it back to None doesn't really help, does it? As the damage (triggering the disconnect task) has already been done.

Contributor Author

I thought there was a chance it might help, since we definitely have cases where we acquire the disconnect permit but then can't send the disconnect due to pending packets. However, if we have pending packets, we have to wait for event_loop.poll() to complete before we can meet the precondition for sending the disconnect. So, thinking about it more thoroughly, I think you're correct: this is redundant.

awaiting_ack.remove(&p.pkid);
}

Ok(Event::Incoming(Packet::PubRec(p))) => {
Contributor

Isn't it safer to remove the packet once PUBCOMP is received rather than PUBREC? Since that guarantees that the broker has released the message to other subscribers?

Suggested change
Ok(Event::Incoming(Packet::PubRec(p))) => {
Ok(Event::Incoming(Packet::PubComp(p))) => {

Contributor Author

Yes, this is probably true.

Contributor

Actually true, as PUBCOMP is the QoS 2 equivalent of QoS 1's PUBACK. By closing on PUBREC, the PUBREL might never be sent, leaving the broker waiting for it.

@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request May 9, 2025 14:03 — with GitHub Actions Inactive
@reubenmiller reubenmiller added theme:mqtt Theme: mqtt and mosquitto related topics theme:cli Theme: cli related topics labels May 9, 2025
Contributor

@didier-wenzek didier-wenzek left a comment

Approved.

@reubenmiller reubenmiller added this to the 1.5.1 milestone May 9, 2025
@didier-wenzek
Contributor

Rerunning the failed test. Flaky, but not so:

$ invoke flake-finder --test-name "Firmware plugin supports restart via service manager #1932" --iterations 100 --outputdir output_ff --clean
...
Results: 100 iterations, 100 passed, 0 failed

@reubenmiller
Contributor

reubenmiller commented May 9, 2025

Rerunning the failed test. Flaky, but not so:

$ invoke flake-finder --test-name "Firmware plugin supports restart via service manager #1932" --iterations 100 --outputdir output_ff --clean
...
Results: 100 iterations, 100 passed, 0 failed

Yeah, it looked like it was caused by a duplicate operation being received… we could relax the assertion to not check a max message count, but I'm not sure if that would cause downstream failures anyway… but yes, all unrelated.

Contributor

@albinsuresh albinsuresh left a comment

Approved.

…QoS 2

Signed-off-by: James Rhodes <jarhodes314@gmail.com>
@jarhodes314 jarhodes314 force-pushed the bug/mqtt-channel-disconnect-before-pubcomp branch from 65840ef to f60f280 Compare May 12, 2025 09:35
@jarhodes314 jarhodes314 temporarily deployed to Test Pull Request May 12, 2025 09:35 — with GitHub Actions Inactive
@jarhodes314 jarhodes314 added this pull request to the merge queue May 12, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 12, 2025
@jarhodes314 jarhodes314 added this pull request to the merge queue May 12, 2025
Merged via the queue into thin-edge:main with commit 617f89a May 12, 2025
34 checks passed

Labels

theme:cli Theme: cli related topics theme:mqtt Theme: mqtt and mosquitto related topics

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants