Check exclusive queue owner before deleting a queue (backport #15276) (backport #15286) #15287
Merged: michaelklishin merged 5 commits into v4.1.x on Jan 17, 2026

Conversation
[Why]
For a long time, there has been a race condition when deleting exclusive queues: if a connection was re-established and a queue with the same name was declared, we could delete the new queue.

For example, with many MQTT consumers, if we performed a rolling restart of the cluster and the clients reconnected without any delay, we sometimes ended up with the expected number of connections but fewer queues after the restart, even though there should be one queue per consumer.

[How]
Check that exclusive_owner has the expected value when requesting deletion. If the value is different, this is effectively a different queue (same name, but a different connection), so we should not delete it.

(cherry picked from commit 31ba23a)
(cherry picked from commit 8588cee)

# Conflicts:
#	deps/rabbit/src/rabbit_db_queue.erl
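The race and the conditional-delete fix can be illustrated with a minimal, language-agnostic simulation. This is a Python sketch, not the actual Erlang/Khepri change; `QueueStore`, `delete_naive`, `delete_checked`, and the connection ids are all invented names for illustration.

```python
class QueueStore:
    """Toy metadata store mapping queue name -> owning connection id."""

    def __init__(self):
        self.queues = {}

    def declare(self, name, owner):
        self.queues[name] = owner

    def delete_naive(self, name):
        # Pre-fix behaviour: delete by name only, ignoring the owner.
        return self.queues.pop(name, None)

    def delete_checked(self, name, expected_owner):
        # Post-fix behaviour: delete only if the owner still matches,
        # i.e. only if this is still the *same* queue we meant to delete.
        if self.queues.get(name) == expected_owner:
            del self.queues[name]
            return True
        return False


store = QueueStore()
store.declare("mqtt-subscription-client1", owner="conn-1")

# conn-1 drops; the client reconnects immediately as conn-2 and
# re-declares the exclusive queue *before* conn-1's cleanup runs.
store.declare("mqtt-subscription-client1", owner="conn-2")

# conn-1's delayed cleanup now fires. With the owner check it is a no-op,
# so the new queue (owned by conn-2) survives:
assert store.delete_checked("mqtt-subscription-client1",
                            expected_owner="conn-1") is False
assert store.queues["mqtt-subscription-client1"] == "conn-2"

# The naive delete would have removed conn-2's queue instead:
store.delete_naive("mqtt-subscription-client1")
assert "mqtt-subscription-client1" not in store.queues
```

The key point is that the delete request carries the owner the caller believes the queue has, and the store treats a mismatch as "different queue, leave it alone".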
Author:

Cherry-pick of 8588cee has failed. To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally
v4.1.x uses Khepri 0.16.0 where khepri_tx_adv:delete returns the old
single_result format ({ok, #{data := _}}), not the new many_results
format ({ok, #{Path := #{data := _}}}) introduced in Khepri 0.17.0.
The original main/v4.2.x version uses khepri_tx:does_api_comply_with/1
to handle both formats, but this function does not exist in Khepri
0.16.0.
Additionally, khepri_path:combine_with_conditions/2 is not in the
Horus allowed function list for transaction functions in Khepri 0.16.0.
Move the path computation outside the transaction function to avoid
the Horus extraction error.
Simplify the pattern matching to use only the Khepri 0.16.0 format
while preserving the fix: conditional deletion using
khepri_path:combine_with_conditions to check exclusive_owner before
deleting.
Force-pushed from 668cb7d to 8e4c549.
[Testing]
Here's an example of how to test before/after:
In both cases, you will almost certainly see that once nodes are restarted, the number of published messages doesn't match the number of consumed messages.
list_queues will almost certainly return fewer than 100 queues before the PR. With this PR, the number of queues and the messages flowing should meet expectations.

This is an automatic backport of pull request #15276 done by Mergify.
This is an automatic backport of pull request #15286 done by Mergify.