-
-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Add xfail test for RabbitMQ quorum queue global QoS race condition #9770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add xfail test for RabbitMQ quorum queue global QoS race condition #9770
Conversation
This test simulates a quorum queue cluster propagation race where the first worker fails quorum detection and others succeed. It starts multiple workers concurrently and expects at least one AMQP error (e.g., 540 NOT_IMPLEMENTED) caused by applying global QoS to a quorum queue. The test is marked xfail since this behavior is a known RabbitMQ limitation.
for more information, see https://pre-commit.ci
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #9770 +/- ##
=======================================
Coverage 78.51% 78.52%
=======================================
Files 153 153
Lines 19130 19130
Branches 2533 2533
=======================================
+ Hits 15020 15021 +1
+ Misses 3821 3819 -2
- Partials 289 290 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
|
tests and CI looks much better now |
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds an integration test to simulate a RabbitMQ quorum queue global QoS race condition during Celery worker startup and refines timing and error-handling patterns in existing integration tests.
- Refactors task tests to consistently use billiard’s multiprocessing and functions from the time module.
- Improves the security test by using apply_async and introducing exception handling with clearer failure messages.
- Introduces a new integration test file simulating the quorum queue QoS race condition.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| t/integration/test_tasks.py | Refactored multiprocessing and timing function usage for clarity and consistency. |
| t/integration/test_security.py | Enhanced error handling in a security test to better capture timeout issues. |
| t/integration/test_quorum_queue_qos_cluster_simulation.py | Added tests to simulate a known RabbitMQ quorum queue QoS race condition. |
| t/integration/conftest.py | Updated configuration file handling and added logging to support smoother test execution. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ition (celery#9770)" This reverts commit 6d8bfd1.
…ition (celery#9770)" This reverts commit 6d8bfd1.
…ition (celery#9770)" This reverts commit 6d8bfd1.
…ition (celery#9770)" This reverts commit 6d8bfd1.
…dition (celery#9770)" This reverts commit c889df4.
…dition (celery#9770)" This reverts commit b10a55c.
…dition (celery#9770)" This reverts commit b10a55c.
…dition (celery#9770)" This reverts commit b10a55c.
* Disabled tests test_multiprocess_producer and test_multithread_producer * Refactor integration tests CI * Disable test_quorum_queue_qos_cluster_simulation.py * Reduce integration tests timeout from 30m -> 20m and increase attempts from 2 -> 3 (fail/retry faster) * Increase max attempts from 3 -> 5 with 1m break between each retry * TMP Dont wait for unit tests * Changed retry settings * Revert "Add xfail test for RabbitMQ quorum queue global QoS race condition (#9770)" This reverts commit 6d8bfd1. * Remove test_quorum_queue_qos_cluster_simulation.py from CI * Prevent the billiard QueueListener from deadlocking during worker shutdown * Revert "TMP Dont wait for unit tests" This reverts commit 3da612d. * Run smoke if integration passed * Changed retry settings * Disable test_groupresult_serialization * timeout 5, attempts 10, instead of 30m x 2 attempts * Split integration test jobs to be test per module * Disabled dep with unit tests (for faster testing) * Run all integration jobs together * Split smoke test jobs * Simplifed python-package.yml * Cleanup * max-parallel: 4 * fixed smoked tests ci * Removed Python 3.10-3.12 from the integration and smoke tests CI * Integration & Smoke run only if unit tests pass * Reduced more python versions for now (integration min/max, smoke-max) * Revert "Prevent the billiard QueueListener from deadlocking during worker shutdown" This reverts commit cebcb2b. * Reapply "Add xfail test for RabbitMQ quorum queue global QoS race condition (#9770)" This reverts commit b10a55c. * Added back `test_quorum_queue_qos_cluster_simulation` to the integration tests * max-parallel: 5 * Revert "Disable test_groupresult_serialization" This reverts commit ade8cbd.
Description:
This PR adds an integration test that simulates a RabbitMQ quorum queue cluster visibility race during Celery worker startup.
The test forces a quorum detection failure on the first worker, introduces a delay in queue declaration, and starts multiple workers concurrently. It asserts that at least one worker hits an AMQP-level error (e.g., 540 NOT_IMPLEMENTED or connection loss) caused by attempting to apply global QoS on a quorum queue during the race.
The test is marked xfail since this is a known RabbitMQ limitation.
This behavior has been seen in a production environment with
celery 5.5.3,kombu 5.5.2, andrabbitmq 3.13.7-management, where the broker is clustered.❌ Observed Behavior
When multiple workers start concurrently:
The first worker fails quorum detection (simulated fault).
Subsequent workers proceed with quorum detection and start normally.
At least one worker triggers a RabbitMQ 540 NOT_IMPLEMENTED error or similar AMQP-level failure due to global QoS being applied to a quorum queue.
✅ Expected Behavior
At least one worker should experience a broker-side failure (540 NOT_IMPLEMENTED or equivalent AMQP error) as a result of this race condition.
Test Status
The test is marked xfail to acknowledge the known RabbitMQ limitation regarding global QoS on quorum queues.
Test intentionally expects failure (AMQP error presence) to validate that the race condition surfaces as designed.