Skip to content

Conversation

@maccinza
Copy link
Contributor

Description:

This PR adds an integration test that simulates a RabbitMQ quorum queue cluster visibility race during Celery worker startup.

The test forces a quorum detection failure on the first worker, introduces a delay in queue declaration, and starts multiple workers concurrently. It asserts that at least one worker hits an AMQP-level error (e.g., 540 NOT_IMPLEMENTED or connection loss) caused by attempting to apply global QoS on a quorum queue during the race.

The test is marked xfail since this is a known RabbitMQ limitation.

This behavior has been seen in a production environment with celery 5.5.3, kombu 5.5.2, and rabbitmq 3.13.7-management, where the broker is clustered.

❌ Observed Behavior

When multiple workers start concurrently:

The first worker fails quorum detection (simulated fault).
Subsequent workers proceed with quorum detection and start normally.
At least one worker triggers a RabbitMQ 540 NOT_IMPLEMENTED error or similar AMQP-level failure due to global QoS being applied to a quorum queue.

✅ Expected Behavior

At least one worker should experience a broker-side failure (540 NOT_IMPLEMENTED or equivalent AMQP error) as a result of this race condition.

Test Status

The test is marked xfail to acknowledge the known RabbitMQ limitation regarding global QoS on quorum queues.
Test intentionally expects failure (AMQP error presence) to validate that the race condition surfaces as designed.

maccinza and others added 3 commits June 18, 2025 18:59
This test simulates a quorum queue cluster propagation race where the first worker fails quorum detection and others succeed. It starts multiple workers concurrently and expects at least one AMQP error (e.g., 540 NOT_IMPLEMENTED) caused by applying global QoS to a quorum queue.

The test is marked xfail since this behavior is a known RabbitMQ limitation.
@codecov
Copy link

codecov bot commented Jun 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.52%. Comparing base (de104a2) to head (9178e43).
⚠️ Report is 45 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9770   +/-   ##
=======================================
  Coverage   78.51%   78.52%           
=======================================
  Files         153      153           
  Lines       19130    19130           
  Branches     2533     2533           
=======================================
+ Hits        15020    15021    +1     
+ Misses       3821     3819    -2     
- Partials      289      290    +1     
Flag Coverage Δ
unittests 78.49% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@auvipy auvipy requested review from Nusnus, auvipy and Copilot June 19, 2025 05:11

This comment was marked as outdated.

auvipy and others added 2 commits June 19, 2025 05:15
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@auvipy auvipy added this to the 5.5.4 milestone Jun 19, 2025
@maccinza maccinza requested a review from auvipy June 27, 2025 17:22
@maccinza maccinza marked this pull request as ready for review June 27, 2025 17:22
@auvipy
Copy link
Member

auvipy commented Jun 28, 2025

tests and CI looks much better now

@maccinza maccinza requested a review from auvipy June 30, 2025 13:56
@auvipy auvipy requested a review from Copilot June 30, 2025 15:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds an integration test to simulate a RabbitMQ quorum queue global QoS race condition during Celery worker startup and refines timing and error-handling patterns in existing integration tests.

  • Refactors task tests to consistently use billiard’s multiprocessing and functions from the time module.
  • Improves the security test by using apply_async and introducing exception handling with clearer failure messages.
  • Introduces a new integration test file simulating the quorum queue QoS race condition.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
t/integration/test_tasks.py Refactored multiprocessing and timing function usage for clarity and consistency.
t/integration/test_security.py Enhanced error handling in a security test to better capture timeout issues.
t/integration/test_quorum_queue_qos_cluster_simulation.py Added tests to simulate a known RabbitMQ quorum queue QoS race condition.
t/integration/conftest.py Updated configuration file handling and added logging to support smoother test execution.

auvipy and others added 2 commits June 30, 2025 22:03
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@auvipy auvipy merged commit 6d8bfd1 into celery:main Jul 1, 2025
123 of 124 checks passed
@auvipy auvipy modified the milestones: 5.5.4, 5.6.0 Jul 13, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 12, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 12, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 12, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 12, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 13, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 25, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 25, 2025
Nusnus added a commit to Nusnus/celery that referenced this pull request Aug 25, 2025
Nusnus added a commit that referenced this pull request Aug 26, 2025
* Disabled tests test_multiprocess_producer and test_multithread_producer

* Refactor integration tests CI

* Disable test_quorum_queue_qos_cluster_simulation.py

* Reduce integration tests timeout from 30m -> 20m and increase attempts from 2 -> 3 (fail/retry faster)

* Increase max attempts from 3 -> 5 with 1m break between each retry

* TMP Dont wait for unit tests

* Changed retry settings

* Revert "Add xfail test for RabbitMQ quorum queue global QoS race condition (#9770)"

This reverts commit 6d8bfd1.

* Remove test_quorum_queue_qos_cluster_simulation.py from CI

* Prevent the billiard QueueListener from deadlocking during worker shutdown

* Revert "TMP Dont wait for unit tests"

This reverts commit 3da612d.

* Run smoke if integration passed

* Changed retry settings

* Disable test_groupresult_serialization

* timeout 5, attempts 10, instead of 30m x 2 attempts

* Split integration test jobs to be test per module

* Disabled dep with unit tests (for faster testing)

* Run all integration jobs together

* Split smoke test jobs

* Simplifed python-package.yml

* Cleanup

* max-parallel: 4

* fixed smoked tests ci

* Removed Python 3.10-3.12 from the integration and smoke tests CI

* Integration & Smoke run only if unit tests pass

* Reduced more python versions for now (integration min/max, smoke-max)

* Revert "Prevent the billiard QueueListener from deadlocking during worker shutdown"

This reverts commit cebcb2b.

* Reapply "Add xfail test for RabbitMQ quorum queue global QoS race condition (#9770)"

This reverts commit b10a55c.

* Added back `test_quorum_queue_qos_cluster_simulation` to the integration tests

* max-parallel: 5

* Revert "Disable test_groupresult_serialization"

This reverts commit ade8cbd.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants