fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers #9371
Conversation
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #9371      +/-  ##
==========================================
+ Coverage   78.63%   78.69%   +0.06%
==========================================
  Files         153      153
  Lines       19222    19243      +21
  Branches     2555     2557       +2
==========================================
+ Hits        15115    15144      +29
+ Misses       3810     3801       -9
- Partials      297      298       +1

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from 13e673e to 0cfb89d.
Nusnus left a comment:
Please add tests to make sure the changes are behaving as expected 🙏
Thank you
Hi @Nusnus, thank you for looking at this PR. 🙏 I'm new here, and Python isn't my primary language, so I may have rushed this. Let me make sure I can run this locally and add the tests. While I look into that, would you be able to review the approach here and confirm whether these changes are the right behavior?
Quick update: we think the PR is ready for a first-pass review, and we won't be pushing any more commits at the moment. We also think the code-coverage check is cached or something, because we're seeing a check warning stating that "line #L150 was not covered by tests". The only uncovered part is where this discussion thread is. Happy to add tests if we reach a conclusion about the specific error and whether this is the approach we'd like to take here.
Hey @Nusnus, I don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but we'd love to upstream it (or get it into a state you'd be willing to take on). Thanks!
I'll give it another look this week; let's see what I can do to help.
auvipy left a comment:
Can you also add some integration tests for the change, please?
@auvipy happy to take a look at smoke tests, but is there anything in particular you're looking for coverage on? The only behavior change should be for users who are already on a non-happy path; the happy path shouldn't have changed. My gut says this may be tough to test without stubbing, as it'd require killing Redis from within the testing process (and then restarting it for other tests?). Let me know your thoughts; we'll still take a look today either way!
I stand corrected: it does look like you have support for initializing/killing containers in the smoke tests already -- we'll see what we can do :)
Check out the pytest-celery docs for the smoke tests: https://pytest-celery.readthedocs.io |
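For reference, a smoke test for this fix might look roughly like the sketch below. This is an assumption-heavy sketch: it presumes pytest-celery's `celery_setup` fixture and `RESULT_TIMEOUT` constant, an `identity` task in `t.smoke.tasks`, and a `kill()` method on the backend node; verify the exact names against the docs linked above.

```python
# Rough sketch only -- the fixture, constant, task, and kill() method
# are assumptions to be checked against the pytest-celery docs.
import pytest
from pytest_celery import RESULT_TIMEOUT, CeleryTestSetup

from t.smoke.tasks import identity  # assumed existing smoke-test task


class test_drainer_error_propagation:
    def test_client_errors_instead_of_hanging(self, celery_setup: CeleryTestSetup):
        queue = celery_setup.worker.worker_queue

        # Happy path: results come back normally.
        res = identity.s(1).apply_async(queue=queue)
        assert res.get(timeout=RESULT_TIMEOUT) == 1

        # Kill the result backend so the drainer's greenlet hits a
        # connection error while draining results.
        celery_setup.backend.kill()

        # The client should now raise instead of blocking forever.
        with pytest.raises(Exception):
            identity.s(2).apply_async(queue=queue).get(timeout=RESULT_TIMEOUT)
```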
CI issues fixed; 100% passing now.
auvipy left a comment:
Can we improve the test coverage?
@auvipy I've been struggling to get the smoke tests to run locally, and I don't want to spam you all / force CI to run broken tests while we figure it out. I've got a bit of time over the holiday break to make a second attempt, but I think we may be close to (or hitting) our limit in terms of being able to contribute effectively here :/ For what it's worth, I don't believe the Codecov comment on this PR is accurate; I think it was cached from earlier commits to the branch that didn't add tests.
What issues are you having?
@Nusnus how do you all recommend running the smoke tests locally? I was kind of hoping there'd be a … Where I left off was permission issues in the container when invoking tox; I need to figure out which dirs need to be mounted as writable and hope that's the last blocker to execution. I haven't spent much time trying to get the tests to run outside Docker, but it seems like some of the deps are missing prebuilt wheels for Apple silicon and need extra system deps -- I didn't really want to install anything into my env unless really necessary.
Force-pushed from 6bfba65 to 22eefcf.
Copilot left a comment:
Pull Request Overview
This PR fixes an issue where Celery hangs indefinitely when errors occur in spawned greenlets from greenlet drainers. Previously, errors were only logged, causing clients to wait forever for task results that would never be fetched due to the stopped greenlet. The fix ensures errors are propagated back to clients, enabling proper error handling and recovery.
Key changes:
- Added error propagation mechanism in greenlet drainers to surface exceptions to clients (see the sketch after this list)
- Unified event handling across eventlet and gevent drainers with consistent API
- Enhanced test coverage for greenlet error scenarios
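To make the first change concrete, here is a minimal, hypothetical sketch of the error-propagation pattern. Plain threads stand in for eventlet/gevent greenlets, and `Drainer`, `_drain_events`, and `wait_for` are illustrative names, not the actual internals of `celery/backends/asynchronous.py`:

```python
# Minimal sketch of the error-propagation pattern; names and the use of
# plain threads (instead of greenlets) are assumptions for illustration.
import threading


class Drainer:
    def __init__(self):
        self._exc = None      # exception captured from the spawned worker
        self._thread = None

    def _drain_events(self):
        # Simulated failure inside the drainer's spawned worker.
        raise ConnectionError("broker went away")

    def _run(self):
        try:
            self._drain_events()
        except Exception as exc:
            # Before the fix: the error was only logged and the worker
            # silently died. Here we store it for the waiting client.
            self._exc = exc

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def wait_for(self, timeout=1.0):
        self._thread.join(timeout)  # stand-in for waiting on results
        if self._exc is not None:
            raise self._exc         # propagate instead of hanging


drainer = Drainer()
drainer.start()
try:
    drainer.wait_for()
except ConnectionError as exc:
    print(f"propagated: {exc}")
```

The key design choice is that the spawned worker stores its exception rather than merely logging it, and the waiting side re-raises it, so callers fail fast instead of blocking forever.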
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| celery/backends/asynchronous.py | Core fix implementing error propagation in greenlet drainers and unified event handling |
| celery/backends/redis.py | Modified exception handling to raise RuntimeError with descriptive message on connection retry exhaustion |
| t/unit/backends/test_asynchronous.py | Added comprehensive test coverage for greenlet error scenarios and drainer behavior |
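As a rough illustration of the `celery/backends/redis.py` row above, the shape of a "raise instead of only log" change on retry exhaustion might look like the following sketch; the function name, `max_retries` default, and message text are assumptions, not the exact patch:

```python
# Sketch of raising RuntimeError when connection retries are exhausted;
# ensure_connection and its signature are illustrative assumptions.
import logging

logger = logging.getLogger(__name__)


def ensure_connection(reconnect, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return reconnect()
        except ConnectionError as exc:
            logger.warning("reconnect attempt %s/%s failed: %s",
                           attempt, max_retries, exc)
    # Previously the exhausted retries were only logged; raising lets the
    # drainer propagate the failure to the waiting client.
    raise RuntimeError("Connection to backend lost: retry limit exceeded.")


def always_fail():
    raise ConnectionError("redis is down")


try:
    ensure_connection(always_fail)
except RuntimeError as exc:
    print(exc)
```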
Force-pushed from 0cb11ea to 1bf72dc.
Fixes #4857
Description
This PR fixes an issue where an error raised in the spawned greenlet from the greenlet drainer causes the drainer to stop retrieving task results. Currently, these errors are only logged, making it difficult for clients to handle the situation effectively. As a result, a client may wait indefinitely for task results that will never be fetched, since the greenlet from the drainer has already stopped running. This change ensures that an error is thrown back to clients, enabling them to handle the error appropriately, such as by exiting or restarting the process.
Here are the steps we took in our testing to reproduce the issue (a repro sketch follows the list):
- Use `gevent` workers that allow for multiple connections.
- Submit tasks using the `delay` method.
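Below is a hypothetical minimal reproduction based on those steps. The module name `tasks`, the Redis URLs, and the `add` task are assumptions for illustration:

```python
# Hypothetical reproduction script for the hang described above.
from gevent import monkey

monkey.patch_all()  # a monkey-patched client selects the gevent drainer

from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)


@app.task
def add(x, y):
    return x + y


if __name__ == "__main__":
    # Run the worker separately, e.g.:
    #   celery -A tasks worker --pool=gevent --concurrency=10
    result = add.delay(2, 2)
    # Restart Redis while this call is waiting. Before the fix, the
    # drainer greenlet died silently and get() blocked forever; with
    # the fix, the error is raised back to the caller.
    print(result.get(timeout=30))
```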