fix: prevent celery from hanging due to spawned greenlet errors in greenlet drainers #9371
Conversation
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #9371      +/-  ##
==========================================
+ Coverage   78.63%   78.69%   +0.06%
==========================================
  Files         153      153
  Lines       19222    19243      +21
  Branches     2555     2557       +2
==========================================
+ Hits        15115    15144      +29
+ Misses       3810     3801       -9
- Partials      297      298       +1

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Force-pushed from 13e673e to 0cfb89d.
Nusnus left a comment:
Please add tests to make sure the changes are behaving as expected 🙏
Thank you
Hi @Nusnus, thank you for looking at this PR. 🙏 I'm new here, and Python isn't my primary language, so I may have rushed this. Let me make sure I can run this locally and add the tests. While I look into that, would you be able to review the approach here and confirm whether these changes are the right behavior?
Quick update: we think the PR is ready for a first-pass review, and we won't be pushing any more commits at the moment. We also think the code-coverage check is cached or something, because we're seeing a check warning stating that "line #L150 was not covered by tests". The only uncovered part is where this discussion thread is. Happy to add tests if we reach a conclusion about the specific error and whether this is the approach we'd like to take here.
Hey @Nusnus, I don't mean to be a bother, but is there anything we can do to help move this along? We're using this fork in production, but we'd love to upstream it (or get it into a state you'd be willing to take on). Thanks!
I'll give it another look this week; let's see what I can do to help.
auvipy left a comment:
Can you also add some integration tests for the change, please?
@auvipy happy to take a look at smoke tests, but is there anything in particular you're looking for coverage on? The only behavior change should be for users who are already on a non-happy path; the happy path shouldn't have changed. My gut says this may be tough to test without stubbing, as it'd require killing Redis from within the testing process (and then restarting it for other tests?). Let me know your thoughts; we'll still take a look today either way!
I stand corrected: it does look like you have support for initializing/killing containers in the smoke tests already -- we'll see what we can do :)
Check out the pytest-celery docs for the smoke tests: https://pytest-celery.readthedocs.io |
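For reference, a smoke test for this fix might look roughly like the sketch below. This is an assumption-heavy sketch: it presumes pytest-celery's `celery_setup` fixture and `RESULT_TIMEOUT` constant, an `identity` task in `t.smoke.tasks`, and a `kill()` method on the backend node; verify the exact names against the docs linked above.

```python
# Rough sketch only -- the fixture, constant, task, and kill() method
# are assumptions to be checked against the pytest-celery docs.
import pytest
from pytest_celery import RESULT_TIMEOUT, CeleryTestSetup

from t.smoke.tasks import identity  # assumed existing smoke-test task


class test_drainer_error_propagation:
    def test_client_errors_instead_of_hanging(self, celery_setup: CeleryTestSetup):
        queue = celery_setup.worker.worker_queue

        # Happy path: results come back normally.
        res = identity.s(1).apply_async(queue=queue)
        assert res.get(timeout=RESULT_TIMEOUT) == 1

        # Kill the result backend so the drainer's greenlet hits a
        # connection error while draining results.
        celery_setup.backend.kill()

        # The client should now raise instead of blocking forever.
        with pytest.raises(Exception):
            identity.s(2).apply_async(queue=queue).get(timeout=RESULT_TIMEOUT)
```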
CI issues fixed; 100% passing now.
auvipy left a comment:
Can we improve the test coverage?
@auvipy I've been struggling to get the smoke tests to run locally, and I don't want to spam you all / force CI to run broken tests while we figure it out. I've got a bit of time over the holiday break to make a second attempt, but I think we may be close to (or hitting) our limit in terms of being able to contribute effectively here :/ For what it's worth, I don't believe the Codecov comment on this PR is accurate; I think it was cached from earlier commits to the branch that didn't add tests.
What issues are you having?
@Nusnus how do you all recommend running the smoke tests locally? I was kind of hoping there'd be a … Where I left off was permission issues in the container when invoking tox; I need to figure out which dirs need to be mounted as writable and hope that's the last blocker to execution. I haven't spent much time trying to get the tests to run outside Docker, but it seems like some of the deps are missing prebuilt wheels for Apple silicon and need extra system deps -- I didn't really want to install anything into my env unless really necessary.
Force-pushed from 6bfba65 to 22eefcf.
Copilot left a comment:
Pull Request Overview
This PR fixes an issue where Celery hangs indefinitely when errors occur in spawned greenlets from greenlet drainers. Previously, errors were only logged, causing clients to wait forever for task results that would never be fetched due to the stopped greenlet. The fix ensures errors are propagated back to clients, enabling proper error handling and recovery.
Key changes:
- Added error propagation mechanism in greenlet drainers to surface exceptions to clients (see the sketch after this list)
- Unified event handling across eventlet and gevent drainers with consistent API
- Enhanced test coverage for greenlet error scenarios
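To make the first change concrete, here is a minimal, hypothetical sketch of the error-propagation pattern. Plain threads stand in for eventlet/gevent greenlets, and `Drainer`, `_drain_events`, and `wait_for` are illustrative names, not the actual internals of `celery/backends/asynchronous.py`:

```python
# Minimal sketch of the error-propagation pattern; names and the use of
# plain threads (instead of greenlets) are assumptions for illustration.
import threading


class Drainer:
    def __init__(self):
        self._exc = None      # exception captured from the spawned worker
        self._thread = None

    def _drain_events(self):
        # Simulated failure inside the drainer's spawned worker.
        raise ConnectionError("broker went away")

    def _run(self):
        try:
            self._drain_events()
        except Exception as exc:
            # Before the fix: the error was only logged and the worker
            # silently died. Here we store it for the waiting client.
            self._exc = exc

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def wait_for(self, timeout=1.0):
        self._thread.join(timeout)  # stand-in for waiting on results
        if self._exc is not None:
            raise self._exc         # propagate instead of hanging


drainer = Drainer()
drainer.start()
try:
    drainer.wait_for()
except ConnectionError as exc:
    print(f"propagated: {exc}")
```

The key design choice is that the spawned worker stores its exception rather than merely logging it, and the waiting side re-raises it, so callers fail fast instead of blocking forever.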
Reviewed Changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| celery/backends/asynchronous.py | Core fix implementing error propagation in greenlet drainers and unified event handling |
| celery/backends/redis.py | Modified exception handling to raise RuntimeError with descriptive message on connection retry exhaustion |
| t/unit/backends/test_asynchronous.py | Added comprehensive test coverage for greenlet error scenarios and drainer behavior |
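As a rough illustration of the `celery/backends/redis.py` row above, the shape of a "raise instead of only log" change on retry exhaustion might look like the following sketch; the function name, `max_retries` default, and message text are assumptions, not the exact patch:

```python
# Sketch of raising RuntimeError when connection retries are exhausted;
# ensure_connection and its signature are illustrative assumptions.
import logging

logger = logging.getLogger(__name__)


def ensure_connection(reconnect, max_retries=3):
    for attempt in range(1, max_retries + 1):
        try:
            return reconnect()
        except ConnectionError as exc:
            logger.warning("reconnect attempt %s/%s failed: %s",
                           attempt, max_retries, exc)
    # Previously the exhausted retries were only logged; raising lets the
    # drainer propagate the failure to the waiting client.
    raise RuntimeError("Connection to backend lost: retry limit exceeded.")


def always_fail():
    raise ConnectionError("redis is down")


try:
    ensure_connection(always_fail)
except RuntimeError as exc:
    print(exc)
```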
Force-pushed from 0cb11ea to 1bf72dc.
Fixes #4857
Description
This PR fixes an issue where an error raised in the spawned greenlet from the greenlet drainer causes the drainer to stop retrieving task results. Currently, these errors are only logged, making it difficult for clients to handle the situation effectively. As a result, a client may wait indefinitely for task results that will never be fetched, since the greenlet from the drainer has already stopped running. This change ensures that an error is thrown back to clients, enabling them to handle the error appropriately, such as by exiting or restarting the process.
Here are the steps we took in our testing to reproduce the issue (a repro sketch follows the list):
- Use `gevent` workers that allow for multiple connections.
- Submit tasks using the `delay` method.
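Below is a hypothetical minimal reproduction based on those steps. The module name `tasks`, the Redis URLs, and the `add` task are assumptions for illustration:

```python
# Hypothetical reproduction script for the hang described above.
from gevent import monkey

monkey.patch_all()  # a monkey-patched client selects the gevent drainer

from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)


@app.task
def add(x, y):
    return x + y


if __name__ == "__main__":
    # Run the worker separately, e.g.:
    #   celery -A tasks worker --pool=gevent --concurrency=10
    result = add.delay(2, 2)
    # Restart Redis while this call is waiting. Before the fix, the
    # drainer greenlet died silently and get() blocked forever; with
    # the fix, the error is raised back to the caller.
    print(result.get(timeout=30))
```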