
Conversation


@Nusnus Nusnus commented Apr 29, 2025

Fixes #9505

After the soft shutdown is over, the cold shutdown cannot honor the error-handling flow: the worker must terminate as fast as possible, so we skip the timeout-failure handling in that specific edge case.
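The gist, as a minimal sketch assuming a hypothetical `worker_state.terminating` flag and handler names (not Celery's actual internals):

```python
from celery.exceptions import TimeLimitExceeded

def on_hard_timeout(request, worker_state):
    """Handle a task that blew past its hard time limit."""
    if worker_state.terminating:
        # Cold shutdown is in progress: the worker must exit as fast as
        # possible, so skip the failure report and backend update entirely.
        return
    # Normal path: record the timeout failure against the task.
    request.on_failure(TimeLimitExceeded())
```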

Summary by CodeRabbit

  • Bug Fixes

    • Improved worker shutdown handling to prevent redundant task failure reporting and backend updates during cold shutdown.
    • Enhanced cancellation logic to ensure successful but unacknowledged tasks are not canceled during shutdown or connection loss (see the sketch after this list).
  • Tests

    • Added tests to verify tasks complete successfully during soft shutdown, including scenarios with task time limits.
    • Introduced unit tests to confirm correct suppression of failure and timeout handling when the worker is terminating.
    • Added tests to ensure successful tasks are preserved during cancellation of unacknowledged requests.
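A rough illustration of the cancellation guard mentioned above, using hypothetical names (`cancel_unacked`, `already_succeeded`) rather than Celery's actual API:

```python
def cancel_unacked(requests):
    """Cancel unacknowledged requests on shutdown or connection loss."""
    for request in requests:
        if request.already_succeeded:
            # The task already finished and stored a SUCCESS result;
            # cancelling it now would clobber that result (e.g. with RETRY).
            continue
        request.cancel()
```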


codecov bot commented Apr 29, 2025

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.68%. Comparing base (30649db) to head (97a8ba5).
⚠️ Report is 1 commit behind head on main.

Files with missing lines    Patch %    Lines
celery/apps/worker.py       0.00%      8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9678      +/-   ##
==========================================
- Coverage   78.68%   78.68%   -0.01%     
==========================================
  Files         153      153              
  Lines       19314    19328      +14     
  Branches     2214     2221       +7     
==========================================
+ Hits        15198    15208      +10     
- Misses       3817     3822       +5     
+ Partials      299      298       -1     
Flag         Coverage Δ
unittests    78.66% <74.19%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@Nusnus Nusnus marked this pull request as ready for review April 29, 2025 00:51
@Nusnus Nusnus added this to the 5.5.3 milestone Apr 29, 2025
@auvipy auvipy requested a review from Copilot April 29, 2025 05:11


@auvipy auvipy left a comment


Should we consider improving test coverage and, if possible, adding some integration tests? What do you think?


Nusnus commented May 4, 2025

@daveisfera do you want to give this PR a test to confirm it works for you as well?

daveisfera added a commit to daveisfera/celery_cold_shutdown that referenced this pull request May 5, 2025
@daveisfera
Contributor

@daveisfera do you want to give this PR a test to confirm it works for you as well?

I tested this fix and it appears to be handling it correctly now, but the status is still incorrectly set to RETRY even though the task finished.

Here's the output:

2025-05-05T16:32:25.134218717Z [2025-05-05 16:32:25,133: INFO/MainProcess] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] received
2025-05-05T16:32:25.135509561Z [2025-05-05 16:32:25,135: WARNING/ForkPoolWorker-1] Value: 8
2025-05-05T16:32:25.322559812Z 
2025-05-05T16:32:25.322610489Z worker: Hitting Ctrl+C again will terminate all running tasks!
2025-05-05T16:32:25.322829567Z [2025-05-05 16:32:25,322: WARNING/MainProcess] Initiating Soft Shutdown, terminating in 16 seconds
2025-05-05T16:32:33.145373138Z [2025-05-05 16:32:33,144: WARNING/ForkPoolWorker-1] Done: 8
2025-05-05T16:32:33.162053417Z [2025-05-05 16:32:33,161: INFO/ForkPoolWorker-1] Task mysite.celery.long_task[562e669b-41f4-4e7b-a188-e36f71560b9f] succeeded in 8.02601714399998s: None
2025-05-05T16:32:41.381467890Z 
2025-05-05T16:32:41.381502893Z worker: Cold shutdown (MainProcess)

And here's the result saved in the database:

sqlite> SELECT * FROM django_celery_results_taskresult ORDER BY id DESC LIMIT 10;
1|562e669b-41f4-4e7b-a188-e36f71560b9f|RETRY|application/json|utf-8|{"exc_type": "Retry", "exc_message": ["Retry(Retry(...), None, None)", null, null], "exc_module": "celery.exceptions"}|2025-05-05 16:32:41.348411||{"children": []}|"(8,)"|"{}"|mysite.celery.long_task|2025-05-05 16:32:33.155238|celery@celery|


Nusnus commented May 8, 2025

I tested this fix and it appears to be handling it correctly now, but the status is still incorrectly set to RETRY even though the task finished.

Thank you, I’ll check it out

@Nusnus Nusnus force-pushed the hotfix branch 2 times, most recently from c034f3a to 06dd9b3 Compare May 18, 2025 19:40
@daveisfera
Contributor

For the sake of documentation: I tried this latest version (06dd9b3) and still had the issue of the status being recorded as RETRY.

@Nusnus Nusnus force-pushed the hotfix branch 2 times, most recently from 6eecfaf to 5a62a26 Compare May 31, 2025 19:45
@Nusnus Nusnus marked this pull request as draft May 31, 2025 19:54

Nusnus commented May 31, 2025

@daveisfera

For the sake of documentation, I tried this latest version (06dd9b3) and still had the issue of recording the status as RETRY

I’ve added a potential fix in 5a62a26

The idea is that canceling a successful task changes the status to RETRY, so avoiding it should fix this in theory.
Can you please verify the latest version of this PR works as expected?

Thank you!
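To make that idea concrete, here is a rough illustration of the failure mode and the guard, using hypothetical request/backend helper names rather than the actual 5a62a26 diff (`mark_as_retry` does exist on Celery result backends):

```python
from celery.exceptions import Retry

def cancel_request(request, backend):
    # Cancelling requeues the task, which records a RETRY state in the
    # result backend -- even if a SUCCESS result was already stored.
    backend.mark_as_retry(request.id, Retry())

def cancel_request_safely(request, backend):
    # Guard: never cancel a request whose task already succeeded,
    # so the stored SUCCESS result is preserved.
    if request.state == "SUCCESS":
        return
    cancel_request(request, backend)
```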

@Nusnus Nusnus marked this pull request as ready for review June 1, 2025 20:02
auvipy previously approved these changes Jun 2, 2025

@auvipy auvipy left a comment


looks good to me

@daveisfera
Contributor

I’ve added a potential fix in 5a62a26

The idea is that canceling a successful task changes the status to RETRY, so avoiding it should fix this in theory. Can you please verify the latest version of this PR works as expected?

The status is still set to RETRY even though the task completes successfully. That doesn't reflect what actually happened and seems "wrong", so why not set it to SUCCESS to match the actual result?

daveisfera added a commit to daveisfera/celery_cold_shutdown that referenced this pull request Jun 3, 2025
@auvipy auvipy modified the milestones: 5.5.3, 5.5.4 Jun 10, 2025


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


Nusnus commented Nov 16, 2025

This issue is much more challenging than what any AI can handle right now; avoid trusting it here @auvipy

I'm giving it another shot right now, please do not make any changes to the PR/branch 🙏


auvipy commented Nov 16, 2025

This issue is much more challenging than what any AI can handle right now; avoid trusting it here @auvipy

I'm giving it another shot right now, please do not make any changes to the PR/branch 🙏

no worries, not going to merge anything here from it, as I already got fooled once by him! :p


Nusnus commented Nov 16, 2025

Could be related: Ensure that task results are delivered during pool shutdown #435

Just for the record: this doesn't solve it.
Tested on the provided bug reproduction env.


Nusnus commented Nov 16, 2025

@daveisfera

Ran my test with these latest changes and it's still failing. I noticed that it correctly sets the task to SUCCESS when it completes, but then when the cold shutdown happens it sets it to RETRY, and there's a delay in the shutdown even though the task is complete, so I think that's related to the incorrect setting of the status

Potential fix pushed, ready for QA!
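For reference, a hedged sketch of what this QA amounts to, assuming a hypothetical `long_task` like the one in the logs above and a configured result backend:

```python
import time
from celery import Celery

app = Celery("mysite", broker="redis://", backend="redis://")

@app.task
def long_task(value):
    time.sleep(8)  # long enough to be caught mid-run by the soft shutdown
    return value

result = long_task.delay(8)
# Trigger soft shutdown (Ctrl+C) while the task runs, then let the cold
# shutdown follow. The stored state should remain SUCCESS instead of being
# flipped to RETRY as before this fix:
# assert result.state == "SUCCESS"
```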

@daveisfera
Contributor

Potential fix pushed, ready for QA!

Saaaaweeeet!!!! Yes, I can confirm that my test repo now works correctly!


Nusnus commented Nov 17, 2025

Potential fix pushed, ready for QA!

Saaaaweeeet!!!! Yes, I can confirm that my test repo now works correctly!

Woohoo 🎉

Thank you for the quick confirmation!

@Nusnus Nusnus merged commit 63c1910 into celery:main Nov 17, 2025
106 of 107 checks passed
@Nusnus Nusnus deleted the hotfix branch November 17, 2025 21:40
@Nusnus Nusnus modified the milestones: 5.6.x, 5.6.0 Nov 18, 2025

Development

Successfully merging this pull request may close these issues.

Tasks completed during a cold shutdown incorrectly timeout with 5.5

3 participants