Handle stuck queued tasks in Celery for db backend #19769
ephraimbuddy merged 17 commits into apache:main
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

There are some sqlite tests failing here, likely related.
@ephraimbuddy Glad to report that the scheduler and system are still responsive 24+ hours after the deployment. Looks like... all good... knock on wood...

Thanks @kristoffern. Could you please update with the recent changes in #21556?

Any update?

@ephraimbuddy Sorry, but I have limited time for this at the moment and won't have the bandwidth to test it.

    if self.adopted_task_timeouts:
        self._check_for_stalled_adopted_tasks()
    if time.time() - self.stuck_tasks_last_check_time > self.stuck_queued_task_check_interval:

Won't this keep getting called every sync cycle? Shouldn't stuck_tasks_last_check_time be updated to the current time here?
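
To illustrate the concern, here is a minimal sketch of an interval-gated check. The class `IntervalGate`, its `clock` parameter, and the placeholder method `_clear_stuck_queued_tasks` are hypothetical and are not Airflow's executor code; only the attribute names `stuck_tasks_last_check_time` and `stuck_queued_task_check_interval` come from the snippet under review.

```python
import time


class IntervalGate:
    """Hypothetical stand-in for the executor's periodic stuck-task check."""

    def __init__(self, stuck_queued_task_check_interval: float, clock=time.time):
        self.stuck_queued_task_check_interval = stuck_queued_task_check_interval
        self._clock = clock  # injectable so the gate can be exercised deterministically
        self.stuck_tasks_last_check_time = clock()
        self.checks_run = 0

    def _clear_stuck_queued_tasks(self) -> None:
        # Placeholder for the real (and potentially expensive) database work.
        self.checks_run += 1

    def sync(self) -> None:
        now = self._clock()
        if now - self.stuck_tasks_last_check_time > self.stuck_queued_task_check_interval:
            # Without this reset, the condition stays true on every later
            # sync() call once the first interval has elapsed.
            self.stuck_tasks_last_check_time = now
            self._clear_stuck_queued_tasks()
```

With the reset in place, repeated `sync()` calls inside one interval run the check at most once; removing the reset line makes every call after the first interval run it, which is the behavior the comment asks about.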

Is there an issue you can link with respect to why it needs a reversion? I do recall having some issue w/ tasks being submitted to celery during a deployment and getting left in queued limbo, but I'm unsure what about the current code is problematic.

Since you are no longer having this issue, we should revert. I called your attention to this because of #21225, which is related.

Thanks for the work on this @ephraimbuddy. I am not able to test right now. Our previous usage pattern was farming out a ton of celery tasks that were monitoring work being done by a remote service (Batch). We've since moved these to use the deferred pattern to monitor these tasks (this was a great thing to gain in the most recent upgrade). So while I think the scenario from my bug report still exists, the chances of us hitting it are drastically lower (and we haven't hit it again yet) as the majority of our tasks are now deferred.

@ephraimbuddy since we switched our message broker from redis to SQS 2 years ago, we haven't experienced this issue (it only happened under high load).

@ephraimbuddy we can reproduce this reliably in an isolated test environment, and would love to help test this fix to get it merged. We're running with a redis broker and pgsql results db, on k8s with KEDA auto scaling. It happens under load during a scale-in event (shutting down a worker). We've spent a bit of time tracking it down and (at least in our case) it looks to be a problem in celery (possibly celery/celery#7266). Airflow throws the task at celery and it just never executes, never makes it into the I do have some concerns around

Hi @repl-chris, thanks for the detailed description.

On testing #19769, it was reported that there was a spike in CPU usage: apache/airflow#19769 (comment)
Hopefully, this will fix it.
GitOrigin-RevId: a49224fa7ce45e9765c0d752edc30430e0d3ce14

Move the state of stuck queued tasks in Celery to Scheduled so that
the Scheduler can queue them again
Closes: #19699
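
The idea in the description can be sketched roughly as follows. This is an in-memory illustration only, not the code merged in this PR: `TaskRecord`, `reschedule_stuck_queued`, the `known_to_celery` set, and the `timeout` parameter are all hypothetical names, and the assumption that a task counts as stuck when it has been queued longer than a timeout and Celery has no record of it is mine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass
class TaskRecord:
    """Hypothetical stand-in for a task instance row (not Airflow's model)."""
    key: str
    state: str          # e.g. "queued", "scheduled", "running"
    queued_at: datetime


def reschedule_stuck_queued(
    tasks: Iterable[TaskRecord],
    known_to_celery: Set[str],
    now: datetime,
    timeout: timedelta,
) -> List[str]:
    """Move stuck queued tasks back to "scheduled" and return their keys."""
    moved = []
    for task in tasks:
        if task.state != "queued":
            continue  # only queued tasks can be stuck in the queue
        if task.key in known_to_celery:
            continue  # assumption: Celery has a record, so the task is not lost
        if now - task.queued_at > timeout:
            # Resetting the state lets the scheduler queue the task again.
            task.state = "scheduled"
            moved.append(task.key)
    return moved
```

In a real deployment the set of tasks known to Celery would have to come from the result backend, and the state change would be a database update rather than an attribute assignment.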