
Handle stuck queued tasks in Celery for db backend#19769

Merged
ephraimbuddy merged 17 commits into apache:main from astronomer:fix-stuck-task
Jan 14, 2022

Conversation

@ephraimbuddy
Contributor

Move the state of stuck queued tasks in Celery to SCHEDULED so that the scheduler can queue them again.
Closes: #19699
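The approach can be sketched in plain Python (a simplified illustration, not Airflow's actual implementation; `TaskInstance` here is a stand-in for the ORM model and `STUCK_TIMEOUT` a hypothetical threshold):

```python
import time
from dataclasses import dataclass, field

QUEUED, SCHEDULED = "queued", "scheduled"
STUCK_TIMEOUT = 600  # hypothetical: seconds a task may sit in QUEUED before we give up on it

@dataclass
class TaskInstance:  # stand-in for Airflow's ORM model, for illustration only
    task_id: str
    state: str = QUEUED
    queued_at: float = field(default_factory=time.time)

def requeue_stuck_tasks(tasks, now=None):
    """Move tasks stuck in QUEUED back to SCHEDULED so the scheduler retries them."""
    now = time.time() if now is None else now
    stuck = [t for t in tasks if t.state == QUEUED and now - t.queued_at > STUCK_TIMEOUT]
    for t in stuck:
        t.state = SCHEDULED
    return stuck
```

The key property is that no state is lost: a stuck task simply re-enters the normal scheduling path.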



@boring-cyborg boring-cyborg bot added the area:Scheduler including HA (high availability) scheduler label Nov 23, 2021
@ephraimbuddy ephraimbuddy force-pushed the fix-stuck-task branch 3 times, most recently from 2419d79 to 3331926 on November 24, 2021 10:15
@ephraimbuddy ephraimbuddy marked this pull request as ready for review November 24, 2021 10:17
@ephraimbuddy ephraimbuddy force-pushed the fix-stuck-task branch 3 times, most recently from cb6ab1b to 39fc9c5 on November 30, 2021 08:22
@ephraimbuddy ephraimbuddy requested a review from ashb December 2, 2021 13:50
@kaxil kaxil added this to the Airflow 2.3.0 milestone Dec 8, 2021
@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jan 6, 2022
@github-actions

github-actions bot commented Jan 6, 2022

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk
Member

potiuk commented Jan 6, 2022

There are some sqlite tests failing here - likely related.

@kristoffern

@ephraimbuddy Glad to report that the scheduler and the system are still responsive 24h+ after the deployment.
I ran our full pipeline yesterday and again today without any hiccups.

Looks like... all good... knock on wood...

@ephraimbuddy
Contributor Author

Thanks @kristoffern. Could you please update with the recent changes in #21556?

@ephraimbuddy
Contributor Author

> Thanks @kristoffern. Please can you update with the recent changes in #21556

Any update?

@kristoffern

@ephraimbuddy Sorry, but I have limited time for this at the moment and won't have any bandwidth to test it.
Again, my apologies.

@ephraimbuddy
Contributor Author

@Chris7 can you help test the three patches here?
1: Patch #19769
2: Patch #21335
3: Patch #21556

Otherwise, we may have to revert this change


if self.adopted_task_timeouts:
self._check_for_stalled_adopted_tasks()
if time.time() - self.stuck_tasks_last_check_time > self.stuck_queued_task_check_interval:
Contributor

Won't this keep getting called every sync cycle? Shouldn't stuck_tasks_last_check_time be updated to the current time here?

Contributor Author

Good call
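A minimal sketch of the pattern this review exchange converged on (illustrative names only; the real logic lives in the Celery executor's sync loop):

```python
import time

class StuckTaskChecker:
    """Illustrative sketch of the interval check discussed above."""

    def __init__(self, interval=120.0):  # interval value is a hypothetical default
        self.stuck_queued_task_check_interval = interval
        self.stuck_tasks_last_check_time = time.time()

    def maybe_check(self, now=None):
        now = time.time() if now is None else now
        if now - self.stuck_tasks_last_check_time > self.stuck_queued_task_check_interval:
            # Reset the timestamp so the check does NOT fire on every sync cycle
            self.stuck_tasks_last_check_time = now
            return True  # caller runs the (expensive) stuck-task query
        return False
```

Without the timestamp reset inside the branch, the condition would stay true on every subsequent sync, which is exactly the bug the reviewer flagged.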

@Chris7
Contributor

Chris7 commented Mar 10, 2022

Is there an issue you can link explaining why it needs a reversion? I do recall having some issues with tasks being submitted to Celery during a deployment and getting left in queued limbo, but I'm unsure what about the current code is problematic.

@ephraimbuddy
Contributor Author

Since you are no longer having this issue, we should revert. I called your attention to it because of #21225, which is related.

@ephraimbuddy
Contributor Author

@pingzh, since you also experienced this, could you help test it? We have these patches:
1: Patch #19769
2: Patch #21335
3: Patch #21556
We are on the verge of reverting them all if no one tests them.

@Chris7
Contributor

Chris7 commented Mar 16, 2022

Thanks for the work on this @ephraimbuddy. I am not able to test right now. Our previous usage pattern was farming out a ton of Celery tasks that monitored work being done by a remote service (Batch). We've since moved these to the deferred pattern (a great thing to gain in the most recent upgrade).

So while I think the scenario from my bug report still exists, the chances of us hitting it are drastically lower (and we haven't hit it again yet) since the majority of our tasks are now deferred.

@pingzh
Contributor

pingzh commented Mar 28, 2022

@ephraimbuddy since we switched our message broker from Redis to SQS two years ago (it only happened under high load), we haven't experienced this issue.

@ephraimbuddy ephraimbuddy added changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) and removed type:bug-fix Changelog: Bug Fixes labels Apr 8, 2022
@repl-chris
Contributor

repl-chris commented Apr 29, 2022

@ephraimbuddy we can reproduce this reliably in an isolated test environment and would love to help test this fix to get it merged. We're running with a Redis broker and a PostgreSQL results db, on k8s with KEDA autoscaling. It happens under load during a scale-in event (shutting down a worker).

We've spent a bit of time tracking it down and (at least in our case) it looks to be a problem in Celery (possibly celery/celery#7266). Airflow hands the task to Celery and it just never executes; it never makes it into the celery_taskmeta table. Airflow gets a Celery task_id back and assumes it's running, but polling Celery for updated status returns celery_states.PENDING indefinitely, since the task has vanished and that's Celery's default response for an unknown task. It seems Celery's warm shutdown on SIGTERM somehow loses tasks sometimes. We've run some tests with this patch and it seems to recover from this state nicely; I believe it's (almost exactly) what you had implemented in #21556.
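The trap described above follows from Celery's documented behaviour: the result backend reports PENDING for any task id it has no record of, so a lost task is indistinguishable from one that hasn't started. A toy illustration of that polling logic (a plain dict stands in for the result backend; this is not Celery's API):

```python
CELERY_PENDING = "PENDING"  # mirrors celery.states.PENDING

# Stand-in for the celery_taskmeta result-backend table
taskmeta = {"task-123": "SUCCESS"}

def poll_state(task_id):
    """Unknown ids default to PENDING, so a vanished task looks 'not started' forever."""
    return taskmeta.get(task_id, CELERY_PENDING)
```

This is why a timeout on the Airflow side is needed at all: no amount of polling will ever surface the loss.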

I do have some concerns around self.tasks: in your implementation, when setting a task back to SCHEDULED, it is never cleared out of self.tasks, so if the re-scheduled task happens to be picked up by a different scheduler instance it will stay in there forever (and in self.running, come to think of it; entries would accumulate over time and eat up all the open slots). I think it would be desirable to add a filter so each executor only "re-schedules" tasks it "owns"; that way the executor can clean up its own internal structures at the same time, and any tasks not owned by a running executor will get picked up by the adoption code and then "re-scheduled" by the new owner. I tried taking a stab at that change but it didn't quite work and I haven't figured out why yet, so maybe there's something going on I don't fully understand. I'm going to keep poking at it, though. Let me know if I should open a new PR, or if you'd like to resurrect your existing one.
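The ownership filter proposed here could look roughly like this (a sketch with hypothetical names; `tasks` and `running` mirror the executor attributes mentioned above, and `set_scheduled` is an assumed helper, not an actual Airflow method):

```python
def reschedule_owned_stuck_tasks(executor, stuck_keys):
    """Re-schedule only the stuck tasks this executor instance owns, cleaning up
    its internal bookkeeping so entries don't accumulate in tasks/running.
    Sketch only; not the actual Airflow implementation."""
    owned = [k for k in stuck_keys if k in executor.tasks]
    for key in owned:
        executor.tasks.pop(key, None)   # drop the Celery result handle
        executor.running.discard(key)   # free the occupied slot
        executor.set_scheduled(key)     # hypothetical: flips the TI back to SCHEDULED
    return owned
```

Tasks not in `executor.tasks` are left alone, so the adoption path can hand them to whichever executor instance ends up owning them.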

@ephraimbuddy
Contributor Author

Hi @repl-chris , thanks for the detailed description.
I think it'd be best if you worked on this since you can reproduce it. That way we won't go back and forth on testing.
So please go ahead with a new PR.

kosteev pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jul 10, 2022
On testing #19769, a spike in CPU usage was reported (apache/airflow#19769 (comment)). Hopefully, this will fix it.

GitOrigin-RevId: a49224fa7ce45e9765c0d752edc30430e0d3ce14

Labels

area:Scheduler: including HA (high availability) scheduler
changelog:skip: Changes that should be skipped from the changelog (CI, tests, etc.)
full tests needed: We need to run the full set of tests for this PR to merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

task_instances stuck in "queued" and are missing corresponding celery_taskmeta entries