
Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout#30375

Merged
ephraimbuddy merged 34 commits into apache:main from RNHTTR:main
Apr 14, 2023

Conversation

@RNHTTR (Contributor) commented Mar 30, 2023

I accidentally closed #30108, so this is basically reopening that PR. Some updates:

  • There's not really a mechanism to deprecate multiple configs into one. The nature of this change requires some unique deprecation logic which is implemented in scheduler_job.py.
  • This deprecates celery.stalled_task_timeout, kubernetes.worker_pods_pending_timeout, and celery.task_adoption_timeout in favor of the single scheduler.task_queued_timeout option
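The one-to-many deprecation described above can be sketched roughly as follows. This is an illustration with stand-in names, not the actual scheduler_job.py implementation; in particular, the "largest deprecated value wins" policy here is an assumption for the sake of the example.

```python
import warnings

# Hypothetical stand-in for the parsed Airflow config; in Airflow these
# values would come from airflow.cfg sections.
config = {
    ("celery", "stalled_task_timeout"): None,
    ("kubernetes", "worker_pods_pending_timeout"): 300.0,
    ("celery", "task_adoption_timeout"): None,
    ("scheduler", "task_queued_timeout"): 600.0,
}

DEPRECATED_KEYS = [
    ("celery", "stalled_task_timeout"),
    ("kubernetes", "worker_pods_pending_timeout"),
    ("celery", "task_adoption_timeout"),
]

def resolve_task_queued_timeout(cfg):
    """Prefer any explicitly set deprecated option over the new
    scheduler.task_queued_timeout value, warning about the rename."""
    deprecated = [cfg[k] for k in DEPRECATED_KEYS if cfg.get(k) is not None]
    if deprecated:
        warnings.warn(
            "celery.stalled_task_timeout, kubernetes.worker_pods_pending_timeout "
            "and celery.task_adoption_timeout are deprecated; set "
            "scheduler.task_queued_timeout instead.",
            DeprecationWarning,
        )
        return max(deprecated)
    return cfg[("scheduler", "task_queued_timeout")]
```

The point is that several old options collapse into one new one, so the resolution has to check all of them before falling back to the new default.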

closes: #28120
closes: #21225
closes: #28943

Tasks occasionally get stuck in queued and aren't resolved by stalled_task_timeout (#28120). This PR moves the logic for handling stalled tasks to the scheduler and simplifies the logic by marking any task that has been queued for more than scheduler.task_queued_timeout as failed, allowing it to be retried if the task has available retries.

This doesn't require an additional scheduler, nor does it allow tasks to get stuck in an infinite loop of scheduled -> queued -> scheduled -> queued, as in #28943.
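The scheduler-side check described above amounts to something like the following. This is a simplified sketch with hypothetical names (plain dicts standing in for task instance rows), not the actual scheduler query:

```python
from datetime import datetime, timedelta, timezone

# Stand-in for the scheduler.task_queued_timeout config value.
TASK_QUEUED_TIMEOUT = timedelta(seconds=600)

def find_tasks_stuck_in_queued(task_instances, now=None):
    """Return task instances that have been in the 'queued' state longer
    than the timeout. The scheduler marks these as failed, so they are
    retried if the task has retries remaining."""
    now = now or datetime.now(timezone.utc)
    return [
        ti
        for ti in task_instances
        if ti["state"] == "queued"
        and now - ti["queued_dttm"] > TASK_QUEUED_TIMEOUT
    ]
```

In the real scheduler this is a database query filtered on state and queued time rather than an in-memory scan.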

@boring-cyborg boring-cyborg bot added provider:cncf-kubernetes Kubernetes (k8s) provider related issues area:Scheduler including HA (high availability) scheduler labels Mar 30, 2023
@ephraimbuddy (Contributor) left a comment:

Just reviewed the scheduler_job so far, will come back again

@RNHTTR RNHTTR force-pushed the main branch 2 times, most recently from d000e7a to 03aaf06, on April 4, 2023 21:08
@RNHTTR RNHTTR requested a review from potiuk as a code owner April 13, 2023 02:26
RNHTTR and others added 20 commits April 14, 2023 09:44
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
This more closely mirrors how deprecations are raised for "normal"
deprecations.

I've removed the depth, as moving up the stack doesn't really help the
user at all in this situation.
for ti in tis:
    readable_tis.append(repr(ti))
    task_instance_key = ti.key
    self.fail(task_instance_key, None)
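The snippet above fails each stuck task instance via the executor while keeping a readable list for logging. A self-contained sketch of that pattern is below; TaskInstance and the fail() method here are simplified stand-ins, not the actual Airflow BaseExecutor API.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """Minimal stand-in for an Airflow task instance."""
    dag_id: str
    task_id: str

    @property
    def key(self):
        # Real TaskInstance keys also carry run_id, map_index, try_number.
        return (self.dag_id, self.task_id)

class Executor:
    """Stand-in executor that records which task instances were failed."""

    def __init__(self):
        self.failed = []

    def fail(self, key, info=None):
        # The real executor transitions the task instance to FAILED;
        # here we just record the call.
        self.failed.append((key, info))

    def cleanup_stuck_queued_tasks(self, tis):
        readable_tis = []
        for ti in tis:
            readable_tis.append(repr(ti))
            self.fail(ti.key, None)
        return readable_tis
```

Returning the readable representations lets the caller log which tasks were failed and why, which ties into the review question below about surfacing the failure reason to users.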
Contributor:

Not sure if the data model allows it, but is it possible to add some error details to the task instance here so a user can understand why it failed just by looking at the web UI?

This would avoid admins getting asked questions from users like "why did this task fail without any logs?" and having to open up the scheduler logs.

Contributor (Author):

My understanding is that there's not an easy way to surface these logs in the UI. Such "missing" task logs could also be caused by zombies, which can be caused by... tons of stuff.

I think a good intermediate step would be to add a blurb in the docs about missing task logs for zombies as well as for tasks stuck in queued. I plan to open such a docs PR soon-ish.

Contributor (Author):

This will likely be possible in Airflow 2.8.0 with this PR: #32646

wookiist pushed a commit to wookiist/airflow that referenced this pull request Apr 19, 2023
…timeout (apache#30375)

* simplify and consolidate logic for tasks stuck in queued

* simplify and consolidate logic for tasks stuck in queued

* simplify and consolidate logic for tasks stuck in queued

* fixed tests; updated fail stuck tasks to use run_with_db_retries

* mypy; fixed tests

* fix task_adoption_timeout in celery integration test

* addressing comments

* remove useless print

* fix typo

* move failure logic to executor

* fix scheduler job test

* adjustments for new scheduler job

* appeasing static checks

* fix test for new scheduler job paradigm

* Updating docs for deprecations

* news & small changes

* news & small changes

* Update newsfragments/30375.significant.rst

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Update newsfragments/30375.significant.rst

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* added cleanup stuck task functionality to base executor

* fix sloppy mistakes & mypy

* removing self.fail from base_executor

* Update airflow/jobs/scheduler_job_runner.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Update airflow/jobs/scheduler_job_runner.py

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>

* Fix job_id filter

* Don't even run query if executor doesn't support timing out queued tasks

* Add support for LocalKubernetesExecutor and CeleryKubernetesExecutor

* Add config option to control how often it runs - we want it quicker than
the timeout

* Fixup newsfragment

* mark old KE pending pod check interval as deprecated by new check interval

* Fixup deprecation warnings

This more closely mirrors how deprecations are raised for "normal"
deprecations.

I've removed the depth, as moving up the stack doesn't really help the
user at all in this situation.

* Another deprecation cleanup

* Remove db retries

* Fix test

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <jedcunningham@apache.org>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
@ephraimbuddy ephraimbuddy added this to the Airflow 2.6.0 milestone Apr 23, 2023
@chenzuoli chenzuoli mentioned this pull request Jun 16, 2023
2 tasks

Labels

area:Scheduler including HA (high availability) scheduler provider:cncf-kubernetes Kubernetes (k8s) provider related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tasks stuck in queued despite stalled_task_timeout
Tasks stuck in queued state

4 participants