Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout (#30375)

ephraimbuddy merged 34 commits into apache:main
Conversation
ephraimbuddy
left a comment
Just reviewed the scheduler_job so far; will come back again.
Force-pushed from d000e7a to 03aaf06
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
This more closely mirrors how "normal" deprecation warnings are raised. I've removed the depth, since moving up the stack doesn't really help the user at all in this situation.
```python
for ti in tis:
    readable_tis.append(repr(ti))
    task_instance_key = ti.key
    self.fail(task_instance_key, None)
```
Not sure if the data model allows it, but is it possible to add some error details to the task instance here, so a user can understand why it failed just by looking in the web UI?
That would avoid admins getting a lot of questions from users like "why did this task fail without any logs?" and having to open up the scheduler logs.
My understanding is that there's not an easy way to surface these logs in the UI. Such "missing" task logs could also be caused by zombies, which can be caused by... tons of stuff.
I think a good intermediate step will be to add a blurb in the docs about missing task logs for zombies as well as tasks stuck in queued. I plan to open such a docs PR soon-ish.
This will likely be possible in Airflow 2.8.0 with this PR: #32646
…timeout (apache#30375)

* simplify and consolidate logic for tasks stuck in queued
* simplify and consolidate logic for tasks stuck in queued
* simplify and consolidate logic for tasks stuck in queued
* fixed tests; updated fail stuck tasks to use run_with_db_retries
* mypy; fixed tests
* fix task_adoption_timeout in celery integration test
* addressing comments
* remove useless print
* fix typo
* move failure logic to executor
* fix scheduler job test
* adjustments for new scheduler job
* appeasing static checks
* fix test for new scheduler job paradigm
* Updating docs for deprecations
* news & small changes
* news & small changes
* Update newsfragments/30375.significant.rst (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Update newsfragments/30375.significant.rst (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* added cleanup stuck task functionality to base executor
* fix sloppy mistakes & mypy
* removing self.fail from base_executor
* Update airflow/jobs/scheduler_job_runner.py (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Update airflow/jobs/scheduler_job_runner.py (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Fix job_id filter
* Don't even run query if executor doesn't support timing out queued tasks
* Add support for LocalKubernetesExecutor and CeleryKubernetesExecutor
* Add config option to control how often it runs - we want it quicker than the timeout
* Fixup newsfragment
* mark old KE pending pod check interval as deprecated by new check interval
* Fixup deprecation warnings: This more closely mirrors how deprecations are raised for "normal" deprecations. I've removed the depth, as moving up the stack doesn't really help the user at all in this situation.
* Another deprecation cleanup
* Remove db retries
* Fix test

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <jedcunningham@apache.org>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
I accidentally closed #30108, so this is basically reopening that PR. Some updates:
This PR moves the stuck-in-queued handling into scheduler_job.py and deprecates celery.stalled_task_timeout, kubernetes.worker_pods_pending_timeout, and celery.task_adoption_timeout.

closes: #28120
closes: #21225
closes: #28943
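For reference, a sketch of how the relevant `airflow.cfg` section might look after this change. The values shown are illustrative, not prescriptive; check the released configuration reference for actual defaults:

```ini
[scheduler]
# Replaces celery.stalled_task_timeout, celery.task_adoption_timeout,
# and kubernetes.worker_pods_pending_timeout: any task queued longer
# than this many seconds is failed by the scheduler.
task_queued_timeout = 600

# How often the scheduler checks for tasks stuck in queued; this should
# be shorter than the timeout itself.
task_queued_timeout_check_interval = 120
```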
Tasks occasionally get stuck in queued and aren't resolved by stalled_task_timeout (#28120). This PR moves the logic for handling stalled tasks to the scheduler and simplifies it by marking any task that has been queued for more than scheduler.task_queued_timeout as failed, allowing it to be retried if the task has retries available.

This doesn't require an additional scheduler, nor does it allow tasks to get stuck in an infinite loop of scheduled -> queued -> scheduled -> ... -> queued, as happens in #28943.
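The core idea can be sketched in a few lines of plain Python. This is an invented illustration, not the actual Airflow implementation (which queries TaskInstance rows in the metadata DB from the scheduler job); the function and variable names here are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def find_stuck_in_queued(queued_at_by_task, task_queued_timeout, now=None):
    """Return the ids of tasks that have sat in the queued state longer
    than task_queued_timeout seconds; the scheduler would fail these,
    letting them be retried if retries remain."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=task_queued_timeout)
    return [
        task_id
        for task_id, queued_at in queued_at_by_task.items()
        if queued_at < cutoff
    ]


now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
queued = {
    "fresh_task": now - timedelta(minutes=2),   # queued 2 min ago: fine
    "stuck_task": now - timedelta(minutes=20),  # queued 20 min ago: stuck
}
# With a 600-second timeout, only the 20-minute-old task is flagged.
print(find_stuck_in_queued(queued, 600, now=now))  # ['stuck_task']
```

Because the check is a simple "queued longer than the cutoff" predicate evaluated by the scheduler, it applies uniformly across executors instead of each executor implementing its own stall detection.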