Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout (#30375)

ephraimbuddy merged 34 commits into apache:main
Conversation
ephraimbuddy
left a comment
Just reviewed the scheduler_job so far; will come back again.
Force-pushed from d000e7a to 03aaf06
Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
This more closely mirrors how "normal" deprecation warnings are raised. I've removed the depth, since moving up the stack doesn't really help the user at all in this situation.
```python
for ti in tis:
    readable_tis.append(repr(ti))
    task_instance_key = ti.key
    self.fail(task_instance_key, None)
```
Not sure if the data model allows it, but is it possible to add some error details to the task instance here, so a user can understand why it failed just by looking in the web UI?
That would avoid admins getting a lot of questions from users like "why did this task fail without any logs?" and having to open up the scheduler logs.
My understanding is that there's not an easy way to surface these logs in the UI. Such "missing" task logs could also be caused by zombies, which can be caused by... tons of stuff.
I think a good intermediate step will be to add a blurb in the docs about missing task logs for zombies as well as tasks stuck in queued. I plan to open such a docs PR soon-ish.
This will likely be possible in Airflow 2.8.0 with this PR: #32646
…timeout (apache#30375)

* simplify and consolidate logic for tasks stuck in queued
* simplify and consolidate logic for tasks stuck in queued
* simplify and consolidate logic for tasks stuck in queued
* fixed tests; updated fail stuck tasks to use run_with_db_retries
* mypy; fixed tests
* fix task_adoption_timeout in celery integration test
* addressing comments
* remove useless print
* fix typo
* move failure logic to executor
* fix scheduler job test
* adjustments for new scheduler job
* appeasing static checks
* fix test for new scheduler job paradigm
* Updating docs for deprecations
* news & small changes
* news & small changes
* Update newsfragments/30375.significant.rst (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Update newsfragments/30375.significant.rst (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* added cleanup stuck task functionality to base executor
* fix sloppy mistakes & mypy
* removing self.fail from base_executor
* Update airflow/jobs/scheduler_job_runner.py (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Update airflow/jobs/scheduler_job_runner.py (Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>)
* Fix job_id filter
* Don't even run query if executor doesn't support timing out queued tasks
* Add support for LocalKubernetesExecutor and CeleryKubernetesExecutor
* Add config option to control how often it runs - we want it quicker than the timeout
* Fixup newsfragment
* mark old KE pending pod check interval as deprecated by new check interval
* Fixup deprecation warnings: This more closely mirrors how deprecations are raised for "normal" deprecations. I've removed the depth, as moving up the stack doesn't really help the user at all in this situation.
* Another deprecation cleanup
* Remove db retries
* Fix test

---------

Co-authored-by: Jed Cunningham <66968678+jedcunningham@users.noreply.github.com>
Co-authored-by: Jed Cunningham <jedcunningham@apache.org>
Co-authored-by: Ephraim Anierobi <splendidzigy24@gmail.com>
I accidentally closed #30108, so this is basically reopening that PR. Some updates:
This PR moves the stuck-in-queued handling into scheduler_job.py and deprecates celery.stalled_task_timeout, kubernetes.worker_pods_pending_timeout, and celery.task_adoption_timeout.

closes: #28120
closes: #21225
closes: #28943
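For reference, a sketch of how the relevant `airflow.cfg` section might look after this change. The values shown are illustrative, not prescriptive; check the released configuration reference for actual defaults:

```ini
[scheduler]
# Replaces celery.stalled_task_timeout, celery.task_adoption_timeout,
# and kubernetes.worker_pods_pending_timeout: any task queued longer
# than this many seconds is failed by the scheduler.
task_queued_timeout = 600

# How often the scheduler checks for tasks stuck in queued; this should
# be shorter than the timeout itself.
task_queued_timeout_check_interval = 120
```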
Tasks occasionally get stuck in queued and aren't resolved by stalled_task_timeout (#28120). This PR moves the logic for handling stalled tasks to the scheduler and simplifies it by marking any task that has been queued for more than scheduler.task_queued_timeout as failed, allowing it to be retried if the task has retries available.

This doesn't require an additional scheduler, nor does it allow tasks to get stuck in an infinite loop of scheduled -> queued -> scheduled -> ... -> queued, as happens in #28943.
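The core idea can be sketched in a few lines of plain Python. This is an invented illustration, not the actual Airflow implementation (which queries TaskInstance rows in the metadata DB from the scheduler job); the function and variable names here are hypothetical:

```python
from datetime import datetime, timedelta, timezone


def find_stuck_in_queued(queued_at_by_task, task_queued_timeout, now=None):
    """Return the ids of tasks that have sat in the queued state longer
    than task_queued_timeout seconds; the scheduler would fail these,
    letting them be retried if retries remain."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=task_queued_timeout)
    return [
        task_id
        for task_id, queued_at in queued_at_by_task.items()
        if queued_at < cutoff
    ]


now = datetime(2023, 5, 1, 12, 0, tzinfo=timezone.utc)
queued = {
    "fresh_task": now - timedelta(minutes=2),   # queued 2 min ago: fine
    "stuck_task": now - timedelta(minutes=20),  # queued 20 min ago: stuck
}
# With a 600-second timeout, only the 20-minute-old task is flagged.
print(find_stuck_in_queued(queued, 600, now=now))  # ['stuck_task']
```

Because the check is a simple "queued longer than the cutoff" predicate evaluated by the scheduler, it applies uniformly across executors instead of each executor implementing its own stall detection.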