Tasks stuck in queued state #28206

@benrifkind

Apache Airflow version

2.5.0

What happened

Tasks are getting stuck in the queued state

What you think should happen instead

Tasks should get scheduled and run

How to reproduce

I am using the CeleryExecutor and deploying Airflow on AWS's EKS.

I have 3 DAGs with 10 tasks. Each task is a simple KubernetesPodOperator that exits as soon as it starts. If I deploy Airflow with CELERY__WORKER_CONCURRENCY set to something high like 32, the Celery worker fails and the tasks that were queued to run on it enter a bad state. Even after I lower the concurrency to 16, those tasks are still not scheduled. Note that if I set the worker concurrency to 16 on the initial deploy, the tasks never get into a bad state and everything works fine.

Even clearing the tasks does not fix the issue. I get this log line in the scheduler:

ERROR - could not queue task TaskInstanceKey(dag_id='batch_1', task_id='task_2', run_id='scheduled__2022-01-07T04:05:00+00:00', try_number=1, map_index=-1) (still running after 4 attempts)

To me it seems like the scheduler thinks the task is still running even though it is not.
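For anyone trying to see which task instances are affected, here is a minimal sketch that pulls the fields out of scheduler log lines shaped like the one above. The regular expression is derived only from that single example line, and the names `QueueFailure` and `parse_queue_failure` are made up for illustration; they are not part of Airflow.

```python
import re
from typing import NamedTuple, Optional


class QueueFailure(NamedTuple):
    """Fields extracted from one 'could not queue task' log line (hypothetical helper type)."""
    dag_id: str
    task_id: str
    run_id: str
    try_number: int
    map_index: int
    attempts: int


# Pattern based solely on the example line quoted in this issue; other Airflow
# versions may format the message differently.
_PATTERN = re.compile(
    r"could not queue task TaskInstanceKey\("
    r"dag_id='(?P<dag_id>[^']*)', "
    r"task_id='(?P<task_id>[^']*)', "
    r"run_id='(?P<run_id>[^']*)', "
    r"try_number=(?P<try_number>-?\d+), "
    r"map_index=(?P<map_index>-?\d+)\) "
    r"\(still running after (?P<attempts>\d+) attempts\)"
)


def parse_queue_failure(line: str) -> Optional[QueueFailure]:
    """Return the parsed fields, or None if the line is not a queue-failure message."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return QueueFailure(
        dag_id=match["dag_id"],
        task_id=match["task_id"],
        run_id=match["run_id"],
        try_number=int(match["try_number"]),
        map_index=int(match["map_index"]),
        attempts=int(match["attempts"]),
    )
```

Running this over the scheduler log and grouping by `dag_id` and `task_id` should show whether the stuck tasks are exactly the ones that were queued on the worker when it failed.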

Clearing the task and restarting the scheduler seems to do the trick.

Happy to provide any further information that is needed. Tasks also sometimes get stuck in the queued state in my production environment, which is the impetus for this investigation. I'm not sure if it is the same problem, but I would like to figure out whether this is a bug or just a misconfiguration on my end. Thanks for your help.

Operating System

Debian GNU/Linux

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cncf-kubernetes==5.0.0

Deployment

Other 3rd-party Helm chart

Deployment details

I am using this Helm chart to deploy: https://github.com/airflow-helm/charts/tree/main/charts/airflow (v8.6.1)

I know that chart is not supported by Apache Airflow, but I don't think the problem is related to the chart. Based on the logs and the workaround, it seems like an issue with Airflow/Celery.

Anything else

This problem can be reproduced every time by following the steps I detailed above. I am not sure whether the way the Celery worker fails matters.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
