Simplify logic to resolve tasks stuck in queued despite stalled_task_timeout #30108

Closed

RNHTTR wants to merge 0 commits into apache:main from RNHTTR:main

Conversation

@RNHTTR (Contributor) commented Mar 14, 2023

closes: #28120
closes: #21225
closes: #28943

Tasks occasionally get stuck in queued and aren't resolved by stalled_task_timeout (#28120). This PR moves the logic for handling stalled tasks to the scheduler and simplifies the logic by marking any task that has been queued for more than scheduler.task_queued_timeout as failed, allowing it to be retried if the task has available retries.

This doesn't require an additional scheduler, nor does it allow tasks to get stuck in an infinite loop of scheduled -> queued -> scheduled -> ... -> queued, as described in #28943.
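For illustration, here is a minimal sketch of the check this PR proposes, assuming the new scheduler.task_queued_timeout option and Airflow's existing TaskInstance model. The helper name and query details are hypothetical, not the PR's actual diff:

```python
from datetime import timedelta

from airflow.configuration import conf
from airflow.models.taskinstance import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import TaskInstanceState


@provide_session
def fail_stuck_queued_tasks(session=None):
    # Hypothetical helper: fail any TI that has sat in queued longer than
    # the proposed scheduler.task_queued_timeout. handle_failure() respects
    # retries, so a task with retries remaining is retried rather than
    # failed outright.
    timeout = conf.getfloat("scheduler", "task_queued_timeout")
    cutoff = timezone.utcnow() - timedelta(seconds=timeout)
    stuck_tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.state == TaskInstanceState.QUEUED,
            TaskInstance.queued_dttm < cutoff,
        )
        .all()
    )
    for ti in stuck_tis:
        ti.handle_failure(
            error=f"stuck in queued for more than {timeout}s", session=session
        )
```

Because the scheduler owns this check, it applies uniformly regardless of which executor queued the task.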

@boring-cyborg (bot) added the area:Scheduler including HA (high availability) scheduler label Mar 14, 2023
@collinmcnulty (Contributor)

For backward compatibility and semver reasons, would making the tasks go to failed be considered a breaking change? I know that behavior differed from the comparable k8s executor behavior and seems a little askew from expectations, so it could arguably be considered a bug. And this covers that case plus more, so that's good.

@RNHTTR (Contributor, Author) commented Mar 14, 2023

Would appreciate your feedback @okayhooni & @repl-chris !

Contributor

Do you need both of these lines? Don't you only want the one wrapped in timers.call_regular_interval?

Contributor Author

I was following the precedent set by adopt_or_reset_orphaned_tasks (line 858), which runs at scheduler startup and then at a regular interval.
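(For reference, a loose sketch of that pattern, assuming Airflow's EventScheduler timer; the standalone function stands in for the method on the scheduler job:)

```python
from airflow.configuration import conf
from airflow.utils.event_scheduler import EventScheduler


def adopt_or_reset_orphaned_tasks():
    ...  # adopt or reset TIs whose scheduler job died


timers = EventScheduler()

# Run once at scheduler startup ...
adopt_or_reset_orphaned_tasks()

# ... and then re-run on a fixed interval as the scheduler
# loop pumps the timer queue.
timers.call_regular_interval(
    conf.getfloat("scheduler", "orphaned_tasks_check_interval"),
    adopt_or_reset_orphaned_tasks,
)
```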

Member

This probably doesn't need to run at startup. Adoption makes sense to do because it's pretty likely that another scheduler just shut down if we have a new one starting, but I don't think we have a similar situation here.

Member

Is this deprecation reflected in the code base by a DeprecationWarning?

Contributor Author

Updated to include a DeprecationWarning.
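(A hand-rolled illustration of the fallback-plus-warning pattern, assuming this PR's rename of celery.stalled_task_timeout to scheduler.task_queued_timeout; as a reviewer notes below, Airflow's config layer can emit this warning automatically once the option is marked deprecated:)

```python
import warnings

from airflow.configuration import conf


def get_task_queued_timeout() -> float:
    # Prefer the new option, but honor the deprecated one with a warning.
    if conf.has_option("celery", "stalled_task_timeout"):
        warnings.warn(
            "celery.stalled_task_timeout is deprecated; "
            "use scheduler.task_queued_timeout instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return conf.getfloat("celery", "stalled_task_timeout")
    return conf.getfloat("scheduler", "task_queued_timeout")
```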

Contributor

We should remove it if it's no longer needed.

Contributor Author

Would removing it unnecessarily break backward compatibility?

Contributor

No

Contributor

This would affect all executors, not just Celery, and we have some other settings in Kubernetes for pending tasks, etc. WDYT?

Contributor Author

IMO this is a general problem that applies to both the Kubernetes and Celery executors. The relevant k8s executor setting is worker_pods_queued_check_interval; I think this can also be handled in the scheduler. I also think this can probably replace task_adoption_timeout.

Contributor Author

If you agree, I'll remove those configurations as well.

Member

worker_pods_queued_check_interval is similar, but different in that it won't automatically just reset the TI: it first checks to see whether the pod exists.

worker_pods_pending_timeout is essentially this same process, though. It should probably be deprecated as well (though I'm not sure how config handles a many-to-one mapping).
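(For context, a loose sketch of the pod-existence check that distinguishes worker_pods_queued_check_interval, using the official kubernetes client; the helper and the exact label selector are illustrative, and the real executor matches pods against TI metadata:)

```python
from kubernetes import client, config


def queued_ti_has_pod(namespace: str, dag_id: str, task_id: str, run_id: str) -> bool:
    # Return True if a worker pod already exists for this queued TI;
    # only TIs with no matching pod get reset back to scheduled.
    config.load_incluster_config()
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace,
        label_selector=f"dag_id={dag_id},task_id={task_id},run_id={run_id}",
    )
    return bool(pods.items)
```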

Member

You don't have to do this manually; Airflow does it for you when you mark it as a deprecated option.


Comment on lines 1427 to 1428

@ashb (Member) commented Mar 21, 2023

I don't see any handling of retry state in this code; we go straight to failed and bypass retry logic (unless I've forgotten how that works?).

Contributor Author

Would it be sufficient to call TI.handle_failure()?

Contributor

It seems like it should be, based on my read of TI.handle_failure.
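(For context, a paraphrased sketch of the retry decision inside TaskInstance.handle_failure; simplified, not the real source:)

```python
from airflow.utils.state import TaskInstanceState


def resolve_failure_state(ti) -> TaskInstanceState:
    # A TI with retries remaining goes to up_for_retry instead of failed,
    # which is what lets a timed-out queued task run again.
    if ti.is_eligible_to_retry():
        return TaskInstanceState.UP_FOR_RETRY
    return TaskInstanceState.FAILED
```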

@RNHTTR RNHTTR requested a review from dstandish as a code owner March 24, 2023 21:21
@collinmcnulty (Contributor)

I'm unclear on why task adoption was removed. It covers a whole class of problem (running tasks whose scheduler died) that doesn't seem to be otherwise addressed by this PR.

@RNHTTR (Contributor, Author) commented Mar 28, 2023

Accidentally closed this and nuked my changes. I'll open a new PR or re-open this one.


Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tasks stuck in queued despite stalled_task_timeout
Tasks stuck in queued state

6 participants