Skip to content

Set local task exit status when time limit is exceeded#6592

Merged
bentsherman merged 2 commits intomasterfrom
6549-fix-local-task-timeout-exit-status
Dec 3, 2025
Merged

Set local task exit status when time limit is exceeded#6592
bentsherman merged 2 commits intomasterfrom
6549-fix-local-task-timeout-exit-status

Conversation

@bentsherman
Copy link
Member

Close #6549

This PR fixes an issue where local tasks that exceed their time limit do not have an exit status, making it difficult to apply common error strategies such as retrying with more time based on the exit code.

The exit status is inferred from the sub-process.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman requested a review from jorgee November 20, 2025 17:44
@netlify
Copy link

netlify bot commented Nov 20, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit a755bf3
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/6930ab0d582cc50008a8b65f

*/
if( elapsedTimeMillis() > wallTimeMillis ) {
destroy()
task.exitStatus = process.exitValue()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not make sense to use a predictable exit status (we event is known). Otherwise it can entirely depend on "destroy" implementation, and I'm not even sure that's going to a deterministic value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an interesting question. I think you could argue either way -- to make it consistent at Nextflow level, or make it consistent with the underlying OS, which the JVM implementation is likely to imitate

I figure that if a user has a pipeline with a mix of e.g. local and SLURM jobs, they would prefer to use the same error codes for both. That's why I lean towards deferring to the JVM

When I tested locally, it returned 143 which corresponds to SIGTERM. This is probably what most implementations do. Some might use 137 (SIGKILL) but that seems less likely since it's more aggressive

Copy link
Contributor

@jorgee jorgee Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current approach is correct and coherent with the other executors. The UNIXProcess implementation in most of JVM sends the signal TERM. Moreover, the .command.run has a on_term trap for TERM INT USR2 and another on_exit trap that unifies the kill of children processes with signal TERM, and the on_exit that manages the final exit code. So, if for some reason there is a JVM, Queue system or Cloud provider that uses signals INT or USR2 instead of TERM to cancel a task, we should get the same exit code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However guessing approach and I don't see what's preventing using a static code, so it's entirely predictable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a big deal either way. This is a very minor point. Let's see how the JVM behaves and if any interesting cases come about

*/
if( elapsedTimeMillis() > wallTimeMillis ) {
destroy()
task.exitStatus = process.exitValue()
Copy link
Contributor

@jorgee jorgee Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current approach is correct and coherent with the other executors. The UNIXProcess implementation in most of JVM sends the signal TERM. Moreover, the .command.run has a on_term trap for TERM INT USR2 and another on_exit trap that unifies the kill of children processes with signal TERM, and the on_exit that manages the final exit code. So, if for some reason there is a JVM, Queue system or Cloud provider that uses signals INT or USR2 instead of TERM to cancel a task, we should get the same exit code.

@bentsherman bentsherman merged commit d3f8e13 into master Dec 3, 2025
24 checks passed
@bentsherman bentsherman deleted the 6549-fix-local-task-timeout-exit-status branch December 3, 2025 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

task.exitStatus is unset on timeout with local executor, breaking dynamic errorStrategy

3 participants