Fix K8s job fallback to not return hardcoded exit code by jorgee · Pull Request #6746 · nextflow-io/nextflow

jorgee · 2026-01-22T08:31:38Z

Summary

Fixes #6636 - K8s executor was incorrectly reporting exit code 0 for failed processes. This PR addresses two separate but related issues that combined to cause the problem.

Problem 1: Hardcoded Exit Code in Fallback Logic

When a Kubernetes job completes and its pod is garbage collected (via ttlSecondsAfterFinished), the jobStateFallback0 method was returning a hardcoded exit code of 0. This caused failed tasks to be incorrectly reported as successful when the pod was cleaned up before Nextflow could retrieve the actual exit code from the pod's container status.

The issue was introduced by the interaction of:

PR #6484 (Optimize exit code handling by relying on scheduler status for successful executions): changed exit code handling to use k8sExitCode != null ? k8sExitCode : readExitFile()
PR #6597 (Do not delete K8s jobs when ttlSecondsAfterFinished is set): added TTL-based cleanup, so pods can be garbage collected while their jobs are still tracked

Root Cause

In K8sClient.jobStateFallback0(), when the pod is gone but the job shows succeeded: 1, the code created a dummy status with:

terminated: [
    exitcode: 0,  // ← Hardcoded!
    reason: "Completed",
    ...
]

This hardcoded 0 prevented the fallback to reading the actual exit code from the .exitcode file.

Solution

Remove the hardcoded exitcode field from the dummy status. When exitCode is not present (null), K8sTaskHandler will properly fall back to reading the actual exit code from the .exitcode file, which contains the correct value.
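
The effect of dropping the hardcoded value can be illustrated with a small shell sketch. This is not the Groovy code from the PR: resolve_exit_code is a hypothetical helper, an empty string stands in for a null exit code, and the "unknown" result for a missing file is an assumption made only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nextflow): prefer the exit code reported
# by the scheduler, otherwise fall back to the task's .exitcode file.
resolve_exit_code() {
    local reported="$1"   # empty string stands in for a null exit code
    local work_dir="$2"
    if [[ -n "$reported" ]]; then
        echo "$reported"
    elif [[ -r "$work_dir/.exitcode" ]]; then
        cat "$work_dir/.exitcode"
    else
        echo "unknown"    # assumption for this sketch only
    fi
}

work_dir="$(mktemp -d)"
echo 3 > "$work_dir/.exitcode"   # the task actually failed with exit code 3

# A dummy status carrying a hardcoded 0 hides the real result.
with_hardcoded_zero="$(resolve_exit_code 0 "$work_dir")"
# A dummy status without an exit code lets the file value through.
with_missing_code="$(resolve_exit_code "" "$work_dir")"

echo "hardcoded 0 in dummy status:  $with_hardcoded_zero"   # prints 0
echo "no exit code in dummy status: $with_missing_code"     # prints 3
rm -rf "$work_dir"
```

With the hardcoded value the file is never consulted, which is the situation the removal is meant to avoid.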

Problem 2: Missing `pipefail` in Pipeline Command

The K8s executor was not properly capturing exit codes from failed tasks because bash pipelines return the exit code of the last command (tee), not the actual script.

Root Cause

The command was:

bash -ue -c "bash .command.run 2>&1 | tee .command.log"

When .command.run fails with exit code 3, the pipeline returns 0 because tee succeeds, and bash returns the exit code of the last command in the pipeline.
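
The behavior can be reproduced outside Kubernetes with a minimal sketch. Here failing_task is a hypothetical stand-in for .command.run that exits with code 3, and the log is written to a temporary file rather than the real .command.log.

```shell
#!/usr/bin/env bash
# Make sure bash's default (pipefail disabled) is in effect for this demo.
set +o pipefail

# Hypothetical stand-in for .command.run: prints a line and fails with 3.
failing_task() {
    echo "task output"
    return 3
}

log_file="$(mktemp)"

failing_task 2>&1 | tee "$log_file" > /dev/null
# Capture the pipeline status and the per-command statuses in one step,
# before any other command overwrites them.
statuses=("$?" "${PIPESTATUS[@]}")

pipeline_status="${statuses[0]}"
task_status="${statuses[1]}"
tee_status="${statuses[2]}"

echo "pipeline status: $pipeline_status"   # prints 0
echo "task status:     $task_status"       # prints 3
echo "tee status:      $tee_status"        # prints 0
rm -f "$log_file"
```

The task's exit code 3 is still visible in PIPESTATUS, but the status of the pipeline as a whole is the 0 returned by tee.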

Solution

Add the pipefail option so that bash returns the exit status of the last (rightmost) command in the pipeline that exited with a non-zero status, or zero if every command succeeds:

bash -ue -o pipefail -c "bash .command.run 2>&1 | tee .command.log"

This aligns with how other executors handle this (Azure Batch, Google Batch) and ensures proper error propagation.
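
The two launch commands can be compared locally with a short sketch. It assumes a temporary directory and a stub .command.run that exits with code 3; both are stand-ins created for this example, not the files Nextflow generates.

```shell
#!/usr/bin/env bash
work_dir="$(mktemp -d)"

# Stub standing in for the real .command.run: prints a line, then fails with 3.
printf 'echo "task output"\nexit 3\n' > "$work_dir/.command.run"

# Previous command: the exit status comes from tee.
status_old=0
(cd "$work_dir" && bash -ue -c "bash .command.run 2>&1 | tee .command.log" > /dev/null) || status_old=$?

# Command with pipefail: the failure of .command.run is propagated.
status_new=0
(cd "$work_dir" && bash -ue -o pipefail -c "bash .command.run 2>&1 | tee .command.log" > /dev/null) || status_new=$?

echo "without pipefail: $status_old"   # prints 0
echo "with pipefail:    $status_new"   # prints 3
rm -rf "$work_dir"
```

The `|| status=$?` form keeps the sketch usable even when the calling shell runs with errexit enabled.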

Changes

K8sClient.groovy: Remove hardcoded exitcode: 0 from jobStateFallback0 dummy status
K8sTaskHandler.groovy: Add -o pipefail to classicSubmitCli() method
K8sClientTest.groovy: Add test to verify exitCode is not present in fallback response
K8sTaskHandlerTest.groovy: Update all tests to expect the new command format with pipefail

Test Plan

Added unit test: should fallback to job status when pod is gone and not return hardcoded exit code
Updated existing tests to verify pipefail option is included
All K8s plugin tests pass

🤖 Generated with Claude Code

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

netlify · 2026-01-22T08:31:45Z

✅ Deploy Preview for nextflow-docs-staging ready!

Name	Link
🔨 Latest commit	`adeee37`
🔍 Latest deploy log	https://app.netlify.com/projects/nextflow-docs-staging/deploys/697203527065260008172069
😎 Deploy Preview	https://deploy-preview-6746--nextflow-docs-staging.netlify.app

pditommaso · 2026-01-22T09:39:57Z

Should this be backported to 25.10.x?

jorgee · 2026-01-22T09:42:05Z

Sorry, I have moved this to draft; we still need to validate that it fixes the issue for the reporter. I tried to reproduce the issue but couldn't.

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

jorgee · 2026-01-22T10:46:40Z

I was finally able to reproduce the error with jorgee/nf-test -r pod_ttl-exitcode. I had been trying to reproduce it with the wrong version.

The problem is the pipe in the pod execution command: it returns the exit code of tee instead of the exit code of .command.run.
I think the hardcoded exitcode: 0 in the fallback mechanism could also create problems with #6484, so I have kept that fix in the PR.

I have built an image with the PR changes (jorgeejarquea/nextflow:25.12.0-edge-ae3419ebe) for testing, and it works in my case. I used the following command:

nextflow kuberun -head-image jorgeejarquea/nextflow:25.12.0-edge-ae3419ebe jorgee/nf-test -r pod_ttl-exitcode -latest -v nextflow-pvc:/mnt/data/launch

@BioWilko, are you able to test your pipeline with this PR and check whether the issue is also solved in your case?

@pditommaso, no need to backport, as the issue does not reproduce in 25.10.x

BioWilko · 2026-01-22T12:30:09Z

@jorgee I'm trying to get this running but the pipeline is an old one with some quirky syntax so it's failing in a lot of places, this new version is wayyyyyy more picky about syntax

jorgee · 2026-01-22T12:36:35Z

> @jorgee I'm trying to get this running but the pipeline is an old one with some quirky syntax so it's failing in a lot of places, this new version is wayyyyyy more picky about syntax

Ah, I think that must be because the main branch uses the V2 parser by default. You can switch to the V1 parser by setting the environment variable NXF_SYNTAX_PARSER=v1

BioWilko · 2026-01-22T12:47:19Z

> Ah, I think they must be because the main branch has V2 parser by default. You can set the V1 parser setting this env var NXF_SYNTAX_PARSER=v1

Yep that sorted it, thanks!

Everything seems to be working as it should now on my infra :) thanks a lot!

erasmussenilm · 2026-02-05T15:35:19Z

@jorgee / @bentsherman - can this fix please be incorporated in a v25.10.x patch? The original changed behavior (returning 0 for failing tasks) required us to add 1-2 weeks of developer work to handle generally. In addition, when combined with other changes from v22 and v24, we had a customer experience a task failing completely silently upon upgrade, but without generating any of the .command.begin [...] .exitcode files. The cause was that the main container failed immediately (but entirely silently w/o errors and with 0 exit code) due to other changed behaviors related to using command vs args and ENTRYPOINT/CMD overrides.

We would really appreciate being able to bump our v25.10 patch number, especially since we don't have plans for upgrading to a new major/minor version until after v26.10 comes out.

bentsherman · 2026-02-06T15:44:42Z

@erasmussenilm I don't think it makes sense to backport this change since it builds on previous changes that were added in 25.11.0-edge

It will be available in the next stable release, 26.04, which usually comes out in early May

Fix K8s job fallback to not return hardcoded exit code

7529982

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

pditommaso approved these changes Jan 22, 2026

jorgee marked this pull request as draft January 22, 2026 09:41

Add pipefail in pod launch command

ae3419e

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

jorgee requested a review from pditommaso January 22, 2026 11:00

jorgee marked this pull request as ready for review January 22, 2026 11:00

Merge branch 'master' into fix-k8s-job-fallback-exitcode

adeee37

jorgee requested a review from bentsherman January 22, 2026 13:11

bentsherman added the executor/k8s label Jan 23, 2026

bentsherman merged commit 5730679 into master Jan 23, 2026
41 of 42 checks passed

bentsherman deleted the fix-k8s-job-fallback-exitcode branch January 23, 2026 16:00

bentsherman merged 3 commits into master from fix-k8s-job-fallback-exitcode
