Fix K8s job fallback to not return hardcoded exit code #6746

Merged: bentsherman merged 3 commits into master from fix-k8s-job-fallback-exitcode on Jan 23, 2026

@jorgee
Contributor

@jorgee jorgee commented Jan 22, 2026

Summary

Fixes #6636 - K8s executor was incorrectly reporting exit code 0 for failed processes. This PR addresses two separate but related issues that combined to cause the problem.

Problem 1: Hardcoded Exit Code in Fallback Logic

When a Kubernetes job completes and its pod is garbage collected (via ttlSecondsAfterFinished), the jobStateFallback0 method was returning a hardcoded exit code of 0. This caused failed tasks to be incorrectly reported as successful when the pod was cleaned up before Nextflow could retrieve the actual exit code from the pod's container status.

The issue was introduced by the interaction of:

Root Cause

In K8sClient.jobStateFallback0(), when the pod is gone but the job shows succeeded: 1, the code created a dummy status with:

terminated: [
    exitcode: 0,  // ← Hardcoded!
    reason: "Completed",
    ...
]

This hardcoded 0 prevented the fallback to reading the actual exit code from the .exitcode file.

Solution

Remove the hardcoded exitcode field from the dummy status. When exitCode is not present (null), K8sTaskHandler will properly fall back to reading the actual exit code from the .exitcode file, which contains the correct value.
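
A rough shell sketch of the fallback behaviour described above, not the actual Groovy code in K8sTaskHandler. The function name resolve_exit_code, its arguments, and the "unknown" placeholder are hypothetical; they only illustrate why a hardcoded 0 masks the value stored in the .exitcode file:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Nextflow): prefer the exit code reported
# by the job status; if none was reported, read the task's .exitcode file.
resolve_exit_code() {
    local reported="$1"   # exit code from the (dummy) status, may be empty
    local work_dir="$2"   # task work directory containing .exitcode
    if [ -n "$reported" ]; then
        echo "$reported"
    elif [ -r "$work_dir/.exitcode" ]; then
        tr -d '[:space:]' < "$work_dir/.exitcode"
    else
        echo "unknown"    # placeholder, assumed for this sketch only
    fi
}

# Simulate a task that failed with exit code 3 and whose pod is already gone.
work_dir="$(mktemp -d)"
printf '3\n' > "$work_dir/.exitcode"

old_behaviour="$(resolve_exit_code 0 "$work_dir")"    # hardcoded 0 wins
new_behaviour="$(resolve_exit_code "" "$work_dir")"   # falls back to the file

echo "with hardcoded exit code: $old_behaviour"
echo "without exit code in status: $new_behaviour"
rm -rf "$work_dir"
```

In this sketch the first call reports 0 even though the task failed, while the second call returns the 3 recorded in .exitcode, which mirrors the change made in jobStateFallback0.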

Problem 2: Missing pipefail in Pipeline Command

The K8s executor was not properly capturing exit codes from failed tasks because bash pipelines return the exit code of the last command (tee), not the actual script.

Root Cause

The command was:

bash -ue -c "bash .command.run 2>&1 | tee .command.log"

When .command.run fails with exit code 3, the pipeline returns 0 because tee succeeds, and bash returns the exit code of the last command in the pipeline.

Solution

Add the pipefail option so that bash returns the exit code of the last (rightmost) command in the pipeline that exited with a non-zero status, or 0 if every command succeeded:

bash -ue -o pipefail -c "bash .command.run 2>&1 | tee .command.log"

This aligns with how other executors handle this (Azure Batch, Google Batch) and ensures proper error propagation.
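
The difference can be checked locally with a small sketch. Here `bash -c 'echo task output; exit 3'` is a hypothetical stand-in for .command.run, and /dev/null replaces .command.log so that no file is written:

```shell
#!/usr/bin/env bash
# Stand-in for a task script that prints something and fails with code 3.
task="bash -c 'echo task output; exit 3'"

# Without pipefail the pipeline status is that of tee, which succeeds.
without_pipefail=0
bash -ue -c "$task 2>&1 | tee /dev/null" || without_pipefail=$?

# With pipefail the non-zero status of the failing command is propagated.
with_pipefail=0
bash -ue -o pipefail -c "$task 2>&1 | tee /dev/null" || with_pipefail=$?

echo "without pipefail: $without_pipefail"
echo "with pipefail: $with_pipefail"
```

The first invocation exits with 0 and the second with 3, which is the behaviour the new submit command relies on.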

Changes

  1. K8sClient.groovy: Remove hardcoded exitcode: 0 from jobStateFallback0 dummy status
  2. K8sTaskHandler.groovy: Add -o pipefail to classicSubmitCli() method
  3. K8sClientTest.groovy: Add test to verify exitCode is not present in fallback response
  4. K8sTaskHandlerTest.groovy: Update all tests to expect the new command format with pipefail

Test Plan

  • Added unit test: should fallback to job status when pod is gone and not return hardcoded exit code
  • Updated existing tests to verify pipefail option is included
  • All K8s plugin tests pass

🤖 Generated with Claude Code

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

@pditommaso
Member

Should this be backported to 25.10.x?

@jorgee jorgee marked this pull request as draft January 22, 2026 09:41
@jorgee
Contributor Author

jorgee commented Jan 22, 2026

Sorry, I have moved this to draft; we still need to validate that it fixes the issue for the reporter. I tried to reproduce the issue but couldn't.

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee
Contributor Author

jorgee commented Jan 22, 2026

Finally, I was able to reproduce the error with jorgee/nf-test -r pod_ttl-exitcode. I had been trying to reproduce it with the wrong version.

The problem is the pipe in the pod execution command. It is returning the tee exitcode instead of the .command.run exitcode.
I think the hardcoded exitcode: 0 in the fallback mechanism could also create problems with #6484, so I have kept its removal in the PR.

I have built an image with the PR changes (jorgeejarquea/nextflow:25.12.0-edge-ae3419ebe) for testing, and it works in my case. I used the following command:

nextflow kuberun -head-image jorgeejarquea/nextflow:25.12.0-edge-ae3419ebe jorgee/nf-test -r pod_ttl-exitcode -latest -v nextflow-pvc:/mnt/data/launch

@BioWilko, are you able to test your pipeline with this PR and check whether the issue is also solved in your case?

@pditommaso, no need to backport, as the issue is not reproduced in 25.10.x.

@jorgee jorgee requested a review from pditommaso January 22, 2026 11:00
@jorgee jorgee marked this pull request as ready for review January 22, 2026 11:00
@BioWilko
Contributor

@jorgee I'm trying to get this running but the pipeline is an old one with some quirky syntax so it's failing in a lot of places, this new version is wayyyyyy more picky about syntax

@jorgee
Contributor Author

jorgee commented Jan 22, 2026

@jorgee I'm trying to get this running but the pipeline is an old one with some quirky syntax so it's failing in a lot of places, this new version is wayyyyyy more picky about syntax

Ah, I think that must be because the main branch uses the V2 parser by default. You can switch to the V1 parser by setting the environment variable NXF_SYNTAX_PARSER=v1

@BioWilko
Contributor

Ah, I think that must be because the main branch uses the V2 parser by default. You can switch to the V1 parser by setting the environment variable NXF_SYNTAX_PARSER=v1

Yep that sorted it, thanks!

Everything seems to be working as it should now on my infra :) thanks a lot!

@jorgee jorgee requested a review from bentsherman January 22, 2026 13:11
@bentsherman bentsherman merged commit 5730679 into master Jan 23, 2026
41 of 42 checks passed
@bentsherman bentsherman deleted the fix-k8s-job-fallback-exitcode branch January 23, 2026 16:00
@erasmussenilm

@jorgee / @bentsherman - can this fix please be incorporated in a v25.10.x patch? The original changed behavior (returning 0 for failing tasks) required us to add 1-2 weeks of developer work to handle generally. In addition, when combined with other changes from v22 and v24, we had a customer experience a task failing completely silently upon upgrade, but without generating any of the .command.begin [...] .exitcode files. The cause was that the main container failed immediately (but entirely silently w/o errors and with 0 exit code) due to other changed behaviors related to using command vs args and ENTRYPOINT/CMD overrides.

We would really appreciate being able to bump our v25.10 patch number, especially since we don't have plans for upgrading to a new major/minor version until after v26.10 comes out.

@bentsherman
Member

@erasmussenilm I don't think it makes sense to backport this change since it builds on previous changes that were added in 25.11.0-edge

It will be available in the next stable release, 26.04, which usually comes out in early May

Linked issue: Exitcode being ignored in k8s in some cases
