Fix K8s job fallback to not return hardcoded exit code #6746
bentsherman merged 3 commits into master
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Should this be backported to 25.10.x?
Sorry, I have moved this to draft. We still need to validate that it fixes the reported issue. I tried to reproduce the issue but I couldn't.
Finally, I was able to reproduce the error with jorgee/nf-test -r pod_ttl-exitcode. I had been trying to reproduce it with the wrong version. The problem is the pipe in the pod execution command. It is returning the [...] I have generated an image for testing with the PR changes ([...]). @BioWilko, are you able to test your pipeline with this PR and check whether the issue is also solved in your case? I think the hardcoded 0 in the fallback mechanism could also create problems with #6484, so I have kept that change in the PR. @pditommaso, no need to backport, as the issue is not reproduced in 25.10.x.
@jorgee I'm trying to get this running, but the pipeline is an old one with some quirky syntax, so it's failing in a lot of places. This new version is wayyyyyy more picky about syntax.
Ah, I think that must be because the main branch uses the V2 parser by default. You can select the V1 parser by setting this env var [...]
Yep that sorted it, thanks! Everything seems to be working as it should now on my infra :) thanks a lot!
@jorgee / @bentsherman - can this fix please be incorporated in a v25.10.x patch? The original changed behavior (returning 0 for failing tasks) required us to add 1-2 weeks of developer work to handle generally. In addition, when combined with other changes from v22 and v24, we had a customer experience a task failing completely silently upon upgrade, but without generating any of the .command.begin [...] .exitcode files. The cause was that the main container failed immediately (but entirely silently w/o errors and with 0 exit code) due to other changed behaviors related to using command vs args and ENTRYPOINT/CMD overrides. We would really appreciate being able to bump our v25.10 patch number, especially since we don't have plans for upgrading to a new major/minor version until after v26.10 comes out.
@erasmussenilm I don't think it makes sense to backport this change, since it builds on previous changes that were added in 25.11.0-edge. It will be available in the next stable release, 26.04, which usually comes out in early May.
Summary
Fixes #6636 - K8s executor was incorrectly reporting exit code 0 for failed processes. This PR addresses two separate but related issues that combined to cause the problem.
Problem 1: Hardcoded Exit Code in Fallback Logic
When a Kubernetes job completes and its pod is garbage collected (via ttlSecondsAfterFinished), the jobStateFallback0 method was returning a hardcoded exit code of 0. This caused failed tasks to be incorrectly reported as successful when the pod was cleaned up before Nextflow could retrieve the actual exit code from the pod's container status.
The issue was introduced by the interaction of:
k8sExitCode != null ? k8sExitCode : readExitFile()
Root Cause
In K8sClient.jobStateFallback0(), when the pod is gone but the job shows succeeded: 1, the code created a dummy status that included exitcode: 0. This hardcoded 0 prevented the fallback to reading the actual exit code from the .exitcode file.
Solution
Remove the hardcoded exitcode field from the dummy status. When exitCode is not present (null), K8sTaskHandler will properly fall back to reading the actual exit code from the .exitcode file, which contains the correct value.
Problem 2: Missing pipefail in Pipeline Command
The K8s executor was not properly capturing exit codes from failed tasks because, by default, a bash pipeline returns the exit code of its last command (tee), not that of the actual script.
Root Cause
The command was:
bash -ue -c "bash .command.run 2>&1 | tee .command.log"
When .command.run fails with exit code 3, the pipeline returns 0 because tee succeeds, and bash returns the exit code of the last command in the pipeline.
Solution
Add the pipefail option to make bash return the exit code of the last command in the pipeline that exits with a non-zero status, which here is the failing script:
bash -ue -o pipefail -c "bash .command.run 2>&1 | tee .command.log"
This aligns with how other executors handle this (Azure Batch, Google Batch) and ensures proper error propagation.
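The difference between the two command lines can be checked locally with a small sketch. It assumes a stand-in .command.run that prints one line and exits with code 3, written to a temporary directory; the script contents and variable names are illustrative and are not taken from Nextflow.

```shell
#!/usr/bin/env bash
# Stand-in for .command.run (an assumption for this demo): fails with exit code 3.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'echo "task output"\nexit 3\n' > .command.run

# Without pipefail: the pipeline status is that of tee, the last command.
bash -ue -c "bash .command.run 2>&1 | tee .command.log" > /dev/null
status_without=$?

# With pipefail: the pipeline status is that of the failing script.
bash -ue -o pipefail -c "bash .command.run 2>&1 | tee .command.log" > /dev/null
status_with=$?

echo "without pipefail: $status_without"
echo "with pipefail:    $status_with"
```

In both runs .command.log still receives the task output, so the change affects only the status the container reports.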
Changes
- Remove exitcode: 0 from jobStateFallback0 dummy status
- Add -o pipefail to classicSubmitCli() method
Test Plan
- should fallback to job status when pod is gone and not return hardcoded exit code
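The fallback that this test targets can be illustrated with a minimal shell sketch of the expression k8sExitCode != null ? k8sExitCode : readExitFile(). The function resolve_exit_code is hypothetical and is not Nextflow code; an empty string stands in for null, and the .exitcode file is created by hand for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not Nextflow code): prefer the exit code reported by
# Kubernetes; if none is present, read the one the task wrote to .exitcode.
resolve_exit_code() {
    local k8s_exit_code="$1"   # empty string stands in for null
    local exit_file="$2"
    if [ -n "$k8s_exit_code" ]; then
        echo "$k8s_exit_code"
    else
        cat "$exit_file"
    fi
}

workdir=$(mktemp -d)
echo 3 > "$workdir/.exitcode"   # the task actually failed with exit code 3

# Before the fix: the dummy status carried a hardcoded 0, so the file was never read.
before=$(resolve_exit_code "0" "$workdir/.exitcode")
# After the fix: the dummy status carries no exit code, so the file is used.
after=$(resolve_exit_code "" "$workdir/.exitcode")

echo "hardcoded dummy status: $before"
echo "exit file fallback:     $after"
```

With the hardcoded value in place the first branch always wins, which is why the failed task was reported as successful.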