
Save .command.log for k8s tasks #2782

Closed
bentsherman wants to merge 13 commits into master from issue-2364_k8s-command-log

Conversation

@bentsherman
Member

Resolves #2364

This PR updates the K8s task handler to always save .command.log. I also removed some code that saved the log for failed pods in a more roundabout way.

However, although the build succeeds and the relevant tests pass, I haven't gotten it to work in my local environment. The task pod command is still:

    command:
    - '/bin/bash'
    - '-ue'
    - '.command.run'

when it should be:

    command:
    - '/bin/bash'
    - '-ue'
    - 'bash .command.run 2>&1 | tee .command.log'

Am I missing something?
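As background to the failure seen later in this thread: bash treats a plain (non-option) argument as a script *filename*, so the whole pipeline string cannot simply replace `.command.run` in the argument list — it must be parsed with `-c`. A minimal demonstration of the difference, using hypothetical temp-file names:

```shell
# bash treats a plain argument as a script *file* to execute, so handing it
# the whole pipeline as one string fails; -c makes bash parse it as a command.
printf 'echo hello\n' > /tmp/demo.command.run

# Without -c: bash looks for a file literally named "bash /tmp/... | tee ..."
bash -ue 'bash /tmp/demo.command.run 2>&1 | tee /tmp/demo.command.log' \
    2> /tmp/demo.err || true
grep 'No such file' /tmp/demo.err

# With -c: the pipeline runs and the output is captured in the log file
bash -ue -c 'bash /tmp/demo.command.run 2>&1 | tee /tmp/demo.command.log'
```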

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman requested a review from pditommaso April 8, 2022 17:24
@pditommaso
Member

Has this been tested?

@pditommaso
Member

Do you think there's a real need for this log, or is it just for consistency with the other executors?

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

There are cases where things get printed to .command.log but not to .command.out / .command.err. I ran into these cases during my PhD work. But it would also make the Nextflow / k8s experience more consistent with the other executors.

I think these changes are correct, but it wasn't working during the full test because kuberun wasn't using my development version of Nextflow on the k8s cluster. I assume autoMountHostPaths allows you to bypass that on a local k8s cluster, so I will try that.

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

Okay, I figured out how to test with k8s like you showed, but I hit some weird errors. So I backed up to master, tried again, and got this error:

 ./launch.sh -c k8s.config run hello
N E X T F L O W  ~  version 22.05.0-SNAPSHOT
Launching `https://github.com/nextflow-io/hello` [mighty_goldwasser] DSL2 - revision: 4eab81bd42 [master]
executor >  k8s (4)
[77/d802dd] process > sayHello (1) [ 75%] 3 of 4, failed: 3
Error executing process > 'sayHello (3)'

Caused by:
  Process `sayHello (3)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  echo 'Hello world!'

Command exit status:
  -

Command output:
  (empty)

Command wrapper:
  /bin/bash: .command.run: No such file or directory

Work dir:
  /home/bent/workspace/nextflow-io_nextflow/work/eb/7880bd01e64a40002f55f7ade78c6b

However, I don't see anything wrong with the pod yaml:

apiVersion: v1
kind: Pod
metadata:
  name: nf-eb7880bd01e64a40002f55f7ade78c6b
  namespace: tower-nf
  labels: {app: nextflow, taskName: sayHello_3, sessionId: uuid-0b909bb8-de05-49e2-99b1-d4d07b3aad1c,
    processName: sayHello, runName: mighty_goldwasser}
spec:
  restartPolicy: Never
  containers:
  - name: nf-eb7880bd01e64a40002f55f7ade78c6b
    image: quay.io/nextflow/bash
    command: [/bin/bash, -ue, /home/bent/workspace/nextflow-io_nextflow/work/eb/7880bd01e64a40002f55f7ade78c6b/.command.run]
    resources:
      requests: {cpu: 1}
      limits: {cpu: 1}
    volumeMounts:
    - {name: vol-1, mountPath: /home/bent/workspace/nextflow-io_nextflow/work/eb/7880bd01e64a40002f55f7ade78c6b}
  volumes:
  - name: vol-1
    hostPath: {path: /home/bent/workspace/nextflow-io_nextflow/work/eb/7880bd01e64a40002f55f7ade78c6b}

It says it can't find .command.run but the work dir is mounted to the pod and .command.run is there.

@pditommaso
Member

pditommaso commented May 27, 2022

It says .command.log: No such file or directory

/Users/pditommaso/Projects/nextflow/work/49/c3e72cd5686f8a328e553a0b482619/.command.run 2>&1 | tee /Users/pditommaso/Projects/nextflow/work/49/c3e72cd5686f8a328e553a0b482619/.command.log: No such file or directory

A bit weird, though.

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Member

OK, found the reason: the pipe concatenation should be executed as a subshell. But I'm still not sure; I don't see big value in adding this log, because this is running within the container.

It only helps when .command.run crashes... (which is the reason why you want to add it :))
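With the subshell fix described above, the generated container command would presumably look something like the following (a sketch only, not the exact spec Nextflow emits):

```yaml
command:
- '/bin/bash'
- '-c'
- 'bash .command.run 2>&1 | tee .command.log'
```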

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

I believe the original motivation was that beforeScript and afterScript are only saved to .command.log. See the related issue for details.

Even if it's a small improvement, why not do it? The awsbatch executor does the same thing for a similar reason:

protected List<String> getSubmitCommand() {
    // the cmd list to launch it
    def opts = getAwsOptions()
    def cli = opts.getAwsCli()
    def debug = opts.debug ? ' --debug' : ''
    def sse = opts.storageEncryption ? " --sse $opts.storageEncryption" : ''
    def kms = opts.storageKmsKeyId ? " --sse-kms-key-id $opts.storageKmsKeyId" : ''
    def aws = "$cli s3 cp --only-show-errors${sse}${kms}${debug}"
    def cmd = "trap \"{ ret=\$?; $aws ${TaskRun.CMD_LOG} s3:/${getLogFile()}||true; exit \$ret; }\" EXIT; $aws s3:/${getWrapperFile()} - | bash 2>&1 | tee ${TaskRun.CMD_LOG}"
    // final launcher command
    return ['bash','-o','pipefail','-c', cmd.toString() ]
}
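The interesting part of that launcher is the `trap ... EXIT`, which uploads `.command.log` even when the wrapped command fails, while still propagating the original exit status. A simplified, self-contained sketch of the same pattern (the marker-file path here is hypothetical, standing in for the S3 upload):

```shell
# On EXIT, run the "upload" step (simulated here by writing a marker file)
# and re-exit with the original status captured in ret.
script="trap 'ret=\$?; echo saved-log > /tmp/demo-upload.txt; exit \$ret' EXIT; echo task-output"
bash -o pipefail -c "$script"
cat /tmp/demo-upload.txt
```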

@bentsherman
Member Author

Also, thanks for the fix. I thought it was already executing in a shell but I forgot about the -c option.

@pditommaso
Member

Even if it's a small improvement, why not do it?

Well, mainly to prevent the creation of extra files, but I agree it can be useful for debugging the launcher script, and it's consistent with the existing pattern.

Can you please have a look at the failing tests, and then I'll merge it. Thanks!

bentsherman and others added 2 commits May 27, 2022 13:32
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@pditommaso pditommaso added this to the 22.10.0 milestone May 27, 2022
@pditommaso pditommaso added this to the 23.04.0 milestone Oct 14, 2022
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 2c461f8 to 23432f3 Compare November 24, 2022 19:15
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from e2b4a93 to f32ea0b Compare December 8, 2022 15:16
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from cefb067 to e523afd Compare December 22, 2022 20:43
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 0d59b4c to b93634e Compare March 11, 2023 11:20
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman
Member Author

@pditommaso This PR is old so I updated it and cleaned it up. It is ready to merge whenever you want.

@pditommaso pditommaso modified the milestones: 23.04.0, 23.10.0 Apr 9, 2023
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@pditommaso
Member

I believe fetching the log over the stream is better, because it allows capturing the error output even when .command.log cannot be accessed (for example, when Fusion is unable to mount the bucket).

Likely the best would be to implement both approaches (.command.log, with a fallback to the stream if it's missing), but I think it's overkill without a real need.
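The "both approaches" idea could be sketched roughly as follows (hypothetical helper names; `fetch_pod_log` stands in for a `kubectl logs`-style stream read):

```shell
# Stub standing in for streaming the log from the k8s API
fetch_pod_log() { echo "streamed-from-pod"; }

# Prefer the saved .command.log; fall back to the stream if it is
# missing or empty.
read_task_log() {
    workdir=$1
    if [ -s "$workdir/.command.log" ]; then
        cat "$workdir/.command.log"
    else
        fetch_pod_log
    fi
}

mkdir -p /tmp/demo-task
echo "from-file" > /tmp/demo-task/.command.log
read_task_log /tmp/demo-task        # prints: from-file
read_task_log /tmp/no-such-workdir  # prints: streamed-from-pod
```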

Closing without merging. We can reconsider it in the future.

@pditommaso pditommaso closed this Sep 3, 2023
@bentsherman
Member Author

Fetching the log in this manner does not work when the pod is deleted immediately, for example:

WARN: Failed to copy log for job nf-b29b6c384d74ee3b41844239686e8a6c

@bentsherman
Member Author

Interestingly this particular failure happens because the file already exists:

java.nio.file.FileAlreadyExistsException: /scidev-eu-west-2/scratch/4Yy5aUWbYwTXhy/b2/9b6c384d74ee3b41844239686e8a6c/.command.log

But I checked this file and it is empty. Does Fusion create this file automatically?

@pditommaso
Member

The pod should not be deleted


Development

Successfully merging this pull request may close these issues.

Task .command.log not saved by k8s executor
