Description
Snakemake version
8.16.0
Describe the bug
I am trying to run Snakemake with the --immediate-submit flag so that all jobs are submitted to a SLURM cluster at once, without waiting for the input files of downstream jobs to exist. This flag worked for me on the same SLURM cluster with Snakemake v5.31. After migrating to Snakemake v8.16, immediate submit no longer works as expected. I have adapted my Snakefile and configuration to the changes in version 8, and a simple workflow runs correctly, but only without the --immediate-submit flag. With --immediate-submit enabled, Snakemake still appears to check for the output files of each submitted job before moving on to the next step, even while that job is still running. As a result, job submission terminates prematurely with a MissingOutputException. Is this behaviour a bug, or is there a specific configuration in version 8 that I need to set? Any advice would be much appreciated. Thanks.
Logs
Running Snakemake without the --immediate-submit flag works fine:
Command executed: snakemake -s snakefile --profile slurm
Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 999
Job stats:
job      count
-----  -------
all          1
step1        1
step2        1
step3        1
total        4
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:37:52 2024]
rule step1:
input: input.txt
output: output1.txt
jobid: 3
reason: Missing output files: output1.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus=1
cat input.txt > output1.txt
Submitted job 3 with external jobid '1089'.
[Wed Jul 31 10:38:02 2024]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:38:02 2024]
rule step2:
input: output1.txt
output: output2.txt
jobid: 2
reason: Missing output files: output2.txt; Input files updated by another job: output1.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus=1
sleep 20; head -n2 output1.txt > output2.txt
Submitted job 2 with external jobid '1090'.
[Wed Jul 31 10:38:32 2024]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:38:32 2024]
rule step3:
input: output2.txt
output: output3.txt
jobid: 1
reason: Missing output files: output3.txt; Input files updated by another job: output2.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus=1
head -n1 output2.txt > output3.txt
Submitted job 1 with external jobid '1091'.
[Wed Jul 31 10:38:42 2024]
Finished job 1.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:38:42 2024]
localrule all:
input: output3.txt
jobid: 0
reason: Input files updated by another job: output3.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=/tmp, cpus=1
[Wed Jul 31 10:38:42 2024]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2024-07-31T103751.980121.snakemake.log
Running Snakemake with the --immediate-submit flag fails to submit all jobs and exits prematurely with the following error:
Command executed: snakemake -s snakefile --profile slurm --immediate-submit --notemp
Using profile slurm for setting default command line arguments.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided remote nodes: 999
Job stats:
job      count
-----  -------
all          1
step1        1
step2        1
step3        1
total        4
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:47:31 2024]
rule step1:
input: input.txt
output: output1.txt
jobid: 3
reason: Missing output files: output1.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus=1
cat input.txt > output1.txt
Submitted job 3 with external jobid '1092'.
Waiting at most 5 seconds for missing files.
[Wed Jul 31 10:47:33 2024]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Wed Jul 31 10:47:33 2024]
rule step2:
input: output1.txt
output: output2.txt
jobid: 2
reason: Missing output files: output2.txt; Input files updated by another job: output1.txt
resources: mem_mb=2000, mem_mib=1908, disk_mb=1000, disk_mib=954, tmpdir=<TBD>, cpus=1
sleep 20; head -n2 output1.txt > output2.txt
Submitted job 2 with external jobid '1093'.
Waiting at most 5 seconds for missing files.
MissingOutputException in rule step2 in file /mnt/scratch2/users/3056021/sm_centos8/immSub/snakefile, line 11:
Job 2 completed successfully, but some output files are missing. Missing files after 5 seconds. This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait:
output2.txt (missing locally, parent dir not present)
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-07-31T104731.768237.snakemake.log
WorkflowError:
At least one job did not complete successfully.
Minimal example
Content of the snakefile:
rule all:
    input:
        'output3.txt'

rule step1:
    input:
        'input.txt'
    output:
        'output1.txt'
    shell:
        'cat {input} > {output}'

rule step2:
    input:
        'output1.txt'
    output:
        'output2.txt'
    shell:
        'sleep 20; head -n2 {input} > {output}'

rule step3:
    input:
        'output2.txt'
    output:
        'output3.txt'
    shell:
        'head -n1 {input} > {output}'
Additional context
config.yaml file:
executor: cluster-generic
jobs: 999
default-resources: [cpus=1, mem_mb=2000]
cluster-generic-submit-cmd: "./slurm/sbatch.py {resources.cpus} {resources.mem_mb} {rule} {dependencies}"
max-status-checks-per-second: 10
rerun-incomplete: True
scheduler: greedy
keep-going: True
printshellcmds: True
show-failed-logs: True
sbatch.py
#!/usr/bin/env python3
import os
import sys
import subprocess

from snakemake.utils import read_job_properties

# last command-line argument is the job script -- required to submit jobs to slurm
jobscript = sys.argv[-1]
cpu = sys.argv[1]
mem = sys.argv[2]
rulename = sys.argv[3]
dependencies = set(sys.argv[4:-1])

cmdline = ["sbatch --chdir ./log_hpc --output=%j.out --error=%j.err --nodes=1 --parsable"]
cmdline.append("--job-name="+rulename)
cmdline.append("--ntasks="+cpu)
cmdline.append("--mem="+str(mem))
cmdline.append("--partition=k2-hipri --time=2:59:00")
if dependencies:
    cmdline.append("--dependency")
    cmdline.append( "afterok:" + ",".join(dependencies))
cmdline.append(jobscript)

# Constructs and submits
cmdline = " ".join(cmdline)
print (cmdline)
os.system(cmdline)
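
For comparison, here is a minimal sketch of the same wrapper written with an argument list instead of a shell string. It is not a confirmed fix for the MissingOutputException above, only a variant that may make the submit step easier to debug. The function names build_sbatch_command, parse_job_id, and main are hypothetical. The sbatch options are the ones from the script above. Two things differ and are my assumptions: job IDs in the dependency list are joined with ":" (the form the sbatch manual documents for several IDs under one afterok), and the command line is echoed to stderr so that stdout carries only the job ID that Snakemake reads back.

```python
#!/usr/bin/env python3
"""Sketch of a submit wrapper; an illustration, not a confirmed fix."""
import subprocess
import sys


def build_sbatch_command(cpu, mem, rulename, dependencies, jobscript):
    """Return the sbatch invocation as an argument list (no shell quoting needed)."""
    cmd = [
        "sbatch",
        "--parsable",
        "--chdir=./log_hpc",
        "--output=%j.out",
        "--error=%j.err",
        "--nodes=1",
        f"--job-name={rulename}",
        f"--ntasks={cpu}",
        f"--mem={mem}",
        "--partition=k2-hipri",
        "--time=2:59:00",
    ]
    # Drop empty entries and duplicates; sort so the command is deterministic.
    dep_ids = sorted(d for d in set(dependencies) if d.strip())
    if dep_ids:
        # Assumption: several IDs under one dependency type are joined with ":".
        cmd.append("--dependency=afterok:" + ":".join(dep_ids))
    cmd.append(jobscript)
    return cmd


def parse_job_id(sbatch_stdout):
    """With --parsable, sbatch prints 'jobid' or 'jobid;cluster'; keep the ID."""
    return sbatch_stdout.strip().split(";")[0]


def main(argv):
    """Same argument layout as above: cpu, mem, rule, dependencies..., jobscript."""
    cpu, mem, rulename = argv[1:4]
    dependencies, jobscript = argv[4:-1], argv[-1]
    cmd = build_sbatch_command(cpu, mem, rulename, dependencies, jobscript)
    # Echo the command to stderr so stdout contains only the job ID.
    print(" ".join(cmd), file=sys.stderr)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    print(parse_job_id(result.stdout))
```

To use it as the submit script, add a call to main(sys.argv) at the bottom of the file. The command construction can be checked without a cluster by calling build_sbatch_command directly, for example with dependencies=["1092"] for step2.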