Description
Snakemake version
8.4.11
Describe the bug
I'm running a snakemake workflow on a SLURM cluster (using snakemake-executor-plugin-cluster-generic), using cluster nodes with 48 cores and a head node with 8 cores.
My profile file looks like this:
jobs: 600
cores: 48
local-cores: 4
My workflow has a rule which waits for all samples to complete, then executes an I/O intensive step for each sample. To minimize filesystem burden, I specify this rule as a localrule (due to filesystem sync issues between cluster nodes).
However, when Snakemake executes this step, it launches up to 600 jobs on the local node instead of 4. I understand that this is consistent with how --jobs is described; however, it seems sensible that Snakemake should consider the resources available on the local node (here, local-cores: 4) when deciding how many local jobs to run simultaneously.
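To make the expectation concrete, here is a minimal Python sketch of the arithmetic behind it. This is not Snakemake's scheduler code; the function name `expected_local_concurrency` and its parameters are hypothetical, and it assumes each local job uses one thread (test_rule declares no `threads`, so Snakemake's default of 1 applies).

```python
def expected_local_concurrency(jobs: int, local_cores: int, threads_per_job: int = 1) -> int:
    """Return how many local jobs the reporter expects to run at once.

    Hypothetical helper for illustration only; it does not reflect how
    Snakemake actually schedules local rules.
    """
    if threads_per_job < 1:
        raise ValueError("threads_per_job must be at least 1")
    # Local jobs are limited by the cores of the head node ...
    limit_by_cores = local_cores // threads_per_job
    # ... and can never exceed the overall job limit.
    return max(0, min(jobs, limit_by_cores))


# Values taken from the profile in this report: jobs=600, local-cores=4.
print(expected_local_concurrency(jobs=600, local_cores=4))
```

With the profile values from this report the sketch yields 4, whereas the observed behaviour is 600 simultaneous local jobs.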
Logs
Minimal example
Example Snakefile
    #!/usr/bin/env snakemake

    localrules: test_rule

    samples=list(x for x in range(0,1000))

    rule test_rule:
        output:
            txt="output/{sample}.txt"
        shell:
            """
            echo SCIENCE > {output.txt} &&
            sleep 10
            """

    rule all:
        input:
            expand("output/{sample}.txt", sample=samples)
Profile
    executor: cluster-generic
    cluster-generic-submit-cmd:
      mkdir -p results/logs/cluster/{rule}/ &&
      sbatch
        --parsable
        --cpus-per-task={threads}
        --time={resources.time}
        --mem={resources.mem_mb}
        --job-name=smk-{rule}
        --output=results/logs/cluster/{rule}/{jobid}.out
        --error=results/logs/cluster/{rule}/{jobid}.err
        --partition={resources.partition}
    default-resources:
      - time=1440
      - partition='large'
      - tmpdir='/tmp'
    local-cores: 4
    jobs: 600
    cores: 48 # Match maximum node size.
    latency-wait: 120
    keep-going: True
    rerun-incomplete: True
    printshellcmds: True
    scheduler: greedy
    use-conda: True
    conda-frontend: mamba
    cluster-generic-cancel-cmd: scancel
Log of running the above Snakefile using cluster-generic
    Building DAG of jobs...
    Using shell: /usr/bin/bash
    Provided remote nodes: 600
    Job stats:
    job          count
    ---------  -------
    all              1
    test_rule     1000
    total         1001

    Select jobs to execute...
    Execute 600 jobs...

    [Mon May 27 10:56:12 2024]
    localrule test_rule:
        output: output/141.txt
        jobid: 142
        reason: Missing output files: output/141.txt
        wildcards: sample=141
        resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/tmp, time=1440, partition=large

    echo SCIENCE > output/141.txt &&
    sleep 10
    ...more job messages...
Additional context