Description
Snakemake v6.3
Describe the bug
When submitting many jobs, it is possible for PuLP's solver to fill /tmp to 100% with temporary files, which in HPC environments results in angry emails.
Longer story: we have a student whose Snakefile runs blastx on 160k input files and then concatenates the output. The student was running into mysterious 'No space left on device' errors during the 'Select jobs to execute...' step. After a bit of digging (prompted by a slightly upset email from an HPC maintainer), we realised that pulp's mps_lp.py was writing tens of GB into the login node's /tmp, which had therefore filled up. This is the line causing the error: https://github.com/coin-or/pulp/blob/df9d41dd07d7fd65851db7e4cf13a75f540c982c/pulp/mps_lp.py#L245
Pulp is normally good at cleaning up its temporary files, but the student was hitting Ctrl-C too often for the cleanup to finish, so /tmp still contained a few 10 GB files from previous runs.
I don't even know if this counts as a bug, but at least others will now find our workaround via Google.
Workaround:
At the top of the Snakefile we now have:

import os
os.environ['TMPDIR'] = os.environ['MYSCRATCH']
because pulp checks the TMPDIR environment variable and falls back to /tmp if it is not set: https://github.com/coin-or/pulp/blob/a293fff94f9da90ced4f34cfa51bf90cdd7cd81d/pulp/apis/core.py#L446
This puts the potentially large temporary files into our /scratch (pointed to by our site's MYSCRATCH environment variable), which is slower but avoids angry emails.
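As a quick sanity check that the redirection actually takes effect, here is a small self-contained sketch. The `scratch_tmp` path is made up for illustration; on a cluster you would use `$MYSCRATCH`. pulp reads TMPDIR itself (per the linked core.py line); the sketch uses Python's `tempfile` module to demonstrate the same redirection:

```python
import os
import tempfile

# "scratch" stands in for the cluster's $MYSCRATCH path (assumption:
# any large, writable filesystem will do).
scratch = os.path.join(os.getcwd(), "scratch_tmp")
os.makedirs(scratch, exist_ok=True)

# Point TMPDIR at scratch before any temp files are created.
os.environ["TMPDIR"] = scratch

# tempfile caches its choice of directory; clearing the cache makes it
# re-read TMPDIR on the next call.
tempfile.tempdir = None
assert tempfile.gettempdir() == scratch

# Any subsequently created temp file now lands on scratch, not /tmp.
with tempfile.NamedTemporaryFile() as f:
    assert f.name.startswith(scratch)
```

Doing the assignment at the very top of the Snakefile matters: the variable must be set before pulp (via Snakemake's scheduler) decides where to write.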
(Edit: I guess you could also change the --scheduler to greedy?)
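For completeness, the greedy scheduler is selected on the command line; the `--scheduler greedy` flag is Snakemake's own (available since 5.30), while the other options shown are placeholders:

```shell
# Bypass the ILP scheduler (and therefore pulp) entirely.
# --cores 24 is illustrative; use your site's usual invocation.
snakemake --scheduler greedy --cores 24
```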
Logs
Job counts:
count jobs
1 all
164574 blastx
1 merge_blasts
164576
Select jobs to execute...
Traceback (most recent call last):
File "longpath_here_redacted/pulp/mps_lp.py", line 230, in writeMPS
f.write(''.join(columns_lines))
OSError: [Errno 28] No space left on device
During handling of the above exception, another exception occurred:
OSError: [Errno 28] No space left on device
[some more stuff here which repeats the whole traceback; I'm typing this from the student's screenshot, skipping ahead to]
/snakemake/scheduler.py", line 766, in _solve_ilp
prob.solve(solver)
Minimal example
It's more about the number of input files than the Snakefile itself; here you go:
# check if logfile exists or make new if it doesn't
import os
if not os.path.exists("slurm_logs"):
    os.mkdir("slurm_logs")

# define samples used for the whole process
(IDS,) = glob_wildcards('contigs.fa.0.split/{id}.0')

# this rule collects results and is the final step of the code
# once the input of rule all is reached, snakemake knows that it's finished
rule all:
    input:
        'merged_results/all_contig_blasts.txt'

rule blastx_slices:
    input:
        "contigs.fa.0.split/{id}.0",
    output:
        "slice_blasts/{id}.txt"
    log:
        "logs/{id}.blastx.log",
    benchmark:
        "benchmarks/{id}.blastx.benchmark"
    resources:
        cpus=24,
        time="01:00:00",
        cluster="magnus",
        mem="58G",
    shell:
        """blastn -db nt -outfmt '6 qseqid sseqid qlen slen length evalue bitscore staxid ssciname sskingdom' -max_target_seqs 1 -max_hsps 1 -query {input} -num_threads 24 > {output} 2> {log}"""

rule merge_blasts:
    input:
        expand("slice_blasts/{id}.txt", id=IDS),
    output:
        'merged_results/all_contig_blasts.txt'
    log:
        "logs/merge_blasts.log",
    benchmark:
        "benchmarks/merge_blasts.benchmark"
    resources:
        cpus=1,
        time="24:00:00",
        cluster="magnus",
        mem="58G",
    shell:
        """ cat {input} > {output}"""
Create >100k input files and you should also end up with >10 GB of temporary files in /tmp.
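To reproduce without real data, the input slices can be faked with empty files matching the `glob_wildcards` pattern above. The loop below makes 1000 for speed; bump N past 100000 to actually trigger the large MPS files:

```shell
# Fabricate empty "contig slices" named like contigs.fa.0.split/{id}.0.
# N=1000 here just to keep the loop fast; the issue needs >100000 files.
N=1000
mkdir -p contigs.fa.0.split
for i in $(seq 1 "$N"); do
    : > "contigs.fa.0.split/contig_${i}.0"
done

# Count the generated inputs.
ls contigs.fa.0.split | wc -l
```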
Additional context
Thank you for your work on this fantastic project!
The student installed Snakemake via mamba in a clean environment; their pulp version is 2.4, Python is 3.9, Snakemake is 6.3.0.