pulp filling up /tmp to 100% for many jobs - with workaround? #1003

@philippbayer

Snakemake v6.3

Describe the bug
When many jobs are submitted, PuLP's solver can fill /tmp to 100% with temporary files, which in HPC environments results in angry emails.

Longer story: We have a student who was submitting a Snakefile that runs blastx for 160k input files and then concatenates the output. The student kept running into mysterious 'No space left on device' errors during the 'Select jobs to execute...' step. After a bit of digging, and a simultaneous email from a slightly upset HPC maintainer, we realised that pulp's mps_lp.py was writing tens of GB into the login node's /tmp, which had therefore filled up. This is the line causing the error: https://github.com/coin-or/pulp/blob/df9d41dd07d7fd65851db7e4cf13a75f540c982c/pulp/mps_lp.py#L245
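
For anyone hitting the same error, checking how full the temporary directory is can confirm the diagnosis quickly. Below is a minimal sketch using only the standard library; the helper name `tmp_usage_percent` is made up for illustration. Note that Python's `tempfile.gettempdir()` may not pick the same directory as pulp, so pass the path explicitly (e.g. "/tmp") if in doubt.

```python
import shutil
import tempfile


def tmp_usage_percent(path=None):
    """Return how full the filesystem holding `path` is, in percent."""
    # tempfile.gettempdir() looks at TMPDIR, TEMP and TMP before
    # falling back to platform defaults such as /tmp.
    path = path or tempfile.gettempdir()
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"{tempfile.gettempdir()} is {tmp_usage_percent():.1f}% full")
```

Running this on the login node before and during the 'Select jobs to execute...' step should show whether the scheduler is what is filling the disk.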

Pulp is normally good at cleaning up its temporary files, but the student was hitting CTRL-C too often for the cleanup to finish properly, so /tmp contained a few 10GB files from previous runs.
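
Leftover files from interrupted runs can be listed with a short script. This is only a sketch: the default glob pattern `*-pulp.*` is an assumption about how pulp names its temporary files, so check what is actually in your /tmp and adjust it. The function name `find_leftover_files` is hypothetical, and the script only reports files; it deletes nothing.

```python
import os
from pathlib import Path


def find_leftover_files(tmp_dir, pattern="*-pulp.*", min_bytes=0):
    """List (path, size) for files in tmp_dir matching pattern, largest first.

    The default pattern is an assumption about pulp's temporary file
    names; verify it against your own /tmp before relying on it.
    """
    found = []
    for path in Path(tmp_dir).glob(pattern):
        try:
            info = path.stat()
        except OSError:
            continue  # file vanished or is unreadable; skip it
        if not path.is_file():
            continue
        # Only report files owned by the current user (os.getuid is Unix-only).
        if info.st_uid != os.getuid():
            continue
        if info.st_size >= min_bytes:
            found.append((path, info.st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_leftover_files("/tmp", min_bytes=1024**3):
        print(f"{size / 1024**3:6.1f} GB  {path}")
```

Once the list looks right, the files can be removed by hand.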

I don't even know if this counts as a bug, but at least others will now find our workaround via Google.

Workaround:
At the top of this snakefile we now have

import os
os.environ['TMPDIR'] = os.environ['MYSCRATCH']

because pulp checks the TMPDIR environment variable and falls back to /tmp if it is not set: https://github.com/coin-or/pulp/blob/a293fff94f9da90ced4f34cfa51bf90cdd7cd81d/pulp/apis/core.py#L446
This puts the potentially large temporary files on our /scratch via our own environment variable, MYSCRATCH. That filesystem is slower, but it avoids angry emails.
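
The one-liner raises a KeyError on machines where MYSCRATCH is not defined. A slightly more defensive variant might look like the sketch below. MYSCRATCH is our site's variable; the function name `redirect_tmpdir` and the `snakemake_tmp` subdirectory are hypothetical choices for illustration.

```python
import os


def redirect_tmpdir(env=os.environ, scratch_var="MYSCRATCH", subdir="snakemake_tmp"):
    """Point TMPDIR at a directory under the scratch location, if one is set.

    Returns the new TMPDIR, or None when scratch_var is not defined,
    in which case the environment is left untouched.
    """
    scratch = env.get(scratch_var)
    if not scratch:
        return None
    tmp_dir = os.path.join(scratch, subdir)
    os.makedirs(tmp_dir, exist_ok=True)  # the directory must exist before use
    env["TMPDIR"] = tmp_dir
    return tmp_dir


# In the Snakefile this would be called once, near the top:
#     redirect_tmpdir()
```

Using a dedicated subdirectory keeps the solver's files apart from other data on scratch, which makes leftovers easier to spot and delete.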

(Edit: I guess you could also switch schedulers with --scheduler greedy?)

Logs

Job counts:
     count   jobs
     1       all
     164574  blastx
     1       merge_blasts
     164576
Select jobs to execute...
Traceback (most recent call last):
   File "longpath_here_redacted/pulp/mps_lp.py", line 230, in writeMPS
     f.write(''.join(columns_lines))
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError: [Errno 28] No space left on device

[some more stuff here which repeats the whole traceback, I'm typing this from the student's screenshot, going back to]

/snakemake/scheduler.py", line 766, in _solve_ilp
prob.solve(solver)

Minimal example
It's more about the number of input files than the Snakefile itself, but here you go:

# create the slurm_logs directory if it doesn't exist yet
import os

os.makedirs("slurm_logs", exist_ok=True)

# define samples used for the whole process
(IDS,) = glob_wildcards('contigs.fa.0.split/{id}.0')

# this rule collects results and is the final step of the code
# once the input of rule all is reached, snakemake knows that it's finished
rule all:
    input:
        'merged_results/all_contig_blasts.txt'

rule blastx_slices:
    input:
        "contigs.fa.0.split/{id}.0",
    output:
        "slice_blasts/{id}.txt"
    log:
        "logs/{id}.blastx.log",
    benchmark:
        "benchmarks/{id}.blastx.benchmark"
    resources:
        cpus=24,
        time="01:00:00",
        cluster="magnus",
        mem="58G",
    shell:
        """blastn -db nt -outfmt '6 qseqid sseqid qlen slen length evalue bitscore staxid ssciname sskingdom' -max_target_seqs 1 -max_hsps 1 -query {input} -num_threads 24 > {output} 2> {log}"""

rule merge_blasts:
    input:
        expand("slice_blasts/{id}.txt", id=IDS),
    output:
        'merged_results/all_contig_blasts.txt'
    log:
        "logs/merge_blasts.log",
    benchmark:
        "benchmarks/merge_blasts.benchmark"
    resources:
        cpus=1,
        time="24:00:00",
        cluster="magnus",
        mem="58G",
    shell:
        """ cat {input} > {output}"""

Create more than 100k input files and you should also see files larger than 10 GB in /tmp.
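
To build such a test set without real data, something like the following sketch could be used. The function name `make_dummy_inputs` and the placeholder sequences are made up. The file names follow the `contigs.fa.0.split/{id}.0` pattern from the Snakefile above.

```python
from pathlib import Path


def make_dummy_inputs(n, out_dir="contigs.fa.0.split"):
    """Create n tiny FASTA files named <i>.0, matching the glob_wildcards pattern."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(n):
        # Placeholder sequence; the content should not matter to the scheduler.
        (out / f"{i}.0").write_text(f">contig_{i}\nACGT\n")
    return out


if __name__ == "__main__":
    make_dummy_inputs(160_000)
```

If you try this, point TMPDIR away from a shared /tmp first, unless you want the angry emails too.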

Additional context
Thank you for your work on this fantastic project!

The student installed snakemake via mamba in a clean environment, their pulp version is 2.4, python is 3.9, snakemake is 6.3.0.

Labels: bug
