
A node_temp() modifier for temporary files on node-local storage? #1474

@DrYak


Is your feature request related to a problem? Please describe.

According to the official Snakemake documentation:

There is currently no way of doing this in Snakemake, but a possible workaround involves the shadow directive and setting the --shadow-prefix flag to e.g. /scratch.

Shadow is indeed one possible solution, but...

Describe the solution you'd like

I would like Snakemake to introduce some kind of node_temp() modifier.

If you squint at it, a lot of the necessary work already exists in Snakemake:

  • the pipe() modifier already specifies special files that:
    • make it mandatory to run both the producing rule and the consuming rule within the same group on the same node.
      • the difference is that pipe() requires both rules to run in parallel (and considers the sum of required resources).
      • whereas a putative node_temp() modifier would simply require both rules to run one after the other (and would consider the max of required resources), just like any other rules sharing a group: property.
    • are expected to only be alive for the duration of the group and not to outlive it or exist outside of it.
  • the temp() modifier already specifies files that:
    • will not be kept long term
    • will be cleaned up after all the rules finish.
    • the difference being (due to the scratch directory being node-local):
      • the main Snakemake process that dispatches to the cluster should not wait for the files to appear (if the files aren't visible by the end of --latency-wait, this should not be considered a failure).
      • the responsibility of cleaning this up shouldn't be on the shoulders of the main Snakemake (the one that submits jobs to the cluster), but on those of the job's Snakemake (the instance that is started at the entry point of the cluster job and which runs all the rules that are members of the group), as the latter has visibility of the node-local storage.
  • the resource tmpdir already takes care of passing/setting up a temp directory, so it should be possible to directly use it as a prefix for node_temp() (instead of requiring the user to add one manually, as in the case of shared scratch)
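
To make the contrast between pipe() and the proposed node_temp() concrete, here is a minimal Python sketch of the two behaviours described in the list above. It is not Snakemake code: `group_resources` and `node_temp_path` are hypothetical names invented for illustration, and the resource values and the `/scratch/job123` prefix are made up.

```python
import os
import tempfile


def group_resources(jobs, parallel):
    """Combine per-rule resource requests for a group of jobs.

    parallel=True mimics pipe(): the rules run at the same time, so
    their requests add up.
    parallel=False mimics the proposed node_temp(): the rules run one
    after the other, so the largest request is enough.
    """
    combine = sum if parallel else max
    keys = {key for job in jobs for key in job}
    return {key: combine(job.get(key, 0) for job in jobs) for key in sorted(keys)}


def node_temp_path(relative_path, tmpdir=None):
    """Hypothetical helper: place a node_temp() file under the job's tmpdir."""
    base = tmpdir if tmpdir is not None else tempfile.gettempdir()
    return os.path.join(base, relative_path)


# Made-up resource requests for a producing and a consuming rule.
producer = {"mem_mb": 4000, "threads": 4}
consumer = {"mem_mb": 6000, "threads": 2}

pipe_like = group_resources([producer, consumer], parallel=True)
node_temp_like = group_resources([producer, consumer], parallel=False)
print(pipe_like)       # sum of the requests
print(node_temp_like)  # max of the requests
print(node_temp_path("sample1/rejects.fastq.gz", tmpdir="/scratch/job123"))
```

With these made-up numbers, the pipe()-like group asks for 10000 MB and 6 threads, while the node_temp()-like group only asks for 6000 MB and 4 threads.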

Describe alternatives you've considered

  • shadow: does cover some possible use cases (uses within a single rule), but not all cases (e.g.: passing temporary files between rules).
  • temp() will not work, because node-local storage only exists during the execution of the group, not outside of it.
    • thus, as mentioned before, Snakemake errors out because it is waiting for a file that will never show up by the end of --latency-wait
  • pipe() has some interesting properties already (Snakemake automatically groups rules together), but it is specifically for pipes, not regular files, and it mandates that the rules run in parallel, which is not required for a regular file.
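
The --latency-wait problem with temp() can be illustrated with a small Python sketch. This is not how Snakemake is implemented: `wait_for_output` and its `node_local` flag are hypothetical, and they only model the behaviour requested in this issue (the submitting process should not treat an invisible node-local file as a failure).

```python
import os
import tempfile
import time


def wait_for_output(path, latency_wait, node_local=False, poll_interval=0.05):
    """Return True if the output can be accepted, False if it counts as missing."""
    if node_local:
        # Assumption: a node_temp() file lives on the compute node's local
        # disk, so the submitting process never expects to see it.
        return True
    deadline = time.monotonic() + latency_wait
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# A path that the submitting process will never see.
invisible = os.path.join(tempfile.mkdtemp(), "rejects.sam")

as_temp = wait_for_output(invisible, latency_wait=0.2)
as_node_temp = wait_for_output(invisible, latency_wait=0.2, node_local=True)
print(as_temp)       # the temp()-like check gives up after the wait
print(as_node_temp)  # the node_temp()-like check does not wait at all
```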

Additional context

An example use case is a Snakemake rule file that we have in V-pipe for depleting human-mapping reads from viral sequencing raw data (e.g. before uploading them onto ENA).
This generates a few intermediates:

  • a SAM file after aligning to the virus reference, from which to extract the alignment rejects.
  • the .fastq.gz file holding the alignment rejects.
  • a SAM file after aligning to the host (Homo sapiens) reference, from which to extract the list of read IDs to deplete.
  • a copy of the original .fastq.gz files with the above-mentioned read IDs filtered out.
  • another SAM file as part of the reference-based compression, temporary files as part of samtools' kmer sorting of non-reference compression, etc.

All these can get pretty big and put some I/O stress on the shared file system, but they really aren't used outside of this specific chain of rules and could be stored on node-local temp space instead for more efficient I/O.

(At least, luckily, the intermediate .fastq files can be GZipped on the fly to reduce storage I/O).
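
As a rough illustration of the lifecycle this use case needs, the following Python sketch runs a chain of steps one after the other inside a scratch directory, lets only the final result leave it, and removes the scratch directory when the chain ends. The function names (`run_group`, `align`, `filter_reads`) and the file contents are invented stand-ins; the real rules call aligners and samtools, and none of this is Snakemake's API.

```python
import os
import tempfile


def run_group(steps, final_output, tmp_root=None):
    """Run steps sequentially in a shared scratch dir that is removed afterwards."""
    with tempfile.TemporaryDirectory(dir=tmp_root) as scratch:
        for step in steps:
            step(scratch, final_output)
        # The returned path no longer exists once the with-block exits.
        return scratch


def align(scratch, final_output):
    # Stand-in for an aligner writing a large intermediate SAM file.
    with open(os.path.join(scratch, "aligned.sam"), "w") as handle:
        handle.write("read1\nread2\nhuman_read\n")


def filter_reads(scratch, final_output):
    # Stand-in for the depletion step: only the final result leaves scratch.
    with open(os.path.join(scratch, "aligned.sam")) as handle:
        kept = [line for line in handle if not line.startswith("human")]
    with open(final_output, "w") as handle:
        handle.writelines(kept)


final = os.path.join(tempfile.mkdtemp(), "filtered.txt")
scratch_used = run_group([align, filter_reads], final)

print(os.path.exists(scratch_used))  # intermediates are gone
with open(final) as handle:
    print(handle.read())             # the final output survives
```

Here the cleanup is done by the process that ran the steps, which mirrors the request that the job's own Snakemake instance, not the submitting one, should be responsible for node-local files.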


Labels: enhancement (New feature or request)
