Description
Is your feature request related to a problem? Please describe.
According to the official Snakemake documentation:
> There is currently no way of doing this in Snakemake, but a possible workaround involves the `shadow` directive and setting the `--shadow-prefix` flag to e.g. `/scratch`.
Shadow is indeed one possible solution, but...
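For reference, the documented workaround looks roughly like this (a minimal sketch; the rule name and shell command are made up for illustration):

```python
# Sketch of the shadow-based workaround (hypothetical rule, for illustration).
# Launch with: snakemake --shadow-prefix /scratch
rule align:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "aligned/{sample}.sam"
    shadow: "minimal"  # the rule executes in a shadow directory under --shadow-prefix
    shell:
        "bwa mem ref.fa {input} > {output}"
```

This keeps the rule's scratch traffic off the shared file system, but only within a single rule, which is exactly the limitation discussed below.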
Describe the solution you'd like
I would like Snakemake to introduce some kind of `node_temp()` modifier.
If you squint at it, a lot of the necessary work already exists in Snakemake:
- The `pipe()` modifier already specifies special files that:
  - make it mandatory to run both the producing rule and the consuming rule within the same group, on the same node.
    - The difference is that `pipe()` requires both rules to run in parallel (and considers the sum of the required resources),
    - whereas a putative `node_temp()` modifier would simply require both rules to run one after the other (and consider the max of the required resources), just like any other `group:` rules' property.
  - are expected to only be alive for the duration of the group and not outlive it/exist outside of it.
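To make the proposal concrete, a group using the putative modifier could look like this (hypothetical syntax: `node_temp()` does not exist in Snakemake, and the rule names and shell commands are made up):

```python
# HYPOTHETICAL: node_temp() is the proposed modifier, not an existing feature.
rule map_to_virus:
    input:
        "raw/{sample}.fastq.gz"
    output:
        node_temp("aligned/{sample}.sam")  # lives only on the node, only for the group
    group: "deplete"
    shell:
        "bwa mem virus.fa {input} > {output}"

rule extract_rejects:
    input:
        "aligned/{sample}.sam"
    output:
        "rejects/{sample}.fastq.gz"
    group: "deplete"  # same group => same node; rules run one after the other
    shell:
        "samtools fastq -f 4 {input} | gzip > {output}"
```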
- The `temp()` modifier already specifies files that:
  - will not be kept long term,
  - will be cleaned up after all the rules finish.
  - The differences being (due to the scratch directory being node-local):
    - the main Snakemake process that dispatches to the cluster should not wait for the files to appear (if the files aren't visible by the end of `--latency-wait`, do not consider it a failure);
    - the responsibility of cleaning up shouldn't be on the shoulders of the main Snakemake process (the one that submits jobs to the cluster), but on the job's Snakemake process (the instance started at the entry point of the cluster job, which runs all the rules that are members of the group), as the latter has visibility into the node-local storage.
- The resource `tmpdir` already takes care of passing/setting up a temp directory, so it should be possible to directly use it as a prefix for `node_temp()` (instead of requiring the user to add one manually, as in the case of shared scratch).
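For illustration, this is how the existing `tmpdir` resource is already usable inside a rule today (the rule itself is made up); a `node_temp()` prefix could presumably be derived from the same mechanism:

```python
rule sort_bam:
    input:
        "a.bam"
    output:
        "a.sorted.bam"
    shell:
        # resources.tmpdir is provided by Snakemake's standard tmpdir resource;
        # it can be pointed at node-local scratch, e.g. via --default-resources.
        "samtools sort -T {resources.tmpdir}/sort_a {input} -o {output}"
```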
Describe alternatives you've considered
- `shadow:` does cover some possible use cases (uses within a single rule), but not all (e.g. passing temporary files between rules).
- `temp()` will not work, because node-local storage only exists during the execution of the group, not outside of it.
  - Thus, as mentioned before, Snakemake errors out because it is waiting for a file that will never show up by the end of `--latency-wait`.
- `pipe()` has some interesting properties already (Snakemake automatically groups the rules together), but it is specifically for pipes, not regular files, and it mandates that the rules run in parallel, which is not required for a regular file.
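For comparison, this is what `pipe()` already does today (rule names and commands made up): both rules are automatically placed in one group on one node and must run concurrently, with the intermediate being a named pipe rather than a regular file.

```python
rule produce:
    output:
        pipe("stream/{sample}.txt")  # a named pipe, consumed while being written
    shell:
        "seq 1 100 > {output}"

rule consume:
    input:
        "stream/{sample}.txt"
    output:
        "counts/{sample}.txt"
    shell:
        "wc -l < {input} > {output}"
```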
Additional context
An example use case is a Snakemake rule file that we have in V-pipe for depleting human-mapping reads from viral sequencing raw data (e.g. before uploading it onto ENA).
This generates a few intermediates:
- a SAM file after aligning to the virus reference, from which to extract the alignment rejects;
- the .fastq.gz file holding the alignment rejects;
- a SAM file after aligning to the host (Homo sapiens) reference, from which to extract the list of read IDs to deplete;
- a copy of the original .fastq.gz files with the above-mentioned read IDs filtered out;
- another SAM file as part of the reference-based compression, temporary files from samtools' k-mer sorting for non-reference compression, etc.
All of these can get pretty big and put some I/O stress on the shared file system, but they really aren't used outside of this specific chain of rules and could be stored on node-local temp space instead for more efficient I/O.
(At least, luckily, the intermediate .fastq files can be gzipped on the fly to reduce storage I/O.)