Description
Is your feature request related to a problem? Please describe.
According to the official Snakemake documentation:
> There is currently no way of doing this in Snakemake, but a possible workaround involves the `shadow` directive and setting the `--shadow-prefix` flag to e.g. `/scratch`.
Shadow is indeed one possible solution, but...
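For reference, the documented workaround looks roughly like this (a minimal sketch; the rule name and shell command are made up for illustration):

```python
# Sketch of the shadow-based workaround (hypothetical rule, for illustration).
# Launch with: snakemake --shadow-prefix /scratch
rule align:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "aligned/{sample}.sam"
    shadow: "minimal"  # the rule executes in a shadow directory under --shadow-prefix
    shell:
        "bwa mem ref.fa {input} > {output}"
```

This keeps the rule's scratch traffic off the shared file system, but only within a single rule, which is exactly the limitation discussed below.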
Describe the solution you'd like
I would like Snakemake to introduce some kind of `node_temp()` modifier.
If you squint at it, a lot of the necessary work already exists in Snakemake:
- The `pipe()` modifier already specifies special files that:
  - make it mandatory to run both the producing rule and the consuming rule within the same group, on the same node.
    - The difference is that `pipe()` requires both rules to run in parallel (and considers the sum of the required resources),
    - whereas a putative `node_temp()` modifier would simply require both rules to run one after the other (and consider the max of the required resources), just like any other `group:` rules' property.
  - are expected to only be alive for the duration of the group and not outlive it/exist outside of it.
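To make the proposal concrete, a group using the putative modifier could look like this (hypothetical syntax: `node_temp()` does not exist in Snakemake, and the rule names and shell commands are made up):

```python
# HYPOTHETICAL: node_temp() is the proposed modifier, not an existing feature.
rule map_to_virus:
    input:
        "raw/{sample}.fastq.gz"
    output:
        node_temp("aligned/{sample}.sam")  # lives only on the node, only for the group
    group: "deplete"
    shell:
        "bwa mem virus.fa {input} > {output}"

rule extract_rejects:
    input:
        "aligned/{sample}.sam"
    output:
        "rejects/{sample}.fastq.gz"
    group: "deplete"  # same group => same node; rules run one after the other
    shell:
        "samtools fastq -f 4 {input} | gzip > {output}"
```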
- The `temp()` modifier already specifies files that:
  - will not be kept long term,
  - will be cleaned up after all the rules finish.
  - The differences being (due to the scratch directory being node-local):
    - the main Snakemake process that dispatches to the cluster should not wait for the files to appear (if the files aren't visible by the end of `--latency-wait`, do not consider it a failure);
    - the responsibility of cleaning up shouldn't be on the shoulders of the main Snakemake process (the one that submits jobs to the cluster), but on the job's Snakemake process (the instance started at the entry point of the cluster job, which runs all the rules that are members of the group), as the latter has visibility into the node-local storage.
- The resource `tmpdir` already takes care of passing/setting up a temp directory, so it should be possible to directly use it as a prefix for `node_temp()` (instead of requiring the user to add one manually, as in the case of shared scratch).
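For illustration, this is how the existing `tmpdir` resource is already usable inside a rule today (the rule itself is made up); a `node_temp()` prefix could presumably be derived from the same mechanism:

```python
rule sort_bam:
    input:
        "a.bam"
    output:
        "a.sorted.bam"
    shell:
        # resources.tmpdir is provided by Snakemake's standard tmpdir resource;
        # it can be pointed at node-local scratch, e.g. via --default-resources.
        "samtools sort -T {resources.tmpdir}/sort_a {input} -o {output}"
```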
Describe alternatives you've considered
- `shadow:` does cover some possible use cases (uses within a single rule), but not all (e.g. passing temporary files between rules).
- `temp()` will not work, because node-local storage only exists during the execution of the group, not outside of it.
  - Thus, as mentioned before, Snakemake errors out because it is waiting for a file that will never show up by the end of `--latency-wait`.
- `pipe()` has some interesting properties already (Snakemake automatically groups the rules together), but it is specifically for pipes, not regular files, and it mandates that the rules run in parallel, which is not required for a regular file.
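For comparison, this is what `pipe()` already does today (rule names and commands made up): both rules are automatically placed in one group on one node and must run concurrently, with the intermediate being a named pipe rather than a regular file.

```python
rule produce:
    output:
        pipe("stream/{sample}.txt")  # a named pipe, consumed while being written
    shell:
        "seq 1 100 > {output}"

rule consume:
    input:
        "stream/{sample}.txt"
    output:
        "counts/{sample}.txt"
    shell:
        "wc -l < {input} > {output}"
```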
Additional context
An example use case is a Snakemake rule file that we have in V-pipe for depleting human-mapping reads from viral sequencing raw data (e.g. before uploading it onto ENA).
This generates a few intermediates:
- a SAM file after aligning to the virus reference, from which to extract the alignment rejects;
- the .fastq.gz file holding the alignment rejects;
- a SAM file after aligning to the host (Homo sapiens) reference, from which to extract the list of read IDs to deplete;
- a copy of the original .fastq.gz files with the above-mentioned read IDs filtered out;
- another SAM file as part of the reference-based compression, temporary files from samtools' k-mer sorting for non-reference compression, etc.
All of these can get pretty big and put some I/O stress on the shared file system, but they really aren't used outside of this specific chain of rules and could be stored on node-local temp space instead for more efficient I/O.
(At least, luckily, the intermediate .fastq files can be gzipped on the fly to reduce storage I/O.)