Temp files are not deleted for rule before checkpoint that is rerun after checkpoint with new data.  #2982

@LeeBergstrand

Snakemake version

Snakemake Version: 8.16.0

Bug first occurs at version 8.6.0
The bug does not occur in 8.5.5

Describe the bug

I have a rule that polishes a nanopore genome. This polishing rule feeds a series of other rules, which trigger a checkpoint. After the DAG is regenerated following the checkpoint, the polishing rule runs again on a subset of contigs in a different directory. The polishing rule uses wildcards to select which directory it runs in: either the original polishing directory (before the checkpoint) or a later circularization directory (after the checkpoint).

Snakemake deletes the temp files only for the first invocation of the polishing rule after the checkpoint (in the circularization directory); it does not delete the temp files from the rule's initial invocation (in the polishing directory). It also does not delete any temp files generated before the checkpoint, even if they are not needed after the checkpoint (this may be specific to my use case). Running snakemake with the --delete-temp-output flag deletes these files.

Before 8.6.0, Snakemake deleted the temp files for both invocations of the rule, which is the expected behavior for this use case because these temp files are not needed for later steps, even after DAG regeneration.

This bug appears to have been introduced between 8.5.5 and 8.6.0. Specifically, the release notes for 8.6.0 mention #2732 being fixed by #2737 (fix premature deletion of temp files in combination with checkpoints). I believe the bug was introduced by this pull request.

Logs

TODO

Minimal example

TODO: The pipeline is fairly complex, so producing a simplified example may take some time.

Here is the complete rule. The _R1.clean.sam and _R2.clean.sam files, as well as temp files from the rules leading up to this rule, are not deleted when they are produced by the rule's invocation before the checkpoint. However, the same files are deleted for the rule's invocation after the checkpoint, which runs in a second directory.

rule polish_polypolish:
    input:
        contigs="{sample}/{step}/polypolish/input/{sample}_input.fasta",
        mapping_r1 = "{sample}/{step}/polypolish/{sample}_R1.sam",
        mapping_r2 = "{sample}/{step}/polypolish/{sample}_R2.sam"
    output:
        mapping_clean_r1 = temp("{sample}/{step}/polypolish/{sample}_R1.clean.sam"),
        mapping_clean_r2 = temp("{sample}/{step}/polypolish/{sample}_R2.clean.sam"),
        polished = "{sample}/{step}/polypolish/{sample}_polypolish.fasta",
        debug = temp("{sample}/{step}/polypolish/polypolish.debug.log"),
        debug_stats = "{sample}/stats/{step}/polypolish_changes.log"
    conda:
        "../envs/polypolish.yaml"
    log:
        "{sample}/logs/{step}/polypolish.log"
    benchmark:
        "{sample}/benchmarks/{step}/polypolish.txt"
    params:
        careful = "--careful" if CAREFUL_SHORT_READ_POLISHING else ""
    shell:
        """
        printf "### Polypolish insert filter ###\n" >> {log}
        polypolish filter --in1 {input.mapping_r1} --in2 {input.mapping_r2} \
          --out1 {output.mapping_clean_r1} --out2 {output.mapping_clean_r2} 2>> {log}
          
        printf "\n\n### Polypolish ###\n" >> {log}
        polypolish polish --debug {output.debug} {params.careful} {input.contigs}  \
          {output.mapping_clean_r1} {output.mapping_clean_r2} 2>> {log} |
          seqtk seq -A -C -l 60 > {output.polished} 2>> {log}
          
        head -n 1 {output.debug} > {output.debug_stats}
        grep changed {output.debug} >> {output.debug_stats}
        
        printf "\n\n### Done. ###\n"
        """

Additional context

Here are the snakemake files that are involved in the issue.

https://github.com/rotary-genomics/rotary/blob/develop_lee/rotary/rules/polish.smk
https://github.com/rotary-genomics/rotary/blob/develop_lee/rotary/rules/circularize.smk

The polishing rule is: https://github.com/rotary-genomics/rotary/blob/44026f07772f26591341709a8eb592c00a4a36aa/rotary/rules/polish.smk#L187C6-L187C23

Which gets reinvoked by: https://github.com/rotary-genomics/rotary/blob/44026f07772f26591341709a8eb592c00a4a36aa/rotary/rules/circularize.smk#L268C6-L268C39

Labels: bug (Something isn't working)
