Endless rerun of following jobs with a checkpoint output updated. #3559

@Hocnonsense

Snakemake version: >=8.16, still present on 9.3

Describe the bug

Logs

Minimal example

Given the following snakefile:

samples = list(range(2))


rule b:
    input:
        "linkout/a/1.out"
    output:
        touch("inputs/b/1.out")


rule a:
    output:
        touch("inputs/a/{sample}.out")

checkpoint a1:
    output:
        "checkpoint/a1.ls"
    run:
        with open(output[0], "w") as f:
            for sample in samples:
                f.write(f"{sample}\n")


def usea1(wildcards):
    with open(checkpoints.a1.get().output[0], "r") as f:
        lines = [i.strip() for i in f]
    return [f"inputs/a/{i}.out" for i in lines]


rule aggregate:
    input:
        assembly_files=usea1
    output:
        expand("linkout/a/{sample}.out", sample=samples)
    params:
        assembly_dir="linkout/a",
    shell:
        """
        mkdir -p {params.assembly_dir}
        ln -rs {input.assembly_files} {params.assembly_dir}
        """

where:

  • checkpoint a1 defines the sample list
  • function usea1 resolves the actual input files from the checkpoint output
  • rule aggregate takes those files and links them into the output path
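
To make the resolution step concrete, here is a minimal standalone sketch of what usea1 computes once checkpoint a1 has written its file. It runs without Snakemake; `resolve_inputs` is a hypothetical helper name, and skipping empty lines is an assumption that the original function does not make.

```python
def resolve_inputs(checkpoint_text: str) -> list[str]:
    """Mimic usea1: map each line of the checkpoint file to an input path."""
    # Assumption: blank lines are ignored (the original usea1 does not filter them).
    samples = [line.strip() for line in checkpoint_text.splitlines() if line.strip()]
    return [f"inputs/a/{s}.out" for s in samples]


# Content that checkpoint a1 writes for samples = list(range(2))
checkpoint_text = "".join(f"{s}\n" for s in range(2))
resolved = resolve_inputs(checkpoint_text)
print(resolved)
```

These are the same two paths that appear as the input of rule aggregate in the first-run log.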

The first run is successful:

$ snakemake -s snakefile -d testout -c all
Assuming unrestricted shared filesystem usage.
host: Matebook14
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 22
Rules claiming more threads will be scaled down.
Job stats:
job          count
---------  -------
a1               1
aggregate        1
b                1
total            3

Select jobs to execute...
Execute 1 jobs...

[Sun May  4 00:14:00 2025]
localcheckpoint a1:
    output: checkpoint/a1.ls
    jobid: 2
    reason: Missing output files: <TBD>
    resources: tmpdir=/tmp

DAG of jobs will be updated after completion.
[Sun May  4 00:14:01 2025]
Finished jobid: 2 (Rule: a1)
1 of 3 steps (33%) done
Updating checkpoint dependencies.
Select jobs to execute...
Execute 2 jobs...

[Sun May  4 00:14:01 2025]
localrule a:
    output: inputs/a/1.out
    jobid: 6
    reason: Missing output files: inputs/a/1.out
    wildcards: sample=1
    resources: tmpdir=/tmp


[Sun May  4 00:14:01 2025]
localrule a:
    output: inputs/a/0.out
    jobid: 5
    reason: Missing output files: inputs/a/0.out
    wildcards: sample=0
    resources: tmpdir=/tmp

Touching output file inputs/a/1.out.
[Sun May  4 00:14:01 2025]
Finished jobid: 6 (Rule: a)
2 of 5 steps (40%) done
Touching output file inputs/a/0.out.
[Sun May  4 00:14:01 2025]
Finished jobid: 5 (Rule: a)
3 of 5 steps (60%) done
Select jobs to execute...
Execute 1 jobs...

[Sun May  4 00:14:01 2025]
localrule aggregate:
    input: inputs/a/0.out, inputs/a/1.out
    output: linkout/a/0.out, linkout/a/1.out
    jobid: 1
    reason: Missing output files: linkout/a/1.out; Input files updated by another job: inputs/a/1.out, inputs/a/0.out
    resources: tmpdir=/tmp

[Sun May  4 00:14:02 2025]
Finished jobid: 1 (Rule: aggregate)
4 of 5 steps (80%) done
Select jobs to execute...
Execute 1 jobs...

[Sun May  4 00:14:02 2025]
localrule b:
    input: linkout/a/1.out
    output: inputs/b/1.out
    jobid: 0
    reason: Missing output files: inputs/b/1.out; Input files updated by another job: linkout/a/1.out
    resources: tmpdir=/tmp

Touching output file inputs/b/1.out.
[Sun May  4 00:14:02 2025]
Finished jobid: 0 (Rule: b)
5 of 5 steps (100%) done
Complete log(s): testout/.snakemake/log/2025-05-04T001400.239820.snakemake.log

However, rules are sometimes updated, and as a result the output of checkpoint a1 may be updated as well. This can be simulated with:
touch testout/checkpoint/a1.ls

After that, rule b is rerun endlessly:

$ touch testout/checkpoint/a1.ls
(snakemake-dev) [hwrn@Matebook14] -- 2025-05-04 00:14:04 -- (main)=
$ snakemake -s snakefile -d testout -c all -n
host: Matebook14
Building DAG of jobs...
Updating checkpoint dependencies.
Job stats:
job      count
-----  -------
b            1
total        1


[Sun May  4 00:14:08 2025]
rule b:
    input: linkout/a/1.out
    output: inputs/b/1.out
    jobid: 0
    reason: Input files updated by another job: linkout/a/1.out
    resources: tmpdir=<TBD>

Job stats:
job      count
-----  -------
b            1
total        1

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        b
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

Additional context

https://github.com/snakemake/snakemake/blob/main/src/snakemake/dag.py#L1399-L1408

After digging into the Python code, I found that job.input points to checkpoint/a1.ls instead of the real input files, so the job aggregate is always considered to need rerunning ('Updated input files: <TBD>'), and all downstream jobs are marked for update. The job aggregate itself then silently disappears from the plan here: https://github.com/snakemake/snakemake/blob/main/src/snakemake/dag.py#L1527-L1554
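
The suspected mechanism can be illustrated with a toy model. This is not Snakemake's code: `is_outdated` is a hypothetical function, the rerun decision is assumed to be a plain modification-time comparison, and the timestamps are made up. It only shows why comparing against the checkpoint file instead of the resolved inputs would flag aggregate after a touch.

```python
def is_outdated(inputs, outputs, mtime):
    """Toy rule: a job is outdated if any input is newer than its oldest output."""
    oldest_output = min(mtime[o] for o in outputs)
    return any(mtime[i] > oldest_output for i in inputs)


# Made-up timestamps: only the checkpoint output was touched after the first run.
mtime = {
    "checkpoint/a1.ls": 100,
    "inputs/a/0.out": 10,
    "inputs/a/1.out": 10,
    "linkout/a/0.out": 20,
    "linkout/a/1.out": 20,
}
outputs = ["linkout/a/0.out", "linkout/a/1.out"]

# Input still pointing at the checkpoint file: aggregate looks outdated.
unresolved = is_outdated(["checkpoint/a1.ls"], outputs, mtime)
# Input resolved to the real files: aggregate is up to date.
resolved = is_outdated(["inputs/a/0.out", "inputs/a/1.out"], outputs, mtime)
print(unresolved, resolved)
```

Under these assumptions, only the unresolved comparison reports an update, which matches the observation that b is scheduled because of a job that never actually runs.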

Labels

bug
