Skip to content

fix: do not rerun when checkpoint job missing but downstream file exists#4124

Merged
johanneskoester merged 7 commits into
mainfrom
fix/checkpoint-rerun2
May 29, 2026
Merged

fix: do not rerun when checkpoint job missing but downstream file exists#4124
johanneskoester merged 7 commits into
mainfrom
fix/checkpoint-rerun2

Conversation

@Hocnonsense

@Hocnonsense Hocnonsense commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

will fix #3879. See also the discussion in #4092

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • This is not necessary to update the documentation (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
  • I, as a human being, have checked each line of code in this pull request

AI-assistance disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation
  • Documentation (including examples)
  • Research and understanding

Summary by CodeRabbit

  • Bug Fixes

    • Checkpoint handling now skips instances whose outputs are missing, preventing failures when checkpoint outputs are deleted and allowing workflows to continue.
  • Tests

    • Added an integration test that verifies reruns succeed when checkpoint outputs are removed after creation.
    • Removed an obsolete checkpoint-related test.

Review Change Stack

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

Caution

Review failed

Failed to post review comments

📝 Walkthrough

Walkthrough

Require checkpoint outputs to both not need rerun and be present before enqueuing instances; add an integration test that deletes checkpoint outputs between runs to verify they are regenerated; remove an outdated checkpoint test.

Changes

DAG checkpoint logic

Layer / File(s) Summary
Checkpoint enqueue decision
src/snakemake/dag.py
Added explicit job_queue: Dict[Job, Set[Job]] annotation and changed filtering in update_checkpoint_dependencies to add checkpoint instances only when not self.needrun(checkpoint) AND await is_output_present(checkpoint).

Tests

Layer / File(s) Summary
New checkpoint integration test
tests/test_checkpoint_missing_output/Snakefile, tests/test_checkpoint_missing_output/expected-results/output/selection.txt
Add a checkpoint-producing workflow and expected selection.txt result; aggregation selects the max-valued checkpoint output.
Test harness update
tests/tests.py
Add test_checkpoint_missing_output which runs the workflow, deletes checkpoint outputs between runs, and verifies reruns succeed and regenerate missing outputs.
Removed legacy test
tests/test_checkpoint_missout/Snakefile
Delete an obsolete checkpoint test workflow that previously covered related behavior.

Sequence Diagram(s)

sequenceDiagram
    participant Runner as Test/Runner
    participant DAG as DAG.update_checkpoint_dependencies
    participant Checkpoint as CheckpointJob
    participant FS as Filesystem

    Runner->>DAG: request candidate checkpoint jobs
    DAG->>Checkpoint: for each candidate, evaluate needrun(checkpoint)
    alt needrun == True
        Checkpoint-->>DAG: flagged for rerun (skip enqueue)
    else needrun == False
        DAG->>FS: await is_output_present(checkpoint)
        alt output exists
            FS-->>DAG: present
            DAG-->>Checkpoint: add to job_queue
        else output missing
            FS-->>DAG: missing
            DAG-->>Checkpoint: do not add (will trigger rerun)
        end
    end
    DAG-->>Runner: updated job_queue
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: checkpoint jobs are not rerun when their outputs are missing but downstream files exist, which aligns with the core modification in DAG.update_checkpoint_dependencies.
Linked Issues check ✅ Passed The code changes directly address issue #3879 by adding a check for await is_output_present(checkpoint) in the checkpoint dependency update logic, ensuring missing checkpoint outputs trigger reruns as required.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the checkpoint missing output detection issue: DAG logic modification, new test case, expected results, and old test replacement are all aligned with the stated objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/checkpoint-rerun2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Hocnonsense Hocnonsense changed the title fix: fix: do not rerun when checkpoint job missing but downstream file exists Mar 20, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/tests.py (1)

1628-1639: Test validates the fix, but comments may be misleading.

The test correctly verifies that the workflow doesn't crash when checkpoint outputs are deleted between runs, which addresses issue #3879. However, the comments "should not fail (but nothing to do)" are potentially misleading — when checkpoint outputs are deleted, the checkpoint should be re-executed, so there is actually work to do.

Consider verifying that deleted outputs are actually regenerated after rerun to strengthen the test:

💡 Optional: Verify outputs are regenerated
 def test_checkpoint_missing_output():
     """test for issue 3879, also covers 3009"""
     # normal run to create the checkpoint output and final output
     tmpdir = run(dpath("test_checkpoint_missing_output"), cleanup=False)
     assert tmpdir
-    # should not fail (but nothing to do)
+    # Delete checkpoint output; rerun should regenerate it
     (tmpdir / "output" / "test_1.txt").unlink()
     run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir)
-    # should not fail (but nothing to do)
+    assert (tmpdir / "output" / "test_1.txt").exists(), "test_1.txt should be regenerated"
+    # Delete another checkpoint output; rerun should regenerate it
     (tmpdir / "output" / "test_0.txt").unlink()
     run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir)
+    assert (tmpdir / "output" / "test_0.txt").exists(), "test_0.txt should be regenerated"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tests.py` around lines 1628 - 1639, The test_comment is misleading and
the test should assert that deleted checkpoint outputs are regenerated: in
test_checkpoint_missing_output update the comments to reflect that deleting
checkpoint outputs should trigger re-execution, and after each
run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir) add
assertions that (tmpdir / "output" / "test_1.txt") and (tmpdir / "output" /
"test_0.txt") respectively exist (or contain expected contents) to verify the
checkpoint was rerun; reference the test function name
test_checkpoint_missing_output, the run(...) helper, and the tmpdir / "output" /
"<file>" paths when adding the checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/tests.py`:
- Around line 1628-1639: The test_comment is misleading and the test should
assert that deleted checkpoint outputs are regenerated: in
test_checkpoint_missing_output update the comments to reflect that deleting
checkpoint outputs should trigger re-execution, and after each
run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir) add
assertions that (tmpdir / "output" / "test_1.txt") and (tmpdir / "output" /
"test_0.txt") respectively exist (or contain expected contents) to verify the
checkpoint was rerun; reference the test function name
test_checkpoint_missing_output, the run(...) helper, and the tmpdir / "output" /
"<file>" paths when adding the checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 651adaa0-e9b3-444e-b3a3-2039c86d9178

📥 Commits

Reviewing files that changed from the base of the PR and between 2432a6c and c502139.

📒 Files selected for processing (7)
  • src/snakemake/dag.py
  • tests/test_checkpoint_missing_output/Snakefile
  • tests/test_checkpoint_missing_output/expected-results/output/selection.txt
  • tests/test_checkpoint_missout/Snakefile
  • tests/test_checkpoint_missout/expected-results/final.txt
  • tests/test_checkpoint_missout/problem.txt
  • tests/tests.py
💤 Files with no reviewable changes (1)
  • tests/test_checkpoint_missout/Snakefile

@github-actions

Copy link
Copy Markdown
Contributor

Please format your code with pixi run format

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/tests.py (1)

1628-1639: Test lacks assertions verifying checkpoint outputs are regenerated.

The test validates that the workflow doesn't crash when checkpoint outputs are missing, but doesn't assert that the deleted files (test_1.txt, test_0.txt) were actually regenerated after each rerun. Given the PR objective is to ensure "checkpoint outputs are properly detected and regenerated when deleted," consider adding assertions to verify the files exist after each run.

Additionally, the comments "target file exists so nothing to do" are misleading since the files are being deleted and the checkpoint should be re-run to regenerate them.

💡 Suggested improvement
 def test_checkpoint_missing_output():
     """test for issue 3879, also covers 3009"""
     # normal run to create the checkpoint output and final output
     tmpdir = run(dpath("test_checkpoint_missing_output"), cleanup=False)
     assert tmpdir
-    # should not fail (target file exists so nothing to do)
+    # delete checkpoint output and verify it gets regenerated
     (tmpdir / "output" / "test_1.txt").unlink()
     run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir)
-    # should not fail (target file exists so nothing to do)
+    assert (tmpdir / "output" / "test_1.txt").exists(), "test_1.txt should be regenerated"
+    # delete another checkpoint output and verify regeneration
     (tmpdir / "output" / "test_0.txt").unlink()
     run(dpath("test_checkpoint_missing_output"), cleanup=False, tmpdir=tmpdir)
+    assert (tmpdir / "output" / "test_0.txt").exists(), "test_0.txt should be regenerated"
+    shutil.rmtree(tmpdir, ignore_errors=ON_WINDOWS)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tests.py` around lines 1628 - 1639, The test function
test_checkpoint_missing_output currently deletes checkpoint output files but
doesn't assert they are regenerated; update test_checkpoint_missing_output to
assert existence of (tmpdir / "output" / "test_1.txt") after the first rerun and
assert existence of (tmpdir / "output" / "test_0.txt") after the second rerun
(using the existing run(...) helper and dpath(...) to locate the files), and
revise the misleading comments to say that files are deleted and should be
regenerated rather than "target file exists so nothing to do".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/tests.py`:
- Around line 1628-1639: The test function test_checkpoint_missing_output
currently deletes checkpoint output files but doesn't assert they are
regenerated; update test_checkpoint_missing_output to assert existence of
(tmpdir / "output" / "test_1.txt") after the first rerun and assert existence of
(tmpdir / "output" / "test_0.txt") after the second rerun (using the existing
run(...) helper and dpath(...) to locate the files), and revise the misleading
comments to say that files are deleted and should be regenerated rather than
"target file exists so nothing to do".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 41e6491f-054c-4f95-a328-06b7a604543d

📥 Commits

Reviewing files that changed from the base of the PR and between c502139 and 49d3451.

📒 Files selected for processing (1)
  • tests/tests.py

@themavik themavik left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gating checkpoint enqueue on is_output_present fixes the stale-node case. nit: confirm dropping test_checkpoint_missout does not leave 3009/3879 uncovered.

@Hocnonsense Hocnonsense added the bug Something isn't working label Mar 27, 2026
Comment thread src/snakemake/dag.py
@johanneskoester johanneskoester enabled auto-merge (squash) May 29, 2026 13:39
@johanneskoester johanneskoester merged commit a060b93 into main May 29, 2026
60 checks passed
@johanneskoester johanneskoester deleted the fix/checkpoint-rerun2 branch May 29, 2026 13:55
@github-project-automation github-project-automation Bot moved this from In review to Done in Snakemake Hackathon 2026 May 29, 2026
johanneskoester pushed a commit that referenced this pull request May 29, 2026
🤖 I have created a release *beep* *boop*
---


##
[9.21.1](v9.21.0...v9.21.1)
(2026-05-29)


### Bug Fixes

* add default json function to benchmarks
([#4128](#4128))
([41fab22](41fab22))
* do not rerun when checkpoint job missing but downstream file exists
([#4124](#4124))
([a060b93](a060b93))
* ensure that error logs contain all available details
([#4183](#4183))
([74a86e9](74a86e9))
* handle missing pss attribute in benchmark on Windows
([#4160](#4160))
([da52080](da52080))
* implement Resources.setdefault
([#3968](#3968))
([2413e99](2413e99))
* reporting remote nodes number
([#3978](#3978))
([8c534f0](8c534f0))
* resolve pathvars before constructing storage queries
([#3969](#3969))
([bd15237](bd15237))
* storage temp() file cleanup with RemoteProvider
([#4189](#4189))
([898bad1](898bad1))
* tolerate FileNotFoundError in drop_iocache
([#4153](#4153))
([#4191](#4191))
([ce26b28](ce26b28))


### Documentation

* Added guide on debugging workflows
([#4029](#4029))
([3d052ae](3d052ae))
* **cli:** Remove broken ref bold markup
([#4204](#4204))
([1200ebf](1200ebf))
* remove duplicated resources attribute in rules.rst
([#4190](#4190))
([6c8ecdd](6c8ecdd))
* **rules:** Update script type hint advice
([#4193](#4193))
([6108712](6108712))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Deletion of checkpoint outputs is not properly detected

4 participants