Skip to content

fix: consider checkpoint jobs during update_needrun#4092

Closed
tedil wants to merge 2 commits into
mainfrom
fix/checkpoint-rerun
Closed

fix: consider checkpoint jobs during update_needrun#4092
tedil wants to merge 2 commits into
mainfrom
fix/checkpoint-rerun

Conversation

@tedil

@tedil tedil commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

Resolves #3879

Not sure this is the correct solution, though (might have unforeseeable side consequences, though I believe this should be fine, may just cause some more I/O overhead for checking the checkpoint's dependencies).

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

Summary by CodeRabbit

  • Bug Fixes

    • Improved checkpoint job handling to ensure they are properly considered alongside target jobs during workflow execution evaluation, affecting how missing outputs and updated inputs are handled.
  • Tests

    • Added test to verify that missing checkpoint outputs are correctly regenerated when workflows are re-run.

@coderabbitai

coderabbitai Bot commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 41c7da43-36ea-4bac-a5e8-e8416942e441

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

A single conditional in update_needrun is expanded to include checkpoint jobs alongside target jobs when determining which jobs need to run. A corresponding test verifies that re-running a workflow properly regenerates deleted checkpoint outputs.

Changes

Cohort / File(s) Summary
DAG needrun logic
src/snakemake/dag.py
Modified the condition in update_needrun to treat checkpoint jobs as target jobs (elif job in self.targetjobs or job.is_checkpoint), ensuring missing or modified checkpoint outputs trigger re-execution.
Test coverage
tests/tests.py
Added test_checkpoint_missing_output() to verify that deleting a checkpoint output file causes the checkpoint to re-run and regenerate the missing file on subsequent workflow execution.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title accurately describes the main change: adding checkpoint job consideration during update_needrun function execution.
Description check ✅ Passed The description includes linked issue reference and completes all QC checklist items (test case included and documentation status addressed).
Linked Issues check ✅ Passed The PR directly addresses issue #3879 by expanding update_needrun to consider checkpoint jobs, which resolves the reported problem of undetected missing checkpoint outputs.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the checkpoint detection issue: one-line conditional change in dag.py and a corresponding test case.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/checkpoint-rerun
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can customize the high-level summary generated by CodeRabbit.

Configure the reviews.high_level_summary_instructions setting to provide custom instructions for generating the high-level summary.

@tedil tedil moved this to In review in Snakemake Hackathon 2026 Mar 13, 2026
@tedil tedil requested a review from cademirch March 13, 2026 14:43

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/tests.py`:
- Around line 1632-1648: The test failure is due to a missing test fixture
directory referenced by test_checkpoint_missing_output via
dpath("test_checkpoint_missing_output"); add a new
tests/test_checkpoint_missing_output/ directory to the repo containing the
Snakefile and any input files and expected output files that the test relies on
(ensure the expected output file "output/test_4.txt" is produced by the
Snakefile so the test can remove and verify regeneration), and include any
helper files used by run(...) so the test can execute deterministically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e01e412-40bd-4966-8a8e-52f503addfe0

📥 Commits

Reviewing files that changed from the base of the PR and between 034a9e7 and 0f54a61.

📒 Files selected for processing (2)
  • src/snakemake/dag.py
  • tests/tests.py

Comment thread tests/tests.py
Comment on lines +1632 to +1648
def test_checkpoint_missing_output():
"""test for issue 3879"""
test_dir = dpath("test_checkpoint_missing_output")
tmpdir = run(test_dir, cleanup=False, check_results=False)

try:
missing_file = Path(tmpdir) / "output" / "test_4.txt"
os.remove(missing_file)

run(test_dir, tmpdir=tmpdir, cleanup=False, check_results=False)

assert (
missing_file.exists()
), "The missing checkpoint output was not regenerated."
finally:
shutil.rmtree(tmpdir, ignore_errors=ON_WINDOWS)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the test directory exists with required files
echo "=== Checking test directory ==="
fd -t d "test_checkpoint_missing_output" tests/

echo "=== Listing contents if directory exists ==="
if [ -d "tests/test_checkpoint_missing_output" ]; then
    tree tests/test_checkpoint_missing_output/ 2>/dev/null || ls -laR tests/test_checkpoint_missing_output/
else
    echo "ERROR: Test directory tests/test_checkpoint_missing_output/ not found"
fi

Repository: snakemake/snakemake

Length of output: 209


Add the test directory tests/test_checkpoint_missing_output/ to the repository.

The test calls dpath("test_checkpoint_missing_output"), which requires tests/test_checkpoint_missing_output/ to exist with a Snakefile and expected outputs. This directory is missing from the PR and must be included for the test to run.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/tests.py` around lines 1632 - 1648, The test failure is due to a
missing test fixture directory referenced by test_checkpoint_missing_output via
dpath("test_checkpoint_missing_output"); add a new
tests/test_checkpoint_missing_output/ directory to the repo containing the
Snakefile and any input files and expected output files that the test relies on
(ensure the expected output file "output/test_4.txt" is produced by the
Snakefile so the test can remove and verify regeneration), and include any
helper files used by run(...) so the test can execute deterministically.

cademirch
cademirch previously approved these changes Mar 13, 2026

@cademirch cademirch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a reasonable solution to me.

edit: i re-read the referenced issue. i am unsure about the solution here now. how can we know if (and which) outputs of a checkpoint are missing? i guess we assume that checkpoints are determinstic each run? in any case im confused

@cademirch cademirch self-requested a review March 13, 2026 14:55
@cademirch cademirch dismissed their stale review March 13, 2026 14:56

actually upon re-reading the issue i'm unsure if a solution is even possible?

@tedil

tedil commented Mar 13, 2026

Copy link
Copy Markdown
Contributor Author

After discussion w/ @johanneskoester this may turn out to be problematic when a checkpoint generates temporary files. While this fix would be the only way to notice this, this would also force re-runs for checkpoints with temp outputs, even though technically no changes that'd change the outcome have occurred.
Since this may be unexpected behaviour for most users, perhaps we should instead just document that behaviour and add an example that demonstrates this?

@tedil tedil marked this pull request as draft March 13, 2026 14:58

@johanneskoester johanneskoester left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First decision after live discussion: we think that it rather should be documented that intermediate jobs between the checkpoint and the input function can not be triggered for reruns if the checkpoint output files are missing. Avoiding that would mean to cause reruns of the checkpoint and everything downstream if the checkpoint output is missing (e.g. because of being marked as temp). Workaround: use -R <checkpoint_name> in such cases.

@Hocnonsense Hocnonsense requested review from Hocnonsense and removed request for Hocnonsense March 18, 2026 15:32
Comment thread src/snakemake/dag.py
if is_forced(job):
reason.forced = True
elif job in self.targetjobs:
elif job in self.targetjobs or job.is_checkpoint:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the accurate place to fix this.
Marking every checkpoint as needrun unconditionally here has a broader effect than intended: it forces re-evaluation of checkpoints whose outputs already exist and are up to date, which can cause unnecessary re-runs in incremental workflows.

This is a similar case to #3009, and the fix should be in update_checkpoint_dependencies.

johanneskoester added a commit that referenced this pull request May 29, 2026
…sts (#4124)

will fix #3879. See also the discussion in #4092

### QC
<!-- Make sure that you can tick the boxes below. -->

* [x] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] This is not necessary to update the documentation (e.g. if the
change does neither modify the language nor the behavior or
functionalities of Snakemake).
* [x] I, as a human being, have checked each line of code in this pull
request

### AI-assistance disclosure
<!--
If AI tools were involved in creating this PR, please check all boxes
that apply
below and make sure that you adhere to our AI-assisted contributions
policy:

https://github.com/snakemake/snakemake/blob/main/docs/project_info/contributing.rst
-->
I used AI assistance for:
* [x] Code generation (e.g., when writing an implementation or fixing a
bug)
* [ ] Test/benchmark generation
* [ ] Documentation (including examples)
* [ ] Research and understanding


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Checkpoint handling now skips instances whose outputs are missing,
preventing failures when previously created checkpoint outputs are
deleted and allowing workflows to continue.

* **Tests**
* Added an integration test verifying reruns succeed when checkpoint
outputs are removed after creation.
  * Removed an obsolete checkpoint-related test.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/snakemake/snakemake/pull/4124?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Till Hartmann <till.hartmann@bih-charite.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
@johanneskoester

Copy link
Copy Markdown
Contributor

Closing in favor of the merged #4124. Nevertheless, thanks for bringing this up and helping to find the correct solution!

@github-project-automation github-project-automation Bot moved this from In review to Done in Snakemake Hackathon 2026 Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Deletion of checkpoint outputs is not properly detected

4 participants