Skip to content

fix: fix checkpoint handling corner cases (#3870 and #3559)#4015

Merged
johanneskoester merged 11 commits intomainfrom
fix/uncreatable_checkpoint_input
Mar 10, 2026
Merged

fix: fix checkpoint handling corner cases (#3870 and #3559)#4015
johanneskoester merged 11 commits intomainfrom
fix/uncreatable_checkpoint_input

Conversation

@Hocnonsense
Copy link
Copy Markdown
Contributor

@Hocnonsense Hocnonsense commented Mar 9, 2026

will fix #3870 and fix #3559

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The change does neither modify the language nor the behavior or functionalities of Snakemake.

Summary by CodeRabbit

  • Bug Fixes

    • Improved checkpoint eviction handling so outputs evicted from memory are correctly considered when computing dynamic downstream targets, making checkpoint-driven workflows more reliable under memory pressure.
  • Tests

    • Added a test workflow and unit test covering checkpoints that produce dynamic targets, validating behavior for uncreatable or hard-to-resolve checkpoint inputs.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Tracks outputs of checkpoint jobs evicted when their inputs are missing by adding DAG._evicted_checkpoint_outputs and propagating those outputs into created_output during checkpoint dependency updates; adds a test exercising a checkpoint with an uncreatable input.

Changes

Cohort / File(s) Summary
Checkpoint eviction tracking
src/snakemake/dag.py
Add public attribute DAG._evicted_checkpoint_outputs: Set; on update() record outputs of checkpoint jobs that raise MissingInputException; merge evicted outputs into created_output in update_checkpoint_dependencies().
New uncreatable-checkpoint test
tests/test_uncreatable_checkpoint_input/Snakefile, tests/tests.py
Add a Snakefile reproducing a checkpoint with an uncreatable input and a dynamic consumer; add test_uncreatable_checkpoint_input() to invoke that workflow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • johanneskoester
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 28.57% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The PR description follows the template but lacks implementation details explaining how the fix addresses the reported issues. Add a brief explanation of how the changes in dag.py fix the uncreatable checkpoint input issue, particularly how _evicted_checkpoint_outputs addresses the silent failure described in issue #3870.
✅ Passed checks (3 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The changes in dag.py track evicted checkpoint outputs and propagate them into created_output; the test case reproduces issue #3870, validating that checkpoints with uncreatable inputs now work correctly [#3870].
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the checkpoint eviction and uncreatable input issue; the new attribute, logic, and test are all within scope of the stated objectives.
Title check ✅ Passed The PR title mentions fixing checkpoint handling corner cases and references issues #3870 and #3559, which aligns with the actual changes that address uncreatable checkpoint input issues.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/uncreatable_checkpoint_input

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 9, 2026

Please format your code with pixi run format

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/snakemake/dag.py`:
- Around line 1357-1360: The current logic adds files to
_evicted_checkpoint_outputs just because out.exists(), which can let stale files
satisfy downstream inputs; change the guard so we only record outputs for
checkpoints that have actually completed previously—for example, require that
the output is present in the checkpoint's recorded created_output (or otherwise
verify checkpoint completion via Checkpoint.get()/a checkpoint.completed flag)
before adding to self._evicted_checkpoint_outputs; update the block around
job.is_checkpoint/job.output/out.exists() to check that authoritative completion
evidence (e.g. out in checkpoint.created_output) instead of plain filesystem
existence.

In `@tests/test_uncreatable_checkpoint_input/Snakefile`:
- Around line 1-2: The Snakefile currently mutates the checkpoint output by
opening and writing "a.txt" during parsing (the two-line block that writes
"a\nb\n"), which updates its mtime every Snakemake run; remove that file-writing
from the Snakefile and instead perform the setup in the test harness or an
explicit preparation step (e.g., test setup function or fixture) that
creates/updates "a.txt" before invoking Snakemake so checkpoint behavior is
modeled by external changes rather than by Snakefile evaluation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b7e94157-d903-4bfa-93e2-7e31f5b40aa6

📥 Commits

Reviewing files that changed from the base of the PR and between b88171c and d399d48.

📒 Files selected for processing (3)
  • src/snakemake/dag.py
  • tests/test_uncreatable_checkpoint_input/Snakefile
  • tests/test_uncreatable_checkpoint_input/expected-results/files/b

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/tests.py`:
- Around line 1499-1500: Update test_uncreatable_checkpoint_input to reproduce
the rerun regression: perform the first successful run via
run(dpath("test_uncreatable_checkpoint_input")) while preserving the workdir
(use the existing preserve_workdir helper or avoid cleanup), then update the
timestamp on the checkpoint input file with os.utime(<path_to_a.txt>, None), and
then invoke a second dry-run/run (e.g., run(..., dry_run=True) or run(...)
again) and assert that the downstream rule is NOT rescheduled by checking the
run output/return (e.g., assert "Input files updated by another job" not in
output or ensure no downstream job appears in the scheduling output). Ensure you
reference the test function name test_uncreatable_checkpoint_input, use run and
dpath to locate the test tree, and touch a.txt with os.utime.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 473fbcde-86a6-4bc6-b660-aedc8aa9178a

📥 Commits

Reviewing files that changed from the base of the PR and between d399d48 and 16b7444.

📒 Files selected for processing (2)
  • src/snakemake/dag.py
  • tests/tests.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/snakemake/dag.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/snakemake/dag.py`:
- Around line 1357-1363: The async method
self.workflow.persistence.incomplete(job) is being called without awaiting,
causing is_incomplete to be a coroutine and the condition to behave incorrectly;
update the checkpoint handling in the block that checks job.is_checkpoint (which
iterates over job.output and calls out.exists()) to await the incomplete call
(i.e., use await self.workflow.persistence.incomplete(job)) and ensure has_meta
is evaluated before awaiting incomplete if you want to avoid unnecessary awaits;
this will allow the subsequent if not is_incomplete: branch to run and correctly
add outputs to self._evicted_checkpoint_outputs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e96b5863-729e-458d-b18c-0263147d41cc

📥 Commits

Reviewing files that changed from the base of the PR and between 1b9b43e and dc3dc96.

📒 Files selected for processing (1)
  • src/snakemake/dag.py

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/snakemake/dag.py (1)

1357-1363: ⚠️ Potential issue | 🟠 Major

Don’t ignore incomplete markers when metadata is missing.

Line 1361 only checks persistence.incomplete(job) if has_metadata(job) is true. But incomplete(job) already detects existing outputs that still have incomplete markers, so a checkpoint that failed before writing metadata can pass this guard, get added here, and then be promoted into checkpoints.created_output at Lines 2187-2189. That lets Checkpoint.get() resolve against incomplete files instead of raising.

Proposed fix
         if missing_input:
             if job.is_checkpoint:
-                for out in job.output:
-                    if await out.exists():
-                        has_meta = self.workflow.persistence.has_metadata(job)  # type: ignore[reportOptionalMemberAccess]
-                        is_incomplete = has_meta and await self.workflow.persistence.incomplete(job)  # type: ignore[reportOptionalMemberAccess]
-                        if not is_incomplete:
-                            self._evicted_checkpoint_outputs.add(out)
+                is_incomplete = bool(
+                    await self.workflow.persistence.incomplete(job)  # type: ignore[reportOptionalMemberAccess]
+                )
+                if not is_incomplete:
+                    for out in job.output:
+                        if await out.exists():
+                            self._evicted_checkpoint_outputs.add(out)
             self.delete_job(job, recursive=False)  # delete job from tree
             raise MissingInputException(job, missing_input)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/snakemake/dag.py` around lines 1357 - 1363, The code currently only calls
self.workflow.persistence.incomplete(job) when has_metadata(job) is true, which
allows checkpoint outputs without metadata to be treated as complete; instead
call await self.workflow.persistence.incomplete(job) unconditionally for each
out in job.output (i.e., remove the has_metadata gate and compute is_incomplete
= await self.workflow.persistence.incomplete(job)), and only add outputs to
self._evicted_checkpoint_outputs when that unconditional incomplete check is
false so incomplete-marker files are not promoted into
checkpoints.created_output (this touches job.is_checkpoint handling and uses
self.workflow.persistence.incomplete(job) and self._evicted_checkpoint_outputs).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/snakemake/dag.py`:
- Around line 1357-1363: The code currently only calls
self.workflow.persistence.incomplete(job) when has_metadata(job) is true, which
allows checkpoint outputs without metadata to be treated as complete; instead
call await self.workflow.persistence.incomplete(job) unconditionally for each
out in job.output (i.e., remove the has_metadata gate and compute is_incomplete
= await self.workflow.persistence.incomplete(job)), and only add outputs to
self._evicted_checkpoint_outputs when that unconditional incomplete check is
false so incomplete-marker files are not promoted into
checkpoints.created_output (this touches job.is_checkpoint handling and uses
self.workflow.persistence.incomplete(job) and self._evicted_checkpoint_outputs).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a958038-8f14-48bd-9c2a-463fc7dd101e

📥 Commits

Reviewing files that changed from the base of the PR and between dc3dc96 and 7519318.

📒 Files selected for processing (1)
  • src/snakemake/dag.py

@Hocnonsense Hocnonsense moved this to In review in Snakemake Hackathon 2026 Mar 9, 2026
@johanneskoester johanneskoester changed the title fix: uncreatable checkpoint input fix: fix checkpoint handling corner cases (#3870 and #3559) Mar 10, 2026
@johanneskoester johanneskoester merged commit 63f4257 into main Mar 10, 2026
59 checks passed
@johanneskoester johanneskoester deleted the fix/uncreatable_checkpoint_input branch March 10, 2026 21:21
@github-project-automation github-project-automation bot moved this from In review to Done in Snakemake Hackathon 2026 Mar 10, 2026
@Hocnonsense Hocnonsense self-assigned this Mar 10, 2026
@Hocnonsense Hocnonsense added the bug Something isn't working label Mar 10, 2026
johanneskoester pushed a commit that referenced this pull request Mar 13, 2026
🤖 I have created a release *beep* *boop*
---


##
[9.17.0](v9.16.3...v9.17.0)
(2026-03-13)


### Features

* Allow storing snakemake metadata in files or databases
([#4012](#4012))
([dd75f31](dd75f31))
* Allow to specify comparison command per-unit test
([#3956](#3956))
([b88171c](b88171c))
* job table orderd topological when run is started
([#4018](#4018))
([75cf506](75cf506))
* lambda functions for priority in rules
([#3253](#3253))
([d2aa226](d2aa226))
* Make on... directive of modules accessible
([#4050](#4050))
([e9f2e1c](e9f2e1c))


### Bug Fixes

* adjust conda tests to not fail on apple silicon; fix
[#4040](#4040)
([#4049](#4049))
([f5b0142](f5b0142))
* allow "--containerize apptainer" to output apptainer format instead of
dockerfile ([#4030](#4030))
([f5cac30](f5cac30))
* apptainer command not recognized when singularity is absent
([#4010](#4010))
([b8162e2](b8162e2))
* capture stderr when tests fail
([#3995](#3995))
([97d74ba](97d74ba))
* **docs:** make Data-dependent conditional execution a complete example
([#4043](#4043))
([3a1d7f2](3a1d7f2))
* don't build the DAG when running unlock. Fixes
[#4000](#4000) and
[#198](#198)
([#4007](#4007))
([acf79fd](acf79fd))
* Ensure pixi tasks may be run as advertised
([#4046](#4046))
([88253c2](88253c2))
* fix checkpoint handling corner cases
([#3870](#3870) and
[#3559](#3559))
([#4015](#4015))
([63f4257](63f4257))
* issue 3642
([#4054](#4054))
([76e6fc2](76e6fc2))
* issue 3815
([#4026](#4026))
([b0eec96](b0eec96))
* logging None in shellcmd context causes error
([#4064](#4064))
([d0652cd](d0652cd))
* lookup function returns default value for empty DataFrame queries
([#4056](#4056))
([f71de97](f71de97))
* make `cache: omit-software` a rule specific property
([#4085](#4085))
([034a9e7](034a9e7))
* reduce number of tests leaving temporary files behind
([#4033](#4033))
([a3a1c97](a3a1c97))
* regression in dynamic resource handling
([#4038](#4038))
([f2c554a](f2c554a))
* somewhat shorter announce message
([#4080](#4080))
([57efc71](57efc71))


### Performance Improvements

* switch reretry with tenacity; decouple container classes (with Python
3.7 compat for old scripts) from rest of the codebase (enabling moving
to newer python versions)
([#4032](#4032))
([ffb19e7](ffb19e7))


### Documentation

* Add AI-assisted contributions policy to contributing guidelines
([#4051](#4051))
([dd70526](dd70526))
* **codebase:** Update & simplify plugin architecture section
([#4052](#4052))
([176cf63](176cf63))
* Correct workflow.source_path() description in documentation
([#4036](#4036))
([45883c5](45883c5))
* fixed wrong code example for collect() function
([#4037](#4037))
([5c85ed8](5c85ed8))
* Minor docs improvements
([#4089](#4089))
([29ea226](29ea226))
* switch to sphinx_design for tabs
([#3976](#3976))
([9674614](9674614))
* typo in the migration table breaking a pip install command
([#4024](#4024))
([66f9dda](66f9dda))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Checkpoint with uncreatable input cannot goes on Endless rerun of following jobs with a checkpoint output updated.

2 participants