fix: fix checkpoint handling corner cases (#3870 and #3559) #4015
johanneskoester merged 11 commits into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
Tracks outputs of checkpoint jobs evicted when their inputs are missing by adding DAG._evicted_checkpoint_outputs and propagating those outputs into created_output during checkpoint dependency updates; adds a test exercising a checkpoint with an uncreatable input.
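The tracking described in the walkthrough can be sketched in miniature. `MiniDAG`, its method names, and the set-based bookkeeping below are illustrative assumptions for this sketch, not snakemake's actual classes:

```python
# Hedged sketch of the walkthrough's idea: the DAG records outputs of
# evicted checkpoint jobs and later folds them into created_output.
# Names mirror the description, not the real snakemake implementation.

class MiniDAG:
    def __init__(self):
        self._evicted_checkpoint_outputs = set()

    def evict_checkpoint(self, outputs_on_disk):
        # A checkpoint job is removed because an input is missing;
        # remember its already-existing outputs.
        self._evicted_checkpoint_outputs.update(outputs_on_disk)

    def update_checkpoint_dependencies(self, created_output):
        # Propagate evicted outputs so downstream checkpoint resolution
        # can still see them as created.
        created_output.update(self._evicted_checkpoint_outputs)
        return created_output

dag = MiniDAG()
dag.evict_checkpoint({"a.txt"})
print(sorted(dag.update_checkpoint_dependencies(set())))  # ['a.txt']
```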
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: 3 passed, 2 failed (1 warning, 1 inconclusive)
Please format your code with
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/dag.py`:
- Around line 1357-1360: The current logic adds files to
_evicted_checkpoint_outputs just because out.exists(), which can let stale files
satisfy downstream inputs. Change the guard so we only record outputs for
checkpoints that have actually completed previously: for example, require that
the output is present in the checkpoint's recorded created_output (or otherwise
verify checkpoint completion via Checkpoint.get()/a checkpoint.completed flag)
before adding to self._evicted_checkpoint_outputs. Update the block around
job.is_checkpoint/job.output/out.exists() to check for authoritative completion
evidence (e.g. out in checkpoint.created_output) instead of relying on plain
filesystem existence.
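A minimal sketch of the guard this comment suggests. The `Checkpoint` class, its `created_output` attribute, and the helper below are modeled on the comment's wording, not on snakemake's real API:

```python
# Hypothetical sketch: only record evicted outputs backed by
# authoritative completion evidence, not mere filesystem existence.

class Checkpoint:
    def __init__(self):
        # Outputs recorded when the checkpoint actually completed.
        self.created_output = set()

def record_evicted_outputs(job_outputs, checkpoint, existing_files, evicted):
    for out in job_outputs:
        # Plain existence is not enough: a stale file could satisfy
        # downstream inputs even though the checkpoint never finished.
        if out in existing_files and out in checkpoint.created_output:
            evicted.add(out)

cp = Checkpoint()
cp.created_output = {"a.txt"}
evicted = set()
# "stale.txt" exists on disk but was never recorded as created.
record_evicted_outputs(["a.txt", "stale.txt"], cp,
                       {"a.txt", "stale.txt"}, evicted)
print(evicted)  # {'a.txt'}
```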
In `@tests/test_uncreatable_checkpoint_input/Snakefile`:
- Around line 1-2: The Snakefile currently mutates the checkpoint output by
opening and writing "a.txt" during parsing (the two-line block that writes
"a\nb\n"), which updates its mtime every Snakemake run; remove that file-writing
from the Snakefile and instead perform the setup in the test harness or an
explicit preparation step (e.g., test setup function or fixture) that
creates/updates "a.txt" before invoking Snakemake so checkpoint behavior is
modeled by external changes rather than by Snakefile evaluation.
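A hedged sketch of the relocation this comment suggests: the setup helper name and temp-directory handling below are illustrative, not part of snakemake's test harness.

```python
# Sketch: create/update a.txt in test setup rather than during
# Snakefile parsing, so checkpoint behavior is driven by external
# changes. `prepare_checkpoint_input` is a hypothetical helper.
import os
import tempfile

def prepare_checkpoint_input(workdir):
    """Write a.txt before invoking Snakemake, keeping the Snakefile
    free of side effects at parse time."""
    path = os.path.join(workdir, "a.txt")
    with open(path, "w") as f:
        f.write("a\nb\n")
    return path

workdir = tempfile.mkdtemp()
path = prepare_checkpoint_input(workdir)
with open(path) as f:
    print(f.read().splitlines())  # ['a', 'b']
```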
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b7e94157-d903-4bfa-93e2-7e31f5b40aa6
📒 Files selected for processing (3)
- src/snakemake/dag.py
- tests/test_uncreatable_checkpoint_input/Snakefile
- tests/test_uncreatable_checkpoint_input/expected-results/files/b
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/tests.py`:
- Around line 1499-1500: Update test_uncreatable_checkpoint_input to reproduce
the rerun regression: perform the first successful run via
run(dpath("test_uncreatable_checkpoint_input")) while preserving the workdir
(use the existing preserve_workdir helper or avoid cleanup), then update the
timestamp on the checkpoint input file with os.utime(<path_to_a.txt>, None), and
then invoke a second dry-run/run (e.g., run(..., dry_run=True) or run(...)
again) and assert that the downstream rule is NOT rescheduled by checking the
run output/return (e.g., assert "Input files updated by another job" not in
output or ensure no downstream job appears in the scheduling output). Ensure you
reference the test function name test_uncreatable_checkpoint_input, use run and
dpath to locate the test tree, and touch a.txt with os.utime.
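The snakemake test helpers `run` and `dpath` referenced above are not available outside the test suite, so the runnable part below only demonstrates the `os.utime` touch step the comment prescribes; the surrounding test flow is described in comments as an assumption:

```python
# Sketch of the mtime-touch step: the actual test would first run the
# workflow via run(dpath("test_uncreatable_checkpoint_input")) with the
# workdir preserved, then touch a.txt as below, then assert that a
# second (dry) run does not reschedule the downstream rule.
import os
import tempfile
import time

fd, path = tempfile.mkstemp()
os.close(fd)
old = os.stat(path).st_mtime
time.sleep(0.01)
os.utime(path, None)  # touch: set mtime to "now", as the test would do to a.txt
new = os.stat(path).st_mtime
print(new >= old)  # True
os.remove(path)
```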
📒 Files selected for processing (2)
- src/snakemake/dag.py
- tests/tests.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/snakemake/dag.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/snakemake/dag.py`:
- Around line 1357-1363: The async method
self.workflow.persistence.incomplete(job) is being called without awaiting,
causing is_incomplete to be a coroutine and the condition to behave incorrectly;
update the checkpoint handling in the block that checks job.is_checkpoint (which
iterates over job.output and calls out.exists()) to await the incomplete call
(i.e., use await self.workflow.persistence.incomplete(job)) and ensure has_meta
is evaluated before awaiting incomplete if you want to avoid unnecessary awaits;
this will allow the subsequent if not is_incomplete: branch to run and correctly
add outputs to self._evicted_checkpoint_outputs.
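The failure mode described here is easy to reproduce standalone: calling an async method without `await` yields a coroutine object, which is always truthy. The `Persistence` stub below is an illustrative stand-in, not snakemake's persistence layer:

```python
# Demo: why `persistence.incomplete(job)` must be awaited.
import asyncio

class Persistence:
    async def incomplete(self, job):
        # Pretend this job is NOT incomplete.
        return False

async def main():
    p = Persistence()
    # Bug: without await, the call returns a coroutine object, which is
    # always truthy, so `is_incomplete` can never be False.
    is_incomplete = p.incomplete("job")
    buggy_truthy = bool(is_incomplete)
    is_incomplete.close()  # silence the "never awaited" warning

    # Fix: await the coroutine to obtain the real boolean.
    fixed = await p.incomplete("job")
    return buggy_truthy, fixed

print(asyncio.run(main()))  # (True, False)
```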
📒 Files selected for processing (1)
src/snakemake/dag.py
♻️ Duplicate comments (1)
src/snakemake/dag.py (1)
1357-1363: ⚠️ Potential issue | 🟠 Major: Don't ignore incomplete markers when metadata is missing.
Line 1361 only checks `persistence.incomplete(job)` if `has_metadata(job)` is true. But `incomplete(job)` already detects existing outputs that still have incomplete markers, so a checkpoint that failed before writing metadata can pass this guard, get added here, and then be promoted into `checkpoints.created_output` at lines 2187-2189. That lets `Checkpoint.get()` resolve against incomplete files instead of raising.
Proposed fix:

```diff
 if missing_input:
     if job.is_checkpoint:
-        for out in job.output:
-            if await out.exists():
-                has_meta = self.workflow.persistence.has_metadata(job)  # type: ignore[reportOptionalMemberAccess]
-                is_incomplete = has_meta and await self.workflow.persistence.incomplete(job)  # type: ignore[reportOptionalMemberAccess]
-                if not is_incomplete:
-                    self._evicted_checkpoint_outputs.add(out)
+        is_incomplete = bool(
+            await self.workflow.persistence.incomplete(job)  # type: ignore[reportOptionalMemberAccess]
+        )
+        if not is_incomplete:
+            for out in job.output:
+                if await out.exists():
+                    self._evicted_checkpoint_outputs.add(out)
         self.delete_job(job, recursive=False)  # delete job from tree
     raise MissingInputException(job, missing_input)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/snakemake/dag.py` around lines 1357 - 1363, The code currently only calls self.workflow.persistence.incomplete(job) when has_metadata(job) is true, which allows checkpoint outputs without metadata to be treated as complete; instead call await self.workflow.persistence.incomplete(job) unconditionally for each out in job.output (i.e., remove the has_metadata gate and compute is_incomplete = await self.workflow.persistence.incomplete(job)), and only add outputs to self._evicted_checkpoint_outputs when that unconditional incomplete check is false so incomplete-marker files are not promoted into checkpoints.created_output (this touches job.is_checkpoint handling and uses self.workflow.persistence.incomplete(job) and self._evicted_checkpoint_outputs).
📒 Files selected for processing (1)
src/snakemake/dag.py
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
🤖 I have created a release *beep* *boop*

---

## [9.17.0](v9.16.3...v9.17.0) (2026-03-13)

### Features

* Allow storing snakemake metadata in files or databases ([#4012](#4012)) ([dd75f31](dd75f31))
* Allow to specify comparison command per-unit test ([#3956](#3956)) ([b88171c](b88171c))
* job table orderd topological when run is started ([#4018](#4018)) ([75cf506](75cf506))
* lambda functions for priority in rules ([#3253](#3253)) ([d2aa226](d2aa226))
* Make on... directive of modules accessible ([#4050](#4050)) ([e9f2e1c](e9f2e1c))

### Bug Fixes

* adjust conda tests to not fail on apple silicon; fix [#4040](#4040) ([#4049](#4049)) ([f5b0142](f5b0142))
* allow "--containerize apptainer" to output apptainer format instead of dockerfile ([#4030](#4030)) ([f5cac30](f5cac30))
* apptainer command not recognized when singularity is absent ([#4010](#4010)) ([b8162e2](b8162e2))
* capture stderr when tests fail ([#3995](#3995)) ([97d74ba](97d74ba))
* **docs:** make Data-dependent conditional execution a complete example ([#4043](#4043)) ([3a1d7f2](3a1d7f2))
* don't build the DAG when running unlock. Fixes [#4000](#4000) and [#198](#198) ([#4007](#4007)) ([acf79fd](acf79fd))
* Ensure pixi tasks may be run as advertised ([#4046](#4046)) ([88253c2](88253c2))
* fix checkpoint handling corner cases ([#3870](#3870) and [#3559](#3559)) ([#4015](#4015)) ([63f4257](63f4257))
* issue 3642 ([#4054](#4054)) ([76e6fc2](76e6fc2))
* issue 3815 ([#4026](#4026)) ([b0eec96](b0eec96))
* logging None in shellcmd context causes error ([#4064](#4064)) ([d0652cd](d0652cd))
* lookup function returns default value for empty DataFrame queries ([#4056](#4056)) ([f71de97](f71de97))
* make `cache: omit-software` a rule specific property ([#4085](#4085)) ([034a9e7](034a9e7))
* reduce number of tests leaving temporary files behind ([#4033](#4033)) ([a3a1c97](a3a1c97))
* regression in dynamic resource handling ([#4038](#4038)) ([f2c554a](f2c554a))
* somewhat shorter announce message ([#4080](#4080)) ([57efc71](57efc71))

### Performance Improvements

* switch reretry with tenacity; decouple container classes (with Python 3.7 compat for old scripts) from rest of the codebase (enabling moving to newer python versions) ([#4032](#4032)) ([ffb19e7](ffb19e7))

### Documentation

* Add AI-assisted contributions policy to contributing guidelines ([#4051](#4051)) ([dd70526](dd70526))
* **codebase:** Update & simplify plugin architecture section ([#4052](#4052)) ([176cf63](176cf63))
* Correct workflow.source_path() description in documentation ([#4036](#4036)) ([45883c5](45883c5))
* fixed wrong code example for collect() function ([#4037](#4037)) ([5c85ed8](5c85ed8))
* Minor docs improvements ([#4089](#4089)) ([29ea226](29ea226))
* switch to sphinx_design for tabs ([#3976](#3976)) ([9674614](9674614))
* typo in the migration table breaking a pip install command ([#4024](#4024)) ([66f9dda](66f9dda))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Will fix #3870 and fix #3559.
QC
Summary by CodeRabbit
Bug Fixes
- Outputs of checkpoint jobs evicted due to missing inputs are now tracked and propagated into created_output during checkpoint dependency updates.
Tests
- Added a test exercising a checkpoint with an uncreatable input.