fix: storage temp() file cleanup with RemoteProvider #4189
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdd subprocess-exec early-return inside DAG.is_needed_tempfile; limit remote-storage consultation for is_unneeded_outside to remote runs; adjust handle_temp candidate iteration to a set comprehension; add tests covering downstream states, remote/subprocess modes, and a handle_temp regression. ChangesRemote Temp File Cleanup
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_remote_temp_cleanup.py`:
- Line 127: The assigned lambda "is_temp = lambda f: f.flags.get('temp', False)"
triggers Ruff E731; replace it with a named function (e.g., define function
is_temp(f) that returns f.flags.get("temp", False)) and use that function in
place of the lambda; update any references to "is_temp" unchanged so behavior
remains identical and ensure the new function is placed where the lambda was
defined (same scope).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 836baa99-9bda-46ee-a37a-196826962ec3
📒 Files selected for processing (3)
src/snakemake/dag.pysrc/snakemake/io/__init__.pytests/test_remote_temp_cleanup.py
The `is_needed_tempfile` method gated storage cleanup on `storage_settings.unneeded_temp_files`, which is only populated in spawned subprocesses via the `--unneeded-temp-files` CLI arg. In the main orchestrator process (where `handle_temp` runs), this set is always empty, so `is_unneeded_outside` was always False and temp files were never removed from remote storage. Remove the `remote_exec` branch that checked `unneeded_temp_files` and unconditionally set `is_unneeded_outside = True`. The main process has full DAG knowledge and already correctly determines whether a temp file is still needed via `is_needed_by_subsequent_job`. This works in conjunction with commit cb4e365 which added the `is_flagged(file, "temp")` condition to the `remove()` function, allowing `managed_remove()` to be called on storage-backed temp files. Add unit tests for `is_needed_tempfile` covering the remote execution regression case.
da60357 to
da16857
Compare
Python's set intersection (a & b) returns objects from the set being iterated (implementation-dependent, typically the smaller set). When tempfiles (output _IOFile with temp flag) is intersected with files (input _IOFile from _dependencies without temp flag), the result may contain the input _IOFile objects which lack the temp flag. This causes remove() to skip managed_remove() because is_flagged(file, 'temp') evaluates to False on the input _IOFile. Replace 'tempfiles & files' with a set comprehension that explicitly iterates tempfiles and checks membership in files, guaranteeing the output _IOFile objects (with temp and storage_object flags) are preserved.
da16857 to
5e42338
Compare
|
Hi @johanneskoester - I'm just checking to see if this PR needs any additions or edits? If so, please don't hesitate to let me know! |
There was a problem hiding this comment.
Pull request overview
This PR fixes remote-storage cleanup for temp() outputs by ensuring the main DAG process uses liveness information directly and preserves temp-flagged output file objects when selecting files to remove.
Changes:
- Updates
is_needed_tempfile()so remote temp cleanup is not blocked by an emptyunneeded_temp_filesset in the main process. - Replaces set intersection in
handle_temp()with a comprehension that preserves the output_IOFileinstance carrying thetempflag. - Adds regression-style unit tests for temp-file liveness and flag preservation behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/snakemake/dag.py |
Adjusts temp-file liveness and cleanup selection logic for remote storage cleanup. |
tests/test_remote_temp_cleanup.py |
Adds tests covering remote temp cleanup decisions and preserving temp-flagged file objects. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ain process and use unneeded_temp_files again
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_remote_temp_cleanup.py (1)
46-55:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winTest passes for the wrong reason due to missing
remote_exec = False.The fixture defaults
remote_exec=Truewithunneeded_temp_files=frozenset(). With these settings,is_unneeded_outsidebecomesFalse(line 972-974 in dag.py), causing the function to immediately returnTrueat line 997 — regardless of downstream consumers.To actually test the downstream-unfinished logic (the
is_needed_by_subsequent_jobcheck), this test needsremote_exec = Falseso thatis_unneeded_outside = Trueand the downstream check is exercised.Proposed fix
def test_temp_file_needed_when_downstream_unfinished(mock_dag): """A temp file is still needed if a consuming job has not finished.""" producer = MagicMock() consumer = MagicMock() tempfile = _make_tempfile("output.tmp") + mock_dag.workflow.remote_exec = False # test main process downstream logic mock_dag.depending[producer] = {consumer: {tempfile}} mock_dag._needrun.add(consumer) assert mock_dag.is_needed_tempfile(producer, tempfile)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_remote_temp_cleanup.py` around lines 46 - 55, The test currently exercises the downstream logic only accidentally because the fixture's default remote_exec=True makes is_unneeded_outside False and causes is_needed_tempfile to return True early; to fix, ensure the DAG is in remote_exec=False mode so is_unneeded_outside is True and the downstream check runs: update test_temp_file_needed_when_downstream_unfinished to set mock_dag.remote_exec = False (or obtain the fixture with remote_exec=False) before populating mock_dag.depending/_needrun and calling mock_dag.is_needed_tempfile so that the is_needed_by_subsequent_job path is actually exercised.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/test_remote_temp_cleanup.py`:
- Around line 46-55: The test currently exercises the downstream logic only
accidentally because the fixture's default remote_exec=True makes
is_unneeded_outside False and causes is_needed_tempfile to return True early; to
fix, ensure the DAG is in remote_exec=False mode so is_unneeded_outside is True
and the downstream check runs: update
test_temp_file_needed_when_downstream_unfinished to set mock_dag.remote_exec =
False (or obtain the fixture with remote_exec=False) before populating
mock_dag.depending/_needrun and calling mock_dag.is_needed_tempfile so that the
is_needed_by_subsequent_job path is actually exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f9af988b-99ef-4453-9557-810f63dacfad
📒 Files selected for processing (2)
src/snakemake/dag.pytests/test_remote_temp_cleanup.py
|
Thank you, @johanneskoester! Much appreciated! |
🤖 I have created a release *beep* *boop* --- ## [9.21.1](v9.21.0...v9.21.1) (2026-05-29) ### Bug Fixes * add default json function to benchmarks ([#4128](#4128)) ([41fab22](41fab22)) * do not rerun when checkpoint job missing but downstream file exists ([#4124](#4124)) ([a060b93](a060b93)) * ensure that error logs contain all available details ([#4183](#4183)) ([74a86e9](74a86e9)) * handle missing pss attribute in benchmark on Windows ([#4160](#4160)) ([da52080](da52080)) * implement Resources.setdefault ([#3968](#3968)) ([2413e99](2413e99)) * reporting remote nodes number ([#3978](#3978)) ([8c534f0](8c534f0)) * resolve pathvars before constructing storage queries ([#3969](#3969)) ([bd15237](bd15237)) * storage temp() file cleanup with RemoteProvider ([#4189](#4189)) ([898bad1](898bad1)) * tolerate FileNotFoundError in drop_iocache ([#4153](#4153)) ([#4191](#4191)) ([ce26b28](ce26b28)) ### Documentation * Added guide on debugging workflows ([#4029](#4029)) ([3d052ae](3d052ae)) * **cli:** Remove broken ref bold markup ([#4204](#4204)) ([1200ebf](1200ebf)) * remove duplicated resources attribute in rules.rst ([#4190](#4190)) ([6c8ecdd](6c8ecdd)) * **rules:** Update script type hint advice ([#4193](#4193)) ([6108712](6108712)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
fix:
temp()files not removed from remote storage (S3)When using
temp()with a remote storage backend (e.g. S3) and a remote executor (e.g. AWS Batch), temporary files are never cleaned up from storage after they are no longer needed. Two bugs combine to cause this:Bug 1 —
is_needed_tempfilealways returnsTruein the main process (6d7c8823)is_needed_tempfilegated cleanup onstorage_settings.unneeded_temp_files, which is only populated in spawned subprocesses via the--unneeded-temp-filesCLI arg. In the main orchestrator process (wherehandle_tempruns), this set is always empty, sois_unneeded_outsidewas alwaysFalseand temp files were never passed toremove(). The main process has full DAG knowledge and already correctly determines liveness viais_needed_by_subsequent_job, so the gate is unnecessary.Bug 2 — set intersection loses the
tempflag (4bf0afeb)handle_tempcomputes files to remove viatempfiles & files— a set intersection between output_IOFileobjects (carrying thetempflag) and input_IOFileobjects from_dependencies(without the flag). Since_IOFileinherits fromstr, both sides hash/compare equal on the path string alone. Python'sset.__and__does not guarantee which side's objects are returned; when it returns the input objects,is_flagged(file, "temp")inremove()evaluates toFalseandmanaged_remove()is skipped.Replaced with
{f for f in tempfiles if f in files}to explicitly preserve the output_IOFileinstances.QC
docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).AI-assistance disclosure
I used AI assistance for:
Summary by CodeRabbit
Bug Fixes
Tests