Skip to content

fix: tolerate FileNotFoundError in drop_iocache (#4153)#4191

Merged
johanneskoester merged 3 commits into
snakemake:mainfrom
SAY-5:fix/iocache-toctou-race
May 29, 2026
Merged

fix: tolerate FileNotFoundError in drop_iocache (#4153)#4191
johanneskoester merged 3 commits into
snakemake:mainfrom
SAY-5:fix/iocache-toctou-race

Conversation

@SAY-5

@SAY-5 SAY-5 commented May 12, 2026

Copy link
Copy Markdown
Contributor

Fixes #4153.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
  • I, as a human being, have checked each line of code in this pull request

AI-assistance disclosure

I used AI assistance for:

  • Code generation (e.g., when writing an implementation or fixing a bug)
  • Test/benchmark generation

The race in `Persistence.drop_iocache()` (`src/snakemake/persistence/init.py`) is a check-then-remove pattern. Multiple parallel SLURM workers can pass `os.path.exists()` for `.snakemake/iocache/latest.pkl` (especially on shared filesystems with stale metadata caching), then have `os.remove()` raise `FileNotFoundError` because another worker already deleted the file, crashing DAG init.

Replaced the check-then-remove with a try/except that swallows `FileNotFoundError`. Added two regression tests in `tests/test_persistence.py` covering both the missing-file case and the normal removal path.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved robustness of cache cleanup to gracefully handle missing files during operations.
  • Tests

    • Added test coverage for cache cleanup edge cases.

Review Change Stack

Replace check-then-remove with try/except so parallel SLURM workers
racing on .snakemake/iocache/latest.pkl don't crash when a sibling
worker has already removed the file.

Signed-off-by: SAY-5 <say.apm35@gmail.com>
@SAY-5 SAY-5 requested a review from johanneskoester as a code owner May 12, 2026 07:15
@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9bbdb443-eade-4b36-a06d-ca1b2132061b

📥 Commits

Reviewing files that changed from the base of the PR and between 384b4a3 and dafeb40.

📒 Files selected for processing (2)
  • src/snakemake/persistence/__init__.py
  • tests/test_persistence.py

📝 Walkthrough

Walkthrough

The PR fixes a TOCTOU race condition in drop_iocache() where parallel SLURM workers crash when multiple processes attempt concurrent deletion of the iocache file. The method now deletes unconditionally and suppresses FileNotFoundError instead of checking existence first, and new tests verify both missing and existing file scenarios.

Changes

IOCache Race Condition Fix

Layer / File(s) Summary
Race condition fix and regression tests
src/snakemake/persistence/__init__.py, tests/test_persistence.py
drop_iocache() replaces check-then-remove with unconditional deletion wrapped in try-except to suppress FileNotFoundError. Test imports add MagicMock support, and new TestDropIOCache class verifies the method is a no-op when the iocache file is absent and correctly removes it when present.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main change: replacing the check-then-remove pattern with a try/except that tolerates FileNotFoundError in drop_iocache().
Description check ✅ Passed The description is well-structured, includes all required QC checklist items properly checked, discloses AI assistance, explains the race condition and fix clearly, and references the linked issue.
Linked Issues check ✅ Passed The PR fully addresses issue #4153: it replaces the check-then-remove with try/except suppressing FileNotFoundError, includes regression tests for both missing-file and normal removal paths, and tolerates the file being absent at removal time as required.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the TOCTOU race in drop_iocache(): the conditional persistence code change and the addition of targeted regression tests. No unrelated modifications are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@cademirch cademirch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the PR, and welcome! :)

Will wait for CI to be green before merging.

@johanneskoester johanneskoester enabled auto-merge (squash) May 29, 2026 07:55
@johanneskoester johanneskoester merged commit ce26b28 into snakemake:main May 29, 2026
59 checks passed
johanneskoester pushed a commit that referenced this pull request May 29, 2026
🤖 I have created a release *beep* *boop*
---


##
[9.21.1](v9.21.0...v9.21.1)
(2026-05-29)


### Bug Fixes

* add default json function to benchmarks
([#4128](#4128))
([41fab22](41fab22))
* do not rerun when checkpoint job missing but downstream file exists
([#4124](#4124))
([a060b93](a060b93))
* ensure that error logs contain all available details
([#4183](#4183))
([74a86e9](74a86e9))
* handle missing pss attribute in benchmark on Windows
([#4160](#4160))
([da52080](da52080))
* implement Resources.setdefault
([#3968](#3968))
([2413e99](2413e99))
* reporting remote nodes number
([#3978](#3978))
([8c534f0](8c534f0))
* resolve pathvars before constructing storage queries
([#3969](#3969))
([bd15237](bd15237))
* storage temp() file cleanup with RemoteProvider
([#4189](#4189))
([898bad1](898bad1))
* tolerate FileNotFoundError in drop_iocache
([#4153](#4153))
([#4191](#4191))
([ce26b28](ce26b28))


### Documentation

* Added guide on debugging workflows
([#4029](#4029))
([3d052ae](3d052ae))
* **cli:** Remove broken ref bold markup
([#4204](#4204))
([1200ebf](1200ebf))
* remove duplicated resources attribute in rules.rst
([#4190](#4190))
([6c8ecdd](6c8ecdd))
* **rules:** Update script type hint advice
([#4193](#4193))
([6108712](6108712))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition in drop_iocache() crashes parallel SLURM workers during DAG init

3 participants