Snakemake version
9.16.3 (the relevant code on main as of 2026-04-14 is identical)
What is the issue?
Persistence.drop_iocache() in snakemake/persistence/__init__.py (lines 831-833) uses a check-then-remove pattern that is not atomic:
def drop_iocache(self) -> None:
    if os.path.exists(self._iocache_filename):
        os.remove(self._iocache_filename)
When the SLURM executor launches multiple worker jobs in parallel, each worker re-imports the workflow and calls drop_iocache() during DAG.init() (dag.py:282). Multiple workers race to delete .snakemake/iocache/latest.pkl. The first worker's os.remove() succeeds; subsequent workers can pass the os.path.exists() check (due to filesystem caching or scheduling) but then os.remove() raises FileNotFoundError because the file is already gone.
This is a classic TOCTOU (time-of-check-time-of-use) race condition.
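The interleaving described above can be reproduced deterministically outside Snakemake. The sketch below is not Snakemake code: `racy_drop`, its `between_check_and_remove` hook, and `demo_race` are hypothetical names introduced only for illustration. The hook stands in for a second worker that deletes the file after the first worker's existence check has passed.

```python
import os
import tempfile


def racy_drop(path, between_check_and_remove=None):
    """Check-then-remove, mirroring the pattern quoted above.

    The optional hook (hypothetical, for illustration only) runs inside
    the race window, between the existence check and the removal.
    """
    if os.path.exists(path):
        if between_check_and_remove is not None:
            between_check_and_remove()  # stands in for another worker
        os.remove(path)


def demo_race():
    """Return True if the simulated interleaving raises FileNotFoundError."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")
        with open(path, "wb") as fh:
            fh.write(b"cache")
        try:
            # The "other worker" deletes the file after our check passed.
            racy_drop(path, between_check_and_remove=lambda: os.remove(path))
        except FileNotFoundError:
            return True
        return False


if __name__ == "__main__":
    print("race reproduced:", demo_race())
```

In a real run the window is opened by scheduling or stale metadata rather than by an explicit hook, but the failing sequence of calls is the same.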
Expected behavior
drop_iocache() should tolerate the file being absent at removal time. The worker should continue normally into rule execution.
Actual behavior
The worker crashes with FileNotFoundError during DAG initialization, before the rule's shell command ever runs. The SLURM job is marked FAILED even though the rule itself was never executed.
Reproducibility
Any Snakemake workflow using the SLURM executor with multiple parallel jobs can trigger this. I encountered it once in about 70 invocations of Snakemake version 9, each releasing up to 10 parallel SLURM jobs at a time. The probability per invocation is low, but distributed execution on shared filesystems is a core Snakemake use case, so this will affect users at scale.
The race window is wider on networked/shared filesystems where client-side metadata caching means os.path.exists() can return a stale True after another process has already deleted the file.
A minimal workflow that can trigger it:
# Snakefile
rule all:
    input: expand("output/{sample}.txt", sample=range(20))

rule process:
    output: "output/{sample}.txt"
    shell: "echo {wildcards.sample} > {output}"
snakemake --executor slurm --jobs 20 --default-resources slurm_account=myaccount slurm_partition=normal
Because the race depends on multiple workers calling drop_iocache() in a narrow overlapping window, reproduction is not guaranteed on any single run. But the code path is deterministic -- every SLURM worker hits drop_iocache() during DAG init -- so it will surface eventually in any environment running enough parallel jobs.
Full traceback from real failure
Building DAG of jobs...
Traceback (most recent call last):
  File ".../snakemake/cli.py", line 2194, in args_to_api
    dag_api.execute_workflow(
  File ".../snakemake/api.py", line 646, in execute_workflow
    workflow.execute(
  File ".../snakemake/workflow.py", line 1304, in execute
    self._build_dag()
  File ".../snakemake/workflow.py", line 1247, in _build_dag
    self.async_run(self.dag.init())
  File ".../snakemake/workflow.py", line 267, in async_run
    return runner.run(coro)
  File ".../asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File ".../asyncio/base_events.py", line 650, in run_until_complete
    return future.result()
  File ".../snakemake/dag.py", line 282, in init
    self.workflow.persistence.drop_iocache()
  File ".../snakemake/persistence.py", line 834, in drop_iocache
    os.remove(filepath)
FileNotFoundError: [Errno 2] No such file or directory: '.../.snakemake/iocache/latest.pkl'
Supporting evidence this is a race, not a real failure
- 10 SLURM worker jobs were submitted within a 33-second window (10:43:08 -- 10:43:41). 9 of 10 completed successfully. Only 1 crashed.
- The crash occurs during DAG initialization (dag.py:282), before the rule's shell command runs. The actual awk/bgzip/tabix command was never executed on the failed worker.
- The traceback points to library code (snakemake/persistence/__init__.py), not to any user workflow script.
- The shared filesystem is NFS-mounted, which uses client-side metadata caching. This widens the TOCTOU window: a worker's os.path.exists() can return True from a cached stat even after another worker has already called os.remove() on the metadata server.
Suggested fix
Replace the check-then-remove with an unconditional remove that suppresses FileNotFoundError:
def drop_iocache(self) -> None:
    try:
        os.remove(self._iocache_filename)
    except FileNotFoundError:
        pass
Or equivalently with pathlib:
# requires: from pathlib import Path
def drop_iocache(self) -> None:
    Path(self._iocache_filename).unlink(missing_ok=True)
Both close this race: neither performs a separate existence check, so the call succeeds whether or not another process has already deleted the file.
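As a rough sanity check of the tolerant variant, the sketch below has several threads remove the same file at once. It is an approximation under stated assumptions: threads in one process on a local temporary directory stand in for SLURM workers on NFS, and `tolerant_drop` and `run_concurrent_drops` are hypothetical names, not Snakemake functions.

```python
import os
import tempfile
import threading


def tolerant_drop(path):
    """Remove path, ignoring the case where it is already gone."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def run_concurrent_drops(n_workers=10):
    """Let n_workers threads remove one file simultaneously.

    Returns (errors, still_there): exceptions raised by any worker, and
    whether the file still exists afterwards.
    """
    errors = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")
        with open(path, "wb") as fh:
            fh.write(b"cache")
        barrier = threading.Barrier(n_workers)

        def worker():
            barrier.wait()  # release all workers at the same moment
            try:
                tolerant_drop(path)
            except Exception as exc:  # record anything unexpected
                errors.append(exc)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        still_there = os.path.exists(path)
    return errors, still_there


if __name__ == "__main__":
    print(run_concurrent_drops())
```

Exactly one worker's removal succeeds and the others see FileNotFoundError, which is swallowed, so no worker fails regardless of ordering.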
Note: the same TOCTOU pattern exists in load_iocache() (lines 825-829), where os.path.exists() is followed by open(). If the file is deleted between check and open, this would raise FileNotFoundError as well. A similar guard may be warranted there.
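A guard for the load path could follow the same idea: attempt the open directly and treat a missing file as "no cache". The body of load_iocache() is not shown in this report, so the sketch below is a standalone illustration, not a patch. `load_cache_or_none` and `demo_load` are hypothetical names, and the use of pickle is an assumption based only on the .pkl file extension.

```python
import os
import pickle
import tempfile


def load_cache_or_none(path):
    """Open-and-load without a prior existence check.

    Returns None when the file is absent, whether it never existed or
    was deleted by another process just before the open.
    Assumption: the cache is a pickle file (inferred from ".pkl").
    """
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except FileNotFoundError:
        return None


def demo_load():
    """Return (value_when_present, value_when_absent) for a temp file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")
        with open(path, "wb") as fh:
            pickle.dump({"key": "value"}, fh)
        present = load_cache_or_none(path)
        os.remove(path)
        absent = load_cache_or_none(path)
    return present, absent


if __name__ == "__main__":
    print(demo_load())
```

This only covers the file vanishing before the open. A file deleted or rewritten while it is being read is a separate concern that this sketch does not address.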
Environment
- Snakemake: 9.16.3 (latest at time of failure; current main branch has identical code)
- Executor: snakemake-executor-plugin-slurm
- Python: 3.11
- OS: RHEL 7 (Linux 3.10.0)
- Filesystem: NFS-mounted shared storage
- Parallelism: 10 SLURM worker jobs submitted near-simultaneously