Race condition in drop_iocache() crashes parallel SLURM workers during DAG init #4153

@kaybrand

Snakemake version

9.16.3 (the relevant code on main as of 2026-04-14 is identical)

What is the issue?

Persistence.drop_iocache() in snakemake/persistence/__init__.py (lines 831-833) uses a check-then-remove pattern that is not atomic:

def drop_iocache(self) -> None:
    if os.path.exists(self._iocache_filename):
        os.remove(self._iocache_filename)

When the SLURM executor launches multiple worker jobs in parallel, each worker re-imports the workflow and calls drop_iocache() during DAG.init() (dag.py:282). Multiple workers race to delete .snakemake/iocache/latest.pkl. The first worker's os.remove() succeeds; subsequent workers can pass the os.path.exists() check (due to filesystem caching or scheduling) but then os.remove() raises FileNotFoundError because the file is already gone.

This is a classic TOCTOU (time-of-check-time-of-use) race condition.
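The interleaving can be forced deterministically in a single process. This is a minimal sketch, not Snakemake code: `racy_drop` and `other_worker_deletes` are hypothetical names, and a temporary file stands in for `.snakemake/iocache/latest.pkl`.

```python
import os
import tempfile


def racy_drop(path, between_check_and_remove=lambda: None):
    """Check-then-remove, with a hook that runs inside the race window."""
    if os.path.exists(path):
        # Stands in for another worker being scheduled between check and use.
        between_check_and_remove()
        os.remove(path)


def demo():
    """Delete the file from 'another worker' inside the window; report the outcome."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    os.close(fd)

    def other_worker_deletes():
        os.remove(path)

    try:
        racy_drop(path, other_worker_deletes)
    except FileNotFoundError:
        return "FileNotFoundError"
    return "no error"


if __name__ == "__main__":
    print(demo())
```

The hook makes explicit what scheduling or a stale metadata cache does implicitly: the check and the removal are two separate system calls, and nothing prevents the file from disappearing between them.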

Expected behavior

drop_iocache() should tolerate the file being absent at removal time. The worker should continue normally into rule execution.

Actual behavior

The worker crashes with FileNotFoundError during DAG initialization, before the rule's shell command ever runs. The SLURM job is marked FAILED even though the rule itself was never executed.

Reproducibility

Any Snakemake workflow that uses the SLURM executor with multiple parallel jobs can trigger this. I hit it once in about 70 invocations of Snakemake 9, each submitting up to 10 parallel SLURM jobs. The probability per invocation is low, but distributed execution on shared filesystems is a core Snakemake use case, so this will affect users at scale.

The race window is wider on networked/shared filesystems where client-side metadata caching means os.path.exists() can return a stale True after another process has already deleted the file.

A minimal workflow that can trigger it:

# Snakefile
rule all:
    input: expand("output/{sample}.txt", sample=range(20))

rule process:
    output: "output/{sample}.txt"
    shell: "echo {wildcards.sample} > {output}"

Invoked with:

snakemake --executor slurm --jobs 20 --default-resources slurm_account=myaccount slurm_partition=normal

Because the race depends on multiple workers calling drop_iocache() in a narrow overlapping window, reproduction is not guaranteed on any single run. But the code path is deterministic -- every SLURM worker hits drop_iocache() during DAG init -- so it will surface eventually in any environment running enough parallel jobs.
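For a local approximation that does not need a cluster, the overlap can be encouraged with a barrier. This is only a sketch under stated assumptions: threads stand in for SLURM worker processes, a local temporary directory stands in for the NFS mount, and `count_failures`, `racy_drop`, and `tolerant_drop` are hypothetical helper names. How often the racy variant fails depends on the machine, so no particular count should be expected.

```python
import os
import tempfile
import threading


def racy_drop(path):
    # Same check-then-remove shape as the reported code.
    if os.path.exists(path):
        os.remove(path)


def tolerant_drop(path):
    # Unconditional remove that ignores an already-missing file.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def count_failures(drop, workers=10, rounds=200):
    """Call `drop` from `workers` threads at once, `rounds` times.

    Returns how many calls raised FileNotFoundError.
    """
    failures = 0
    lock = threading.Lock()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")
        for _ in range(rounds):
            with open(path, "wb"):
                pass  # recreate the cache file for this round
            barrier = threading.Barrier(workers)

            def worker():
                nonlocal failures
                barrier.wait()  # release all workers together
                try:
                    drop(path)
                except FileNotFoundError:
                    with lock:
                        failures += 1

            threads = [threading.Thread(target=worker) for _ in range(workers)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
    return failures


if __name__ == "__main__":
    print("racy failures:    ", count_failures(racy_drop))
    print("tolerant failures:", count_failures(tolerant_drop))
```

On a local filesystem the racy count may well be zero on some runs; the useful invariant is that the tolerant variant never reports a failure.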

Full traceback from real failure

Building DAG of jobs...
Traceback (most recent call last):

  File ".../snakemake/cli.py", line 2194, in args_to_api
    dag_api.execute_workflow(

  File ".../snakemake/api.py", line 646, in execute_workflow
    workflow.execute(

  File ".../snakemake/workflow.py", line 1304, in execute
    self._build_dag()

  File ".../snakemake/workflow.py", line 1247, in _build_dag
    self.async_run(self.dag.init())

  File ".../snakemake/workflow.py", line 267, in async_run
    return runner.run(coro)

  File ".../asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)

  File ".../asyncio/base_events.py", line 650, in run_until_complete
    return future.result()

  File ".../snakemake/dag.py", line 282, in init
    self.workflow.persistence.drop_iocache()

  File ".../snakemake/persistence.py", line 834, in drop_iocache
    os.remove(filepath)

FileNotFoundError: [Errno 2] No such file or directory: '.../.snakemake/iocache/latest.pkl'

Supporting evidence this is a race, not a real failure

  • 10 SLURM worker jobs were submitted within a 33-second window (10:43:08 -- 10:43:41). 9 of 10 completed successfully. Only 1 crashed.
  • The crash occurs during DAG initialization (dag.py:282), before the rule's shell command runs. The actual awk/bgzip/tabix command was never executed on the failed worker.
  • The traceback points to library code (snakemake/persistence/__init__.py), not to any user workflow script.
  • The shared filesystem is NFS-mounted, which uses client-side metadata caching. This widens the TOCTOU window: a worker's os.path.exists() can return True from a cached stat even after another worker has already called os.remove() on the metadata server.

Suggested fix

Replace the check-then-remove with an unconditional remove that suppresses FileNotFoundError:

def drop_iocache(self) -> None:
    try:
        os.remove(self._iocache_filename)
    except FileNotFoundError:
        pass

Or equivalently with pathlib:

def drop_iocache(self) -> None:
    Path(self._iocache_filename).unlink(missing_ok=True)

Both close this race: there is no separate existence check, so the call either deletes the file or finds it already gone, and neither outcome raises.
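A regression test for this could simulate the stale-cache case by making `os.path.exists()` report `True` for a file that is not there. The sketch below uses standalone functions rather than the real `Persistence` class; the function names are hypothetical and `unittest.mock` is only one way to fake the stale answer.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def drop_check_then_remove(path):
    # Current pattern: existence check followed by remove.
    if os.path.exists(path):
        os.remove(path)


def drop_try_except(path):
    # First suggested fix.
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def drop_unlink(path):
    # Second suggested fix (missing_ok requires Python 3.8+).
    Path(path).unlink(missing_ok=True)


def raises_with_stale_exists(drop):
    """Return True if `drop` raises FileNotFoundError when exists() wrongly says True."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "latest.pkl")  # never created
        with mock.patch("os.path.exists", return_value=True):
            try:
                drop(missing)
            except FileNotFoundError:
                return True
    return False


if __name__ == "__main__":
    for fn in (drop_check_then_remove, drop_try_except, drop_unlink):
        print(fn.__name__, raises_with_stale_exists(fn))
```

Only `FileNotFoundError` is suppressed in either fix, so other failures such as a permissions problem on the cache directory would still surface.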

Note: the same TOCTOU pattern exists in load_iocache() (lines 825-829), where os.path.exists() is followed by open(). If the file is deleted between check and open, this would raise FileNotFoundError as well. A similar guard may be warranted there.
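One possible shape for that guard is to open the file directly and treat a missing file as "no cache". This is a sketch only: the real body of `load_iocache()` is not reproduced here, `load_cache_or_none` is a hypothetical standalone function, and the use of `pickle` is an assumption based on the `.pkl` extension.

```python
import os
import pickle
import tempfile


def load_cache_or_none(path):
    """Return the unpickled cache, or None if the file is absent at open time."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        # Another worker dropped the cache first; continue without it.
        return None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")
        with open(path, "wb") as f:
            pickle.dump({"example": 1}, f)
        print(load_cache_or_none(path))
        os.remove(path)
        print(load_cache_or_none(path))
```

This covers only the case where the file is gone when `open()` is called; a file that is truncated or partially written by another process would need separate handling.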

Environment

  • Snakemake: 9.16.3 (latest at time of failure; current main branch has identical code)
  • Executor: snakemake-executor-plugin-slurm
  • Python: 3.11
  • OS: RHEL 7 (Linux 3.10.0)
  • Filesystem: NFS-mounted shared storage
  • Parallelism: 10 SLURM worker jobs submitted near-simultaneously
