Race condition in `drop_iocache()` crashes parallel SLURM workers during DAG init

### Snakemake version

9.16.3 (the relevant code on `main` as of 2026-04-14 is identical)

### What is the issue?

`Persistence.drop_iocache()` in `snakemake/persistence/__init__.py` (lines 831-833) uses a check-then-remove pattern that is not atomic:

```python
def drop_iocache(self) -> None:
    if os.path.exists(self._iocache_filename):
        os.remove(self._iocache_filename)
```

When the SLURM executor launches multiple worker jobs in parallel, each worker re-imports the workflow and calls `drop_iocache()` during `DAG.init()` (dag.py:282), so multiple workers race to delete `.snakemake/iocache/latest.pkl`. The first worker's `os.remove()` succeeds. A later worker can still pass the `os.path.exists()` check (because of stale cached filesystem metadata or an unlucky scheduling interleaving), and its `os.remove()` then raises `FileNotFoundError` because the file is already gone.

This is a classic TOCTOU (time-of-check to time-of-use) race condition.
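
The interleaving can be forced deterministically in a single process by performing a second worker's removal between the check and the remove. This is only a minimal sketch: `check_then_remove`, its `between` hook, and `simulate_race` are hypothetical names for illustration and are not part of Snakemake, and a temporary file stands in for the real cache file.

```python
import os
import tempfile


def check_then_remove(path, between=None):
    """Mimic the check-then-remove pattern; `between` simulates another worker."""
    if os.path.exists(path):
        if between is not None:
            between()  # another process acts inside the race window
        os.remove(path)


def simulate_race():
    """Return the name of the exception raised when a rival deletes the file first."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")  # stand-in for the cache file
        with open(path, "wb"):
            pass  # create an empty file
        try:
            check_then_remove(path, between=lambda: os.remove(path))
        except FileNotFoundError as exc:
            return type(exc).__name__
        return None


print(simulate_race())  # prints "FileNotFoundError"
```

In a real run the "rival" is a separate SLURM worker on another node, so the window is only hit occasionally rather than on every call.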

### Expected behavior

`drop_iocache()` should tolerate the file being absent at removal time. The worker should continue normally into rule execution.

### Actual behavior

The worker crashes with `FileNotFoundError` during DAG initialization, before the rule's shell command ever runs. The SLURM job is marked FAILED even though the rule itself was never executed.

### Reproducibility

Any Snakemake workflow using the SLURM executor with multiple parallel jobs can trigger this. I encountered it once in about 70 invocations of Snakemake 9, each releasing up to 10 parallel SLURM jobs at a time. The probability per invocation is low, but distributed execution on shared filesystems is a core Snakemake use case, so this will affect users at scale.

The race window is wider on networked/shared filesystems where client-side metadata caching means `os.path.exists()` can return a stale `True` after another process has already deleted the file.

A minimal workflow that can trigger it:

```python
# Snakefile
rule all:
    input: expand("output/{sample}.txt", sample=range(20))

rule process:
    output: "output/{sample}.txt"
    shell: "echo {wildcards.sample} > {output}"
```

```bash
snakemake --executor slurm --jobs 20 --default-resources slurm_account=myaccount slurm_partition=normal
```

Because the race depends on multiple workers calling `drop_iocache()` in a narrow overlapping window, reproduction is not guaranteed on any single run. However, the code path is deterministic: every SLURM worker hits `drop_iocache()` during DAG init, so the failure is likely to surface eventually in any environment running enough parallel jobs.

### Full traceback from real failure

```
Building DAG of jobs...
Traceback (most recent call last):

  File ".../snakemake/cli.py", line 2194, in args_to_api
    dag_api.execute_workflow(

  File ".../snakemake/api.py", line 646, in execute_workflow
    workflow.execute(

  File ".../snakemake/workflow.py", line 1304, in execute
    self._build_dag()

  File ".../snakemake/workflow.py", line 1247, in _build_dag
    self.async_run(self.dag.init())

  File ".../snakemake/workflow.py", line 267, in async_run
    return runner.run(coro)

  File ".../asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)

  File ".../asyncio/base_events.py", line 650, in run_until_complete
    return future.result()

  File ".../snakemake/dag.py", line 282, in init
    self.workflow.persistence.drop_iocache()

  File ".../snakemake/persistence.py", line 834, in drop_iocache
    os.remove(filepath)

FileNotFoundError: [Errno 2] No such file or directory: '.../.snakemake/iocache/latest.pkl'
```

### Supporting evidence this is a race, not a real failure

- Ten SLURM worker jobs were submitted within a 33-second window (10:43:08 to 10:43:41). Nine completed successfully and only one crashed.
- The crash occurs during DAG initialization (`dag.py:282`), **before** the rule's shell command runs. The actual `awk`/`bgzip`/`tabix` command was never executed on the failed worker.
- The traceback points to Snakemake library code (`drop_iocache()` in the persistence module), not to any user workflow script.
- The shared filesystem is NFS-mounted, which uses client-side metadata caching. This widens the TOCTOU window: a worker's `os.path.exists()` can return `True` from a cached `stat` even after another worker has already called `os.remove()` on the metadata server.

### Suggested fix

Replace the check-then-remove with an unconditional remove that suppresses `FileNotFoundError`:

```python
def drop_iocache(self) -> None:
    try:
        os.remove(self._iocache_filename)
    except FileNotFoundError:
        pass
```

Or equivalently with `pathlib`:

```python
from pathlib import Path  # needed at module level if not already imported

def drop_iocache(self) -> None:
    Path(self._iocache_filename).unlink(missing_ok=True)
```

Both variants eliminate this race: there is no separate existence check, so the removal is a single attempt, and a file that another process has already deleted is treated as success rather than as an error.
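
As a rough sanity check, the race-tolerant variant can be exercised with several threads that all try to remove the same file at once. This is a sketch under assumptions: `remove_if_present` and `run_concurrent_removals` are hypothetical helper names, threads on a local temporary directory stand in for SLURM workers, and the test does not reproduce NFS metadata caching.

```python
import os
import tempfile
import threading


def remove_if_present(path):
    """Remove `path`, treating an already-missing file as success."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


def run_concurrent_removals(n_workers=8):
    """Let n_workers threads remove one file together; return (errors, still_exists)."""
    errors = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "latest.pkl")  # stand-in for the cache file
        with open(path, "wb"):
            pass  # create an empty file
        barrier = threading.Barrier(n_workers)

        def worker():
            barrier.wait()  # release all threads at the same moment
            try:
                remove_if_present(path)
            except OSError as exc:
                errors.append(exc)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
        still_exists = os.path.exists(path)
    return errors, still_exists
```

Calling `run_concurrent_removals()` should return an empty error list and `False` for the file's existence, whichever thread wins the removal.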

Note: the same TOCTOU pattern exists in `load_iocache()` (lines 825-829), where `os.path.exists()` is followed by `open()`. If the file is deleted between check and open, this would raise `FileNotFoundError` as well. A similar guard may be warranted there.
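
A guard for the load path could follow the same shape: attempt the open and treat a missing file as "no cache". The body of `load_iocache()` is not shown in this report, so the sketch below is an assumption rather than the actual Snakemake code. `load_pickle_if_present` is a hypothetical helper, and it assumes the cache is a plain pickle file, as the `latest.pkl` name suggests.

```python
import pickle


def load_pickle_if_present(path):
    """Return the unpickled object, or None if the file does not exist."""
    try:
        # Open directly instead of checking os.path.exists() first.
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        return None  # another process removed the file, or it never existed
```

Returning `None` lets the caller fall back to rebuilding the cache, which matches what the `os.path.exists()` check already does when the file is absent.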

### Environment

- **Snakemake**: 9.16.3 (latest at time of failure; current main branch has identical code)
- **Executor**: `snakemake-executor-plugin-slurm`
- **Python**: 3.11
- **OS**: RHEL 7 (Linux 3.10.0)
- **Filesystem**: NFS-mounted shared storage
- **Parallelism**: 10 SLURM worker jobs submitted near-simultaneously

