Skip to content

fix(environments): write snapshot stores atomically to prevent corruption#5099

Open
maymuneth wants to merge 4 commits into
NousResearch:mainfrom
maymuneth:fix/environments-snapshot-atomic-write
Open

fix(environments): write snapshot stores atomically to prevent corruption#5099
maymuneth wants to merge 4 commits into
NousResearch:mainfrom
maymuneth:fix/environments-snapshot-atomic-write

Conversation

@maymuneth

Copy link
Copy Markdown
Contributor

What does this PR do?

Both tools/environments/modal.py and tools/environments/singularity.py
persist their snapshot ID mappings using write_text() directly — a
non-atomic operation. If the process is killed mid-write (crash, OOM,
SIGKILL), the snapshot store file is left partially written and corrupt.

On the next run, json.loads() fails to parse the corrupt file — all
tracked sandbox snapshot IDs are lost. This means Modal/Singularity
environments can no longer resume from their snapshots and must rebuild
from scratch, wasting significant time and compute.

Fix

Write to a .tmp sibling file first, then use Path.replace() which
is atomic on POSIX systems (rename syscall). Applied to both files.

# Before (non-atomic)
_SNAPSHOT_STORE.write_text(json.dumps(data, indent=2))

# After (atomic)
_tmp = _SNAPSHOT_STORE.with_suffix(".tmp")
_tmp.write_text(json.dumps(data, indent=2))
_tmp.replace(_SNAPSHOT_STORE)

This is consistent with the atomic write pattern already used in
hermes_cli/auth.py, gateway/run.py, and tools/skill_manager_tool.py.

Files Changed

  • tools/environments/modal.pymodal_snapshots.json
  • tools/environments/singularity.pysingularity_snapshots.json

Type of Change

  • 🐛 Bug fix (reliability)

Checklist

  • Read the Contributing Guide
  • Commit messages follow Conventional Commits
  • Consistent with existing atomic write pattern in Hermes
  • No behavior change under normal operation
  • Prevents snapshot state loss after process crash

@trevorgordon981

Copy link
Copy Markdown

Correct direction — crash-midwrite leaving the snapshot store corrupt is a real operational hazard, and temp-file + rename is the right basic pattern. BUT this PR is weaker than the existing atomic-write patterns the codebase already uses. The PR cites hermes_cli/auth.py as the model, so let's compare:

hermes_cli/auth.py (existing, correct):

with open(tmp_path, "w") as handle:
    handle.write(json.dumps(data))
    handle.flush()
    os.fsync(handle.fileno())       # <-- flush data to disk
os.replace(tmp_path, auth_file)
try:
    dir_fd = os.open(str(auth_file.parent), os.O_RDONLY)
    os.fsync(dir_fd)                # <-- persist the rename too
    os.close(dir_fd)
except OSError:
    pass

This PR:

_tmp.write_text(json.dumps(data, indent=2))
_tmp.replace(_SNAPSHOT_STORE)

Path.write_text() does not fsync — it writes to the page cache and returns. replace() then renames, which IS durable for metadata on journaled filesystems, but the data behind the filename may still be in the page cache. If the kernel panics or the machine loses power between these two steps, you can end up with:

  • A zero-length target file (rename committed but data pages hadn't been flushed)
  • A target file pointing at old data (rename happened with new name, new contents not flushed)

Process-level crashes (SIGKILL, OOM, Python exceptions) ARE handled correctly by the PR as-is — the OS completes the page-cache writes after the process dies. It's only hardware-level crashes (kernel panic, power loss, hard reboot) that expose the bug. Whether that's a meaningful risk depends on how operationally important snapshot IDs are.

Given that the PR description frames this as a crash-recovery fix, it's worth going all the way:

def _save_snapshots(data: Dict[str, str]) -> None:
    _SNAPSHOT_STORE.parent.mkdir(parents=True, exist_ok=True)
    _tmp = _SNAPSHOT_STORE.with_suffix(f".tmp.{os.getpid()}")
    try:
        with open(_tmp, "w") as f:
            f.write(json.dumps(data, indent=2))
            f.flush()
            os.fsync(f.fileno())
        _tmp.replace(_SNAPSHOT_STORE)
    except Exception:
        _tmp.unlink(missing_ok=True)
        raise

Three small additions vs the PR:

  1. fsync before rename — ensures data is on disk, not just in page cache
  2. os.getpid() in tmp suffix — prevents two processes racing to write the same .tmp file (process A writes, process B overwrites, process A renames B's data)
  3. try/finally with unlink — cleans up stale .tmp files when the write fails partway (disk full, perm error, etc.)

Without (1) the fix doesn't deliver its stated goal.
Without (2) concurrent writers corrupt each other. Probably rare in practice for snapshot stores.
Without (3) you accumulate stale .tmp files over time.

Core pattern is correct. Promote to the auth.py level of rigor to actually close the crash-recovery gap.

zebster-cmd added a commit to zebster-cmd/hermes-agent that referenced this pull request Apr 18, 2026
Resolved conflicts:
- plugins/memory/openviking/__init__.py: kept our is_available() .env
  fallback, adopted upstream env-var defaults for account/user
- plugins/memory/byterover/__init__.py: adopted upstream synchronous
  prefetch (replaces threaded queue_prefetch pattern)
- hermes_cli/main.py: adopted upstream find_gateway_pids() approach
  for multi-gateway restart (replaces per-service-type restart block)
- hermes_cli/__init__.py + pyproject.toml: version 0.7.0 (auto-resolved)

Key upstream features:
- Memory provider tools routed in sequential execution (NousResearch#4803)
- Clean user message for memory operations (NousResearch#5099)
- Duplicate message prevention (gateway dedup) (NousResearch#4878)
- O(n^2) regex backtracking fix (831067c)
- OpenViking tenant-scoping headers (NousResearch#4825)
@alt-glitch alt-glitch added type/bug Something isn't working backend/modal Modal.com cloud execution backend/singularity Singularity container execution P2 Medium — degraded but workaround exists labels May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend/modal Modal.com cloud execution backend/singularity Singularity container execution P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants