fix(environments): write snapshot stores atomically to prevent corruption#5099
fix(environments): write snapshot stores atomically to prevent corruption#5099maymuneth wants to merge 4 commits into
Conversation
|
Correct direction — crash-midwrite leaving the snapshot store corrupt is a real operational hazard, and temp-file + rename is the right basic pattern. BUT this PR is weaker than the existing atomic-write patterns the codebase already uses. The PR cites
with open(tmp_path, "w") as handle:
handle.write(json.dumps(data))
handle.flush()
os.fsync(handle.fileno()) # <-- flush data to disk
os.replace(tmp_path, auth_file)
try:
dir_fd = os.open(str(auth_file.parent), os.O_RDONLY)
os.fsync(dir_fd) # <-- persist the rename too
os.close(dir_fd)
except OSError:
passThis PR: _tmp.write_text(json.dumps(data, indent=2))
_tmp.replace(_SNAPSHOT_STORE)
Process-level crashes (SIGKILL, OOM, Python exceptions) ARE handled correctly by the PR as-is — the OS completes the page-cache writes after the process dies. It's only hardware-level crashes (kernel panic, power loss, hard reboot) that expose the bug. Whether that's a meaningful risk depends on how operationally important snapshot IDs are. Given that the PR description frames this as a crash-recovery fix, it's worth going all the way: def _save_snapshots(data: Dict[str, str]) -> None:
_SNAPSHOT_STORE.parent.mkdir(parents=True, exist_ok=True)
_tmp = _SNAPSHOT_STORE.with_suffix(f".tmp.{os.getpid()}")
try:
with open(_tmp, "w") as f:
f.write(json.dumps(data, indent=2))
f.flush()
os.fsync(f.fileno())
_tmp.replace(_SNAPSHOT_STORE)
except Exception:
_tmp.unlink(missing_ok=True)
raiseThree small additions vs the PR:
Without (1) the fix doesn't deliver its stated goal. Core pattern is correct. Promote to the |
Resolved conflicts: - plugins/memory/openviking/__init__.py: kept our is_available() .env fallback, adopted upstream env-var defaults for account/user - plugins/memory/byterover/__init__.py: adopted upstream synchronous prefetch (replaces threaded queue_prefetch pattern) - hermes_cli/main.py: adopted upstream find_gateway_pids() approach for multi-gateway restart (replaces per-service-type restart block) - hermes_cli/__init__.py + pyproject.toml: version 0.7.0 (auto-resolved) Key upstream features: - Memory provider tools routed in sequential execution (NousResearch#4803) - Clean user message for memory operations (NousResearch#5099) - Duplicate message prevention (gateway dedup) (NousResearch#4878) - O(n^2) regex backtracking fix (831067c) - OpenViking tenant-scoping headers (NousResearch#4825)
What does this PR do?
Both
tools/environments/modal.pyandtools/environments/singularity.pypersist their snapshot ID mappings using
write_text()directly — anon-atomic operation. If the process is killed mid-write (crash, OOM,
SIGKILL), the snapshot store file is left partially written and corrupt.
On the next run,
json.loads()fails to parse the corrupt file — alltracked sandbox snapshot IDs are lost. This means Modal/Singularity
environments can no longer resume from their snapshots and must rebuild
from scratch, wasting significant time and compute.
Fix
Write to a
.tmpsibling file first, then usePath.replace()whichis atomic on POSIX systems (rename syscall). Applied to both files.
This is consistent with the atomic write pattern already used in
hermes_cli/auth.py,gateway/run.py, andtools/skill_manager_tool.py.Files Changed
tools/environments/modal.py—modal_snapshots.jsontools/environments/singularity.py—singularity_snapshots.jsonType of Change
Checklist