Skip to content

fix(backup): atomically restore cron/jobs.json via tempfile + rename#35594

Open
sprmn24 wants to merge 1 commit into
NousResearch:mainfrom
sprmn24:fix/atomik-olmayan-restore-cron
Open

fix(backup): atomically restore cron/jobs.json via tempfile + rename#35594
sprmn24 wants to merge 1 commit into
NousResearch:mainfrom
sprmn24:fix/atomik-olmayan-restore-cron

Conversation

@sprmn24

@sprmn24 sprmn24 commented May 31, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

restore_cron_jobs_if_emptied() in hermes_cli/backup.py uses
shutil.copy2(snap_path, live_path) to restore cron/jobs.json when a
hermes update empties it. shutil.copy2 is not atomic: if the process
is killed mid-copy (SIGKILL, OOM, power loss) the live file is left partially
written — corrupt JSON that is worse than the empty file the function was
trying to fix.

This PR replaces the copy with the atomic pattern already used by save_jobs()
in cron/jobs.py: mkstempwrite + fsyncos.replace. The temp file
is created in the same directory as the target so the final rename is guaranteed
to be on the same filesystem (atomic). On any failure the temp file is removed
and the live file is left untouched.

Related Issue

Fixes #34600 (follow-up hardening — the restore path introduced by #34840
inherited the non-atomic copy)

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • hermes_cli/backup.py: replace shutil.copy2(snap_path, live_path) with
    mkstemp + write + fsync + os.replace in restore_cron_jobs_if_emptied()

How to Test

  1. Create a few cron jobs (hermes cron add ...)
  2. Take a quick snapshot: run hermes update once so a pre-update snapshot exists
  3. Empty ~/.hermes/cron/jobs.json manually (echo '{jobs:[]}' > ...)
  4. Call restore_cron_jobs_if_emptied(snapshot_id) — confirm jobs are restored
  5. To verify atomicity: add a time.sleep() after fdopen write and kill the
    process mid-sleep — confirm no partial .cron_restore_*.tmp file remains and
    jobs.json is unchanged

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Windows 11

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Same atomic pattern already merged in this repo:

restore_cron_jobs_if_emptied used shutil.copy2 which is non-atomic:
a crash mid-copy leaves a partially-written jobs.json, worse than the
empty file the function was trying to fix.

Replace with mkstemp + write + fsync + os.replace so the live file is
either fully restored or untouched. Mirrors the pattern already used by
save_jobs in cron/jobs.py (refs NousResearch#30857, NousResearch#30855, NousResearch#32233).
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cron Cron scheduler and job management comp/cli CLI entry point, hermes_cli/, setup wizard labels May 31, 2026

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Review

Replaces a non-atomic shutil.copy2 with the correct atomic pattern (mkstemp → write → fsync → os.replace) for cron job restore, preventing corrupt JSON on crash during restore.

✅ Looks Good

  • Correct pattern: Uses the same atomic write already proven in save_jobs() (cron/jobs.py).
  • Proper cleanup: Temp file removed on any failure via try/except with os.unlink.
  • Minimal change: 15 additions, 1 deletion. Straightforward and well-documented.
  • References existing precedent: Same pattern used in #30857, #30855, #32233.

Reviewed by Hermes Agent (cron job)

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Reviewed Changes

  • hermes_cli/backup.py:restore_cron_jobs_if_emptied — Replaced shutil.copy2(snap_path, live_path) with atomic write pattern: tempfile.mkstemp → write → os.fsyncos.replace.

✅ Looks Good

  • Correctness: shutil.copy2 is not atomic — a partial write or concurrent reader could see a corrupted jobs.json. The new pattern uses tempfile.mkstemp + os.fsync + os.replace (which is atomic on POSIX) to ensure the target file is either fully the new content or fully the old content.
  • Error handling: On any BaseException, the temp file is cleaned up with os.unlink (wrapped in try/except OSError). Good defensive cleanup.
  • Edge cases: os.replace is cross-platform atomic on both Linux and macOS. The exclusive lock on live_path (via portalocker) is held during this operation in the calling context. Safe.
  • No security concerns: No secrets or credential exposure.
  • Test coverage: Tested by existing tests/hermes_cli/test_backup.py suite which validates restore behavior.

Reviewed by Hermes Agent

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Overview

Fixes non-atomic cron job restore in restore_cron_jobs_if_emptied(). Replaces shutil.copy2() (which can leave partially-written corrupt JSON if interrupted mid-copy) with the atomic pattern: mkstemp + write + fsync + os.replace. Temp file is cleaned up on failure.

Looks Good

  • Used same atomic pattern as save_jobs() in cron/jobs.py — consistency
  • os.replace() is atomic on the same filesystem (temp file in same directory)
  • Temp file cleaned up on any exception via try/except BaseException
  • Well-documented with clear before/after behavior
  • Low blast radius: only affects the automated cron restore path

Reviewed by Hermes Agent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/cron Cron scheduler and job management P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(cron): config migration (23 to 24) may silently clear cron/jobs.json

3 participants