Skip to content

fix: use atomic writes for persistent state files to prevent corruption#20532

Open
vominh1919 wants to merge 1 commit into
NousResearch:mainfrom
vominh1919:fix/non-atomic-write-text
Open

fix: use atomic writes for persistent state files to prevent corruption#20532
vominh1919 wants to merge 1 commit into
NousResearch:mainfrom
vominh1919:fix/non-atomic-write-text

Conversation

@vominh1919

Copy link
Copy Markdown
Contributor

Problem

Non-atomic write_text() calls can leave files in a partially-written state if the process is killed mid-write (OOM kill, power loss, SIGKILL). This causes:

  • JSON stores: json.loads() raises JSONDecodeError on next read, breaking all functionality that depends on the store
  • Text files: Truncated output, incomplete delivery records

The codebase already uses atomic writes in several places (run_agent.py via atomic_json_write(), gateway/run.py via tmp+replace, tools/mcp_oauth.py via tmp+rename). This PR brings the remaining persistent state writes to parity.

Fix

Replace write_text() with atomic write patterns for 6 locations across 3 files.

Files changed:

File Function What it writes Risk
tools/environments/base.py _save_json_store() Execution environment state HIGH — JSONDecodeError breaks all subsequent reads
tools/skills_hub.py InstalledSkillsRegistry.save() Installed skills metadata HIGH — loses all installed-skill tracking
tools/skills_hub.py TapsRegistry.save() Taps data MEDIUM — loses skill source repos
tools/skills_hub.py index cache write Skills hub index cache LOW — regeneratable from network
gateway/delivery.py _save_summary_output() Cron delivery output MEDIUM — truncated delivery record
gateway/delivery.py _save_full_output() Full cron output MEDIUM — truncated output

Implementation:

  • JSON writes: Reuse existing atomic_json_write() from utils.py (handles temp file + fsync + os.replace + symlink resolution)
  • Text writes (delivery.py): Added local _atomic_write_text() helper with the same pattern (tempfile + fsync + os.replace)

Before vs After

Scenario Before After
Process killed mid-write File left partially written, next read crashes Previous version intact, no data loss
Normal write Works Works (atomic replace is transparent)
Disk full during write Partial file + no error handling Temp file creation fails, original preserved

Tests

The fix is defensive-only — it cannot break existing behavior since atomic writes produce identical output to non-atomic writes under normal operation. The difference only matters under abnormal termination, which is difficult to test in unit tests.

References

  • utils.py:85atomic_json_write() implementation
  • run_agent.py:4377 — existing atomic write usage for session logs
  • gateway/run.py:4788 — existing tmp+replace pattern for response files
  • tools/mcp_oauth.py:165 — existing tmp+rename pattern for OAuth tokens

Non-atomic write_text() calls can leave files in a partially-written
state if the process is killed mid-write. This causes JSONDecodeError
on next read (for JSON stores) or truncated output (for text files).

Fixed 6 locations:
- tools/environments/base.py: _save_json_store() — JSON state for
  execution environments. Corruption breaks all subsequent reads.
- tools/skills_hub.py: InstalledSkillsRegistry.save() — installed
  skills metadata. Corruption loses all installed-skill tracking.
- tools/skills_hub.py: TapsRegistry.save() — taps data.
- tools/skills_hub.py: index cache write — skills hub index cache.
- gateway/delivery.py: _save_summary_output() — cron delivery output.
- gateway/delivery.py: _save_full_output() — full cron output.

For JSON writes, reuses the existing atomic_json_write() from utils.py
(which already handles temp file + fsync + os.replace + symlink
resolution). For text writes in delivery.py, added a local
_atomic_write_text() helper with the same pattern.

The codebase already uses atomic writes in several places:
- run_agent.py: atomic_json_write() for session logs
- gateway/run.py: tmp + replace for response/pending files
- tools/mcp_oauth.py: tmp + rename for OAuth tokens
This commit brings the remaining persistent state writes to parity.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cron Cron scheduler and job management tool/skills Skills system (list, view, manage) labels May 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cron Cron scheduler and job management comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists tool/skills Skills system (list, view, manage) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants