Skip to content

fix(entity_registry): clean up .tmp sidecar on write or rename failure (#1373)#1408

Merged
igorls merged 1 commit into
MemPalace:developfrom
arnoldwender:fix/entity-registry-tmp-cleanup-on-failure
May 17, 2026
Merged

fix(entity_registry): clean up .tmp sidecar on write or rename failure (#1373)#1408
igorls merged 1 commit into
MemPalace:developfrom
arnoldwender:fix/entity-registry-tmp-cleanup-on-failure

Conversation

@arnoldwender

Copy link
Copy Markdown
Contributor

What and Why

#1215 made EntityRegistry.save() atomic via temp-file + fsync + os.replace. Crash-mid-write durability is correct — the previous registry stays intact on any failure. But the .tmp sidecar was left on disk if f.write() / f.flush() / os.fsync() / os.replace() raised. Subsequent saves overwrite the same temp filename so this does not grow unboundedly, but it litters the palace directory and obscures diagnostics: a user inspecting entity_registry.json.tmp after a crash cannot tell whether it is the in-flight write of an ongoing process or stale debris from a prior failure.

@igorls flagged this in #1373.

Root Cause

mempalace/entity_registry.py:328-338 — the write + chmod + os.replace block ran without an outer try/except. The with open(...) as f: block closes the fd on exit but does not unlink the file. If f.write(), f.flush(), or os.fsync() raised (disk full, perms flip, broken FUSE mount, IO error), the temp file stayed. If os.replace() raised, same outcome.

Reproduction

import os, pytest
from mempalace.entity_registry import EntityRegistry

registry = EntityRegistry.load(config_dir=tmp_path)
registry.seed(mode='personal', people=[{'name': 'Alice', 'relationship': 'friend', 'context': 'personal'}], projects=[])
registry.save()

# Force os.fsync to raise — simulates IO error after bytes are in page cache
real_fsync = os.fsync
os.fsync = lambda fd: (_ for _ in ()).throw(OSError('IO error'))
try:
    registry.seed(mode='personal', people=[{'name': 'Bob', ...}], projects=[])
    registry.save()
except OSError:
    pass
os.fsync = real_fsync

# Before this PR:
list(tmp_path.glob('entity_registry.json.tmp*'))  # ['entity_registry.json.tmp']  ← litter

# After this PR:
list(tmp_path.glob('entity_registry.json.tmp*'))  # []

Change Summary

  • mempalace/entity_registry.py:328-348 — wrap the write + chmod + os.replace block in try/except BaseException. On failure, tmp_path.unlink(missing_ok=True) and re-raise. The dir-fsync stays outside the try because it only runs on the successful-rename path.
  • Cleanup uses BaseException so KeyboardInterrupt mid-write also leaves a clean directory; the inner unlink itself is wrapped against an OSError race so cleanup cannot mask the original exception.
  • tests/test_entity_registry.py — augment test_save_preserves_previous_on_serialization_failure with an additional assert leftover == [] after the forced os.replace failure.
  • tests/test_entity_registry.py — new test_save_cleans_tmp_on_write_failure that forces os.fsync to raise (the gap between write and rename that the existing test does not cover) and asserts both the durability invariant (previous registry intact) and the cleanup invariant (no .tmp sidecar left).

Test Plan

Closes #1373.

MemPalace#1373)

MemPalace#1215 made `EntityRegistry.save()` atomic via temp-file + fsync + os.replace.
Crash-mid-write durability is correct: the previous registry stays intact on
any failure. But if `f.write()` / `f.flush()` / `os.fsync()` / `os.replace()`
raise (disk full, perms flip, broken FUSE mount, IO error), the `.tmp` sidecar
was left on disk. Subsequent saves overwrite the same path so it does not
grow unboundedly, but it litters the palace directory and obscures
diagnostics — a user inspecting `entity_registry.json.tmp` after a crash
cannot distinguish in-flight writes from stale debris.

Wrap the write+chmod+replace block in try/except. On any exception, attempt
`tmp_path.unlink(missing_ok=True)` before re-raising. The dir-fsync is
deliberately outside the try — that is durability for a successful rename,
not a write step that needs cleanup.

Tests:
- Augment `test_save_preserves_previous_on_serialization_failure` to also
  assert the .tmp sidecar is gone after a forced os.replace failure.
- New `test_save_cleans_tmp_on_write_failure` forces os.fsync to raise,
  covering the gap between write and rename that the existing test does
  not exercise.

Closes MemPalace#1373.
@arnoldwender

Copy link
Copy Markdown
Contributor Author

Rebased on upstream/develop (latest 1247e17, post-3.3.5 release). No logical changes — conflict-free rebase, only commit replay onto current develop. CI green pre-rebase; awaiting fresh CI run.

@arnoldwender arnoldwender force-pushed the fix/entity-registry-tmp-cleanup-on-failure branch from 02e0de9 to e6f7398 Compare May 10, 2026 10:57
@igorls igorls added this to the v3.3.6 milestone May 15, 2026
@igorls igorls merged commit e6c06ef into MemPalace:develop May 17, 2026
6 checks passed
arnoldwender pushed a commit to arnoldwender/mempalace that referenced this pull request May 24, 2026
Bumps version 3.3.5 → 3.3.6 across pyproject.toml, version.py, plugin
manifests (.claude-plugin/plugin.json, .claude-plugin/marketplace.json,
.codex-plugin/plugin.json), README badge, and uv.lock. Flips CHANGELOG.md
from ``[Unreleased]`` to ``[3.3.6] — 2026-05-24`` and backfills the
major user-facing entries that landed without changelog entries during
the cycle:

Features:
- MemPalace#1555 office-document mining via --mode extract + virtual line numbers
- MemPalace#1584 surgical closet pointers with date+line locators (Tier 6a)
- MemPalace#1558 + MemPalace#1560 within-wing hallways (entity co-occurrence graph)
- MemPalace#1565 cross-wing tunnels auto-promoted from hallways
- MemPalace#1578 Hebbian potentiation + Ebbinghaus decay on hallways/tunnels
- MemPalace#1236 API-tool transcripts auto-route to wing_api
- MemPalace#711 hooks.auto_save toggle for silent-mode sessions
- MemPalace#1605 COCA content-word filter for entity detection
- MemPalace#1557 case-insensitive entity matching at mine time
- MemPalace#1483 multilingual embeddings (embeddinggemma-300m) by default

Bug Fixes (selected, user-visible):
- MemPalace#1540 silent data loss in three unchunked upsert sites
- MemPalace#1538 paragraph chunker oversized chunks
- MemPalace#1554 per-file chunk cap too low for transcripts
- MemPalace#1562 Windows hook subprocess/ChromaDB deadlock
- MemPalace#1529 create_tunnel corrupted hyphenated wing names
- MemPalace#1424 save-hook truncated hyphenated project folders
- MemPalace#1383 KG cache duplicated graphs for symlinked/cased paths
- MemPalace#1466 silent symlink skip now logged
- MemPalace#1441 macOS stock-bash 3.2 hook compatibility
- MemPalace#1500 / MemPalace#1513 structured JSON-RPC errors on bad MCP input
- MemPalace#1523 VACUUM + FTS5 rebuild after repair
- MemPalace#1548 FTS5 validation at end of mine
- plus MemPalace#1216, MemPalace#1408, MemPalace#1438, MemPalace#1439, MemPalace#1445, MemPalace#1452, MemPalace#1459, MemPalace#1461, MemPalace#1466,
  MemPalace#1470, MemPalace#1477, MemPalace#1485, MemPalace#1500, MemPalace#1513, MemPalace#1528, MemPalace#1532, MemPalace#1543, MemPalace#1546, MemPalace#1585

Performance:
- MemPalace#1474 convo miner pre-fetches mined-set
- MemPalace#1487 rebuild_index progress callback
- MemPalace#1530 MCP cold-start diagnostics + opt-in warmup

Lint passes (ruff 0.15.14); mempalace-mcp entry point alignment
verified per RELEASING.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working storage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

entity_registry: .tmp sidecar leaks if write fails before os.replace

2 participants