memory(action=replace) silently clobbers external writes to MEMORY.md (data-loss race with patch tool / shell appends / concurrent sessions) #26045

@jrhouston-trilogy

Summary

memory(action="replace") flushes the memory tool's full internal state to ~/.hermes/memories/MEMORY.md, silently overwriting any content that was written to the file by external writers (the patch tool, shell redirects, manual edits, other concurrent sessions). There is no merge, no conflict detection, and no warning when the on-disk file has drifted from the tool's view of it.

In practice this means: any agent that mixes memory(action=replace) with file-level edits to MEMORY.md has a latent data-loss bug. Two concurrent sessions on the same agent will hit it whenever one session writes to the file directly and the other then calls replace from a stale view.

Reproduction

  1. Agent has MEMORY.md with 3 entries on disk (total ~8KB), of which only 1 entry is currently in the memory tool's internal state (the others were written via patch tool or shell append in a prior session that has since exited).
  2. New session starts. System-prompt-injected memory shows only the 1 known entry.
  3. Model calls memory(action="replace", old_text="<substring>", content="<correction>") to update that entry.
  4. The memory tool applies the replacement to its 1-entry state and writes that 1-entry state to MEMORY.md.
  5. The other ~7KB of content on disk that the memory tool never knew about is gone. No error, no warning, no .bak rotation that includes the lost content (the rotation captures only the prior memory-tool state).
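
The five steps above can be modeled with a toy stand-in for the memory tool. This is a minimal sketch, not Hermes code: the ToyMemory class, the DELIM separator, and the reproduce helper are all hypothetical, and the only behavior assumed is the one described in this report (replace rewrites the file from in-memory state without re-reading it).

```python
from __future__ import annotations

import tempfile
from pathlib import Path

DELIM = "\n---\n"  # hypothetical entry separator; the real on-disk format may differ


class ToyMemory:
    """Hypothetical stand-in for the memory tool: the file is a dump of self.entries."""

    def __init__(self, path: Path, entries: list[str]) -> None:
        self.path = path
        self.entries = list(entries)  # the tool's view, seeded at session start

    def replace(self, old_text: str, content: str) -> None:
        self.entries = [content if old_text in e else e for e in self.entries]
        # Full-state flush: the file is overwritten and never re-read first.
        self.path.write_text(DELIM.join(self.entries))


def reproduce() -> tuple[int, int]:
    """Return (file size before, file size after) for the scenario in steps 1-5."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "MEMORY.md"
        known = "BB inbound emits duplicate webhook events"
        # Step 1: three entries on disk, two of them written externally.
        path.write_text(
            DELIM.join([known, "vendor master " * 200, "standing orders " * 200])
        )
        before = path.stat().st_size
        # Step 2: the new session only knows about one entry.
        tool = ToyMemory(path, [known])
        # Steps 3-4: replace updates that entry and flushes the 1-entry state.
        tool.replace("duplicate webhook", "BB typing indicator gets stuck on")
        # Step 5: the externally written entries are gone.
        return before, path.stat().st_size
```

Calling reproduce() returns a pair in which the second size is just the length of the single corrected entry, which mirrors the 469-character post-clobber file in the log excerpt further down.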

We reproduced this on a production agent ("Jason", running v0.13.0 / v2026.5.7) on 2026-05-14 at ~21:35 ET. The vendor master, standing orders, and open-orders sections that had been built via the patch tool in an earlier session were lost. The available backups predate that work, so they could not restore it.

Two concurrent sessions on the same agent made the race almost inevitable. Session A patched the file via the patch tool. Session B, which started afterward with a system prompt seeded only from the old memory state, then called memory(action="replace"), and B's flush clobbered A's writes.

Root cause (suspected)

The memory tool treats MEMORY.md as canonical-from-tool — i.e., the file is just a serialized view of the tool's internal entry list. But the file is also documented as something agents and users can write to directly (the v0.13 install runbook even uses cat >> ~/.hermes/memories/MEMORY.md <<EOF in onboarding). Those two contracts are incompatible without merge or locking.
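
To make the "merge" half of that statement concrete, here is a rough sketch of what re-reading the file before a flush could look like. Everything in it is an assumption for illustration: the DELIM separator, the merge_on_write name, and the idea that entries can be compared as exact strings. It is not how the memory tool currently works.

```python
from __future__ import annotations

DELIM = "\n---\n"  # hypothetical entry separator; the real on-disk format may differ


def merge_on_write(on_disk: str, baseline: list[str], updated: list[str]) -> str:
    """Build the text to write, keeping entries the tool never knew about.

    on_disk:  current file contents, re-read immediately before writing
    baseline: entries the tool loaded at session start
    updated:  entries after the tool applied its own edit
    """
    disk_entries = [e for e in on_disk.split(DELIM) if e.strip()]
    known = set(baseline)
    # Anything on disk that is neither in the session baseline nor in the
    # tool's new state is treated as an external write and carried over.
    external = [e for e in disk_entries if e not in known and e not in updated]
    return DELIM.join(updated + external)
```

This version errs on the side of keeping data: an entry that was edited externally shows up as "external" and is preserved next to the tool's copy rather than dropped. It does not handle external deletions, and it still needs a lock or an atomic rename around the read and write to be safe across concurrent sessions.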

Additionally, replace will accept a model-provided old_text that does not match any current entry in the tool's view. The call changes nothing meaningful but still flushes state, so a model that picks replace when add was correct (a frequent Anthropic-model behavior we've observed on this codepath) can shrink memory dramatically without raising any error.
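
A stricter contract for replace might look like the sketch below: reject the call unless old_text identifies exactly one entry, and return a new list instead of flushing anything. The strict_replace function and ReplaceError class are hypothetical names, and substring matching is an assumption based on the old_text="<substring>" usage in the reproduction steps.

```python
from __future__ import annotations


class ReplaceError(ValueError):
    """Raised when old_text does not identify exactly one entry."""


def strict_replace(entries: list[str], old_text: str, content: str) -> list[str]:
    """Return a copy of entries with the single matching entry replaced."""
    if not old_text:
        raise ReplaceError("old_text must be non-empty")
    matches = [i for i, entry in enumerate(entries) if old_text in entry]
    if len(matches) != 1:
        # Zero matches or an ambiguous match is an error, not a silent no-op.
        raise ReplaceError(
            f"old_text matched {len(matches)} entries; expected exactly 1"
        )
    updated = list(entries)  # leave the caller's list untouched
    updated[matches[0]] = content
    return updated
```

Because the error is raised before any state changes, a caller can surface it to the model as a tool error and skip the write entirely.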

Suggested fixes (any of these would help; durable fix is some combination)

  1. Merge-on-write. Before writing, re-read MEMORY.md from disk. Merge external entries (anything not in tool state) into the new write. Raise an error if a merge conflict is unresolvable.
  2. Guardrail on shrinkage. Refuse a write that would reduce file size by more than some threshold (e.g. 50%) without an explicit --force or model acknowledgment.
  3. Conflict detection via mtime/hash. Read mtime+hash at session start; before write, re-check. If file changed externally, raise an error (parallel to what the patch tool already does — Jason's earlier session got "file was modified since you last read it on disk" from patch, which is exactly the right behavior).
  4. Split the file. MEMORY.md for tool-managed entries, MEMORY_user.md for file-level/external content. Tool never touches the user file.
  5. Tighten replace semantics. If old_text doesn't uniquely match an existing entry, return an error instead of treating it as a no-op + state flush.
  6. Prompt-level mitigation (interim). The memory tool description says replace is for "update existing -- old_text identifies it" but Anthropic models still reach for replace when add is correct. Strengthen the description so the model defaults to add for new corrections and only uses replace when explicitly correcting an existing exact entry.
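
Fixes 2 and 3 can be combined in a small write guard. The following is a sketch under stated assumptions, not a proposed patch: GuardedFile, ExternalWriteError, ShrinkError, and the max_shrink and force parameters are hypothetical names, and it uses a content hash alone rather than mtime plus hash.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


class ExternalWriteError(RuntimeError):
    """The file changed on disk since this session last read or wrote it."""


class ShrinkError(RuntimeError):
    """The write would shrink the file by more than the allowed fraction."""


def digest(path: Path) -> str:
    """SHA-256 of the file contents, or an empty string if the file is missing."""
    return hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else ""


class GuardedFile:
    def __init__(self, path: Path | str, max_shrink: float = 0.5) -> None:
        self.path = Path(path)
        self.max_shrink = max_shrink
        self.seen = digest(self.path)  # snapshot taken at session start

    def write(self, text: str, force: bool = False) -> None:
        # Fix 3: refuse to write over content this session has never seen.
        if digest(self.path) != self.seen:
            raise ExternalWriteError(f"{self.path} was modified since last read")
        old_size = self.path.stat().st_size if self.path.exists() else 0
        new_bytes = text.encode("utf-8")
        # Fix 2: refuse large shrinkage unless the caller explicitly forces it.
        if not force and len(new_bytes) < old_size * (1 - self.max_shrink):
            raise ShrinkError(
                f"write would shrink {self.path} from {old_size} "
                f"to {len(new_bytes)} bytes"
            )
        self.path.write_bytes(new_bytes)
        self.seen = digest(self.path)
```

The check and the write are not atomic here, so two sessions can still interleave between them; a real implementation would likely pair this with a file lock or a write-to-temp-then-rename step.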

Forensic evidence

From the affected agent's agent.log:

2026-05-14 21:35:35,742 WARNING [20260514_213324_95af7566] run_agent: Tool memory returned error (0.00s): {"error": "content is required for 'replace' action.", "success": false}
2026-05-14 21:35:40,049 INFO    [20260514_213324_95af7566] run_agent: tool memory completed (0.00s, 469 chars)

(Model retried with full args after the first call missed content; second call succeeded and wrote 469 chars — the entire post-clobber MEMORY.md.)

Session JSONL shows the tool call:

{
  "action": "replace",
  "target": "memory",
  "old_text": "**BB inbound emits duplicate webhook events per iMessage**",
  "content": "**BB typing indicator gets stuck \"on\"...**"
}

The model's intent was to correct a single prior entry. The effect was to flush the tool's 1-entry state over a file containing ~8KB of externally written content the tool didn't know about.

Operational impact

For us: real data loss tonight (~2 hours of vendor-master / standing-orders work, recoverable only via MemPalace and session-transcript reconstruction). Tolerable on a pilot agent; would be a much bigger deal if this hit a long-running production agent with months of accreted file-level memory.

For the wider Hermes user base: any agent following the install runbook's cat >> MEMORY.md pattern is exposed.

Workaround we're adopting

Interim sentinel at top of MEMORY.md documenting the hazard:

<!-- DO NOT call memory(action=replace) — file has external-write sections. -->

Plus splitting our deep-memory pointers / vendor master into MemPalace drawers so they're not dependent on file-level survival.

Neither workaround is durable. The fix needs to land upstream.


Filed by an agent on the affected fleet. Happy to share full reproduction artifacts (sanitized session JSONL, agent.log slice, before/after file states) if useful.

    Labels

    P0: Critical (data loss, security, crash loop)
    comp/agent: Core agent loop, run_agent.py, prompt builder
    tool/memory: Memory tool and memory providers
    type/bug: Something isn't working
