Skip to content

fix(compression): prevent session-id fork from concurrent compressions#34351

Merged
teknium1 merged 2 commits into
mainfrom
hermes/hermes-cb38d90d
May 29, 2026
Merged

fix(compression): prevent session-id fork from concurrent compressions#34351
teknium1 merged 2 commits into
mainfrom
hermes/hermes-cb38d90d

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Adds a state.db-backed per-session compression lock so two AIAgent instances sharing the same session_id can no longer both rotate it concurrently — eliminating the transcript-fork class that produced Damien's "parent → two child sessions, only one visible" incident.

The race (reproduced + repro shipped as a regression test)

  1. A turn ends at high context utilization (~145k of 272k).
  2. conversation_loop.py:4571 fires _spawn_background_review(messages_snapshot=list(messages)). The fork inherits agent.session_id verbatim (agent/background_review.py:451: review_agent.session_id = agent.session_id).
  3. The fork's run_conversation() hits preflight compression (agent/conversation_loop.py:604) and calls _compress_context. Inside, conversation_compression.py:381-389 ends the old session and creates a new one rooted at the parent's id.
  4. Meanwhile, a follow-up event arrives — Damien's was a Background process proc_… completed notification. The gateway reads the old session_entry.session_id (the fork's rotation lives only inside the fork's AIAgent instance), starts a new turn on the same id, hits its own preflight, rotates again.
  5. state.db now has TWO orphan children pointing at the same parent. The gateway SessionEntry only caught one rotation. The other child silently accumulates writes.

The fix

New table compression_locks (session_id PK, holder, acquired_at, expires_at) reconciled in via the existing declarative pattern (SCHEMA_VERSION 13 → 14). Three primitives on SessionDB:

  • try_acquire_compression_lock(session_id, holder, ttl_seconds=300) — atomic DELETE-expired + INSERT-or-IGNORE + SELECT-confirm, all in one _execute_write. Reclaims expired locks transparently so a crashed compressor doesn't permanently block the session.
  • release_compression_lock(session_id, holder) — DELETE WHERE matches; idempotent, holder-checked.
  • get_compression_lock_holder(session_id) — diagnostic only.

Wired into agent/conversation_compression.py::compress_context keyed on the OLD session_id (before rotation). Loser of the race returns messages unchanged, hits the existing len(returned) == len(input) short-circuit in auto-compress callers, and emits a one-time ⚠ Skipping concurrent compression warning. Winner releases the lock after the rotation + all post-rotation bookkeeping (memory manager, context engine, file dedup) completes.

Scope (all six compression entry points share the same code path)

Entry point Protected
conversation_loop preflight yes (shared compress_context)
conversation_loop mid-turn fallback (4 sites) yes
gateway/run.py hygiene compression yes
gateway/run.py /compact yes
cli.py CLI /compress yes
tui_gateway/server.py TUI /compress yes
acp_adapter/server.py ACP /compress already safe — explicitly nulls _session_db before the call
background_review fork yes (also goes through compress_context)

Validation

Check Result
tests/test_hermes_state_compression_locks.py (new) 12/12 pass
tests/agent/test_compression_concurrent_fork.py (new, Damien repro shape) 2/2 pass
tests/agent/test_context_compressor*.py, test_compress_focus.py, test_compression_*.py 120/120 pass
tests/test_hermes_state.py, tests/gateway/test_session_hygiene.py, test_compression_session_id_persistence.py 247/247 pass
tests/run_agent/test_background_review*.py, tests/cli/test_manual_compress.py 26/26 pass
tests/acp/test_server.py 78/78 pass
Repro WITHOUT the lock 2 children deterministically (orphan fork)
Repro WITH the lock 1 child, lock released cleanly

Schema

SCHEMA_VERSION 13 → 14. Existing DBs reconcile additively — no data migration, no behaviour change for sessions that aren't compressing. The new table is empty for everyone until a compression actually happens.

Infographic

Compression lock — session fork prevention

Companion to #34310

#34310 made the fallback survivable when summary generation fails. This PR closes the orthogonal fork class — same incident report (Damien, 2026-05-28), different mechanism. The two together fully resolve his Case 1 + Case 2.

teknium1 added 2 commits May 28, 2026 21:08
When two AIAgent instances share the same session_id (most commonly the
parent-turn agent and its background-review fork, which inherits
session_id verbatim via background_review.py L451), both can call
compress_context() on overlapping snapshots of the same conversation.
Each ends the parent and creates its own NEW child session in state.db,
both parented to the same old id. The gateway SessionEntry only catches
one rotation; the other becomes an orphan that silently accumulates
writes — Damien's incident shape (parent 20260527_234659_e65f0e → two
children, only one visible).

Adds a state.db-backed per-session compression lock. Acquired before
the rotation in conversation_compression.compress_context(); on
failure, the caller returns messages unchanged so the auto-compress
retry loop stops cleanly. TTL (5min default) reclaims locks abandoned
by crashed compressors. Lock holder identity (pid:tid:agent:nonce) is
preserved for diagnostics via get_compression_lock_holder().

Schema bumped 13 -> 14 to track the new compression_locks table.
Reconciled additively via the existing declarative-column pattern;
no data migration needed for existing DBs.

Regression test reproduces Damien's shape: two threads racing
_compress_context on a shared parent_sid. Without the lock the test
deterministically produces 2 child sessions; with the lock, exactly 1.

Covers all six compression entry points (preflight in conversation_loop,
mid-turn fallback, hygiene compression in gateway, /compact, CLI
/compress, TUI /compress). ACP /compress was already protected by
nulling out _session_db before its compress call.
@github-actions

Copy link
Copy Markdown
Contributor

🔎 Lint report: hermes/hermes-cb38d90d vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9588 on HEAD, 9583 on base (🆕 +5)

🆕 New issues (4):

Rule Count
unresolved-import 2
unresolved-attribute 2
First entries
tests/agent/test_compression_concurrent_fork.py:37: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_compression_concurrent_fork.py:85: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/agent/test_compression_concurrent_fork.py:79: [unresolved-attribute] unresolved-attribute: Unresolved attribute `context_compressor` on type `AIAgent`
tests/test_hermes_state_compression_locks.py:19: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues: none

Unchanged: 5052 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels May 29, 2026
@teknium1 teknium1 merged commit a30480b into main May 29, 2026
25 checks passed
@teknium1 teknium1 deleted the hermes/hermes-cb38d90d branch May 29, 2026 04:40
teknium1 added a commit that referenced this pull request May 29, 2026
…kew) (#34475)

A process running mismatched module versions — conversation_compression.py
re-imported with the post-#34351 lock code while a long-lived
hermes_state.SessionDB stays bound to the pre-#34351 class in memory — has
the try_acquire_compression_lock call site but not the method. The
AttributeError it raises is NOT a sqlite3.Error, so the method's own
fail-open guard never runs; the exception escapes to the outer agent loop,
which prints the error and retries. Compression never succeeds, the token
count never drops, and the loop re-triggers compaction forever (the
'API call #47/#48/#49 ... has no attribute try_acquire_compression_lock'
spin a user hit after an update).

Wrap the lock acquire so any unexpected exception fails OPEN: skip locking
and proceed with compression. Skipping the lock risks a rare
concurrent-compression session fork; an infinite no-progress loop that never
compresses at all is strictly worse. The remediation hint in the log points
at the real fix (restart / hermes update to resync the stale module).

Also guards get_compression_lock_holder against the same skew.

Adds a regression test simulating the version skew (real SessionDB wrapped
so only the lock methods raise AttributeError) — asserts _compress_context
proceeds and rotates instead of raising.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
NousResearch#34351)

* fix(compression): prevent session-id fork from concurrent compressions

When two AIAgent instances share the same session_id (most commonly the
parent-turn agent and its background-review fork, which inherits
session_id verbatim via background_review.py L451), both can call
compress_context() on overlapping snapshots of the same conversation.
Each ends the parent and creates its own NEW child session in state.db,
both parented to the same old id. The gateway SessionEntry only catches
one rotation; the other becomes an orphan that silently accumulates
writes — Damien's incident shape (parent 20260527_234659_e65f0e → two
children, only one visible).

Adds a state.db-backed per-session compression lock. Acquired before
the rotation in conversation_compression.compress_context(); on
failure, the caller returns messages unchanged so the auto-compress
retry loop stops cleanly. TTL (5min default) reclaims locks abandoned
by crashed compressors. Lock holder identity (pid:tid:agent:nonce) is
preserved for diagnostics via get_compression_lock_holder().

Schema bumped 13 -> 14 to track the new compression_locks table.
Reconciled additively via the existing declarative-column pattern;
no data migration needed for existing DBs.

Regression test reproduces Damien's shape: two threads racing
_compress_context on a shared parent_sid. Without the lock the test
deterministically produces 2 child sessions; with the lock, exactly 1.

Covers all six compression entry points (preflight in conversation_loop,
mid-turn fallback, hygiene compression in gateway, /compact, CLI
/compress, TUI /compress). ACP /compress was already protected by
nulling out _session_db before its compress call.

* ci: trigger rerun (transient GitHub API rate limit on CodeQL workflow)
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
…kew) (NousResearch#34475)

A process running mismatched module versions — conversation_compression.py
re-imported with the post-NousResearch#34351 lock code while a long-lived
hermes_state.SessionDB stays bound to the pre-NousResearch#34351 class in memory — has
the try_acquire_compression_lock call site but not the method. The
AttributeError it raises is NOT a sqlite3.Error, so the method's own
fail-open guard never runs; the exception escapes to the outer agent loop,
which prints the error and retries. Compression never succeeds, the token
count never drops, and the loop re-triggers compaction forever (the
'API call NousResearch#47/NousResearch#48/NousResearch#49 ... has no attribute try_acquire_compression_lock'
spin a user hit after an update).

Wrap the lock acquire so any unexpected exception fails OPEN: skip locking
and proceed with compression. Skipping the lock risks a rare
concurrent-compression session fork; an infinite no-progress loop that never
compresses at all is strictly worse. The remediation hint in the log points
at the real fix (restart / hermes update to resync the stale module).

Also guards get_compression_lock_holder against the same skew.

Adds a regression test simulating the version skew (real SessionDB wrapped
so only the lock methods raise AttributeError) — asserts _compress_context
proceeds and rotates instead of raising.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants