fix(compression): prevent session-id fork from concurrent compressions#34351
Merged
Conversation
When two AIAgent instances share the same session_id (most commonly the parent-turn agent and its background-review fork, which inherits session_id verbatim via background_review.py L451), both can call compress_context() on overlapping snapshots of the same conversation. Each ends the parent and creates its own NEW child session in state.db, both parented to the same old id. The gateway SessionEntry only catches one rotation; the other becomes an orphan that silently accumulates writes — Damien's incident shape (parent 20260527_234659_e65f0e → two children, only one visible). Adds a state.db-backed per-session compression lock. Acquired before the rotation in conversation_compression.compress_context(); on failure, the caller returns messages unchanged so the auto-compress retry loop stops cleanly. TTL (5min default) reclaims locks abandoned by crashed compressors. Lock holder identity (pid:tid:agent:nonce) is preserved for diagnostics via get_compression_lock_holder(). Schema bumped 13 -> 14 to track the new compression_locks table. Reconciled additively via the existing declarative-column pattern; no data migration needed for existing DBs. Regression test reproduces Damien's shape: two threads racing _compress_context on a shared parent_sid. Without the lock the test deterministically produces 2 child sessions; with the lock, exactly 1. Covers all six compression entry points (preflight in conversation_loop, mid-turn fallback, hygiene compression in gateway, /compact, CLI /compress, TUI /compress). ACP /compress was already protected by nulling out _session_db before its compress call.
Contributor
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-import |
2 |
unresolved-attribute |
2 |
First entries
tests/agent/test_compression_concurrent_fork.py:37: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/agent/test_compression_concurrent_fork.py:85: [unresolved-attribute] unresolved-attribute: Attribute `execute` is not defined on `None` in union `Connection | None`
tests/agent/test_compression_concurrent_fork.py:79: [unresolved-attribute] unresolved-attribute: Unresolved attribute `context_compressor` on type `AIAgent`
tests/test_hermes_state_compression_locks.py:19: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
✅ Fixed issues: none
Unchanged: 5052 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
teknium1
added a commit
that referenced
this pull request
May 29, 2026
…kew) (#34475) A process running mismatched module versions — conversation_compression.py re-imported with the post-#34351 lock code while a long-lived hermes_state.SessionDB stays bound to the pre-#34351 class in memory — has the try_acquire_compression_lock call site but not the method. The AttributeError it raises is NOT a sqlite3.Error, so the method's own fail-open guard never runs; the exception escapes to the outer agent loop, which prints the error and retries. Compression never succeeds, the token count never drops, and the loop re-triggers compaction forever (the 'API call #47/#48/#49 ... has no attribute try_acquire_compression_lock' spin a user hit after an update). Wrap the lock acquire so any unexpected exception fails OPEN: skip locking and proceed with compression. Skipping the lock risks a rare concurrent-compression session fork; an infinite no-progress loop that never compresses at all is strictly worse. The remediation hint in the log points at the real fix (restart / hermes update to resync the stale module). Also guards get_compression_lock_holder against the same skew. Adds a regression test simulating the version skew (real SessionDB wrapped so only the lock methods raise AttributeError) — asserts _compress_context proceeds and rotates instead of raising.
KKT-OPT
pushed a commit
to KKT-OPT/hermes-agent
that referenced
this pull request
May 31, 2026
NousResearch#34351) * fix(compression): prevent session-id fork from concurrent compressions When two AIAgent instances share the same session_id (most commonly the parent-turn agent and its background-review fork, which inherits session_id verbatim via background_review.py L451), both can call compress_context() on overlapping snapshots of the same conversation. Each ends the parent and creates its own NEW child session in state.db, both parented to the same old id. The gateway SessionEntry only catches one rotation; the other becomes an orphan that silently accumulates writes — Damien's incident shape (parent 20260527_234659_e65f0e → two children, only one visible). Adds a state.db-backed per-session compression lock. Acquired before the rotation in conversation_compression.compress_context(); on failure, the caller returns messages unchanged so the auto-compress retry loop stops cleanly. TTL (5min default) reclaims locks abandoned by crashed compressors. Lock holder identity (pid:tid:agent:nonce) is preserved for diagnostics via get_compression_lock_holder(). Schema bumped 13 -> 14 to track the new compression_locks table. Reconciled additively via the existing declarative-column pattern; no data migration needed for existing DBs. Regression test reproduces Damien's shape: two threads racing _compress_context on a shared parent_sid. Without the lock the test deterministically produces 2 child sessions; with the lock, exactly 1. Covers all six compression entry points (preflight in conversation_loop, mid-turn fallback, hygiene compression in gateway, /compact, CLI /compress, TUI /compress). ACP /compress was already protected by nulling out _session_db before its compress call. * ci: trigger rerun (transient GitHub API rate limit on CodeQL workflow)
KKT-OPT
pushed a commit
to KKT-OPT/hermes-agent
that referenced
this pull request
May 31, 2026
…kew) (NousResearch#34475) A process running mismatched module versions — conversation_compression.py re-imported with the post-NousResearch#34351 lock code while a long-lived hermes_state.SessionDB stays bound to the pre-NousResearch#34351 class in memory — has the try_acquire_compression_lock call site but not the method. The AttributeError it raises is NOT a sqlite3.Error, so the method's own fail-open guard never runs; the exception escapes to the outer agent loop, which prints the error and retries. Compression never succeeds, the token count never drops, and the loop re-triggers compaction forever (the 'API call NousResearch#47/NousResearch#48/NousResearch#49 ... has no attribute try_acquire_compression_lock' spin a user hit after an update). Wrap the lock acquire so any unexpected exception fails OPEN: skip locking and proceed with compression. Skipping the lock risks a rare concurrent-compression session fork; an infinite no-progress loop that never compresses at all is strictly worse. The remediation hint in the log points at the real fix (restart / hermes update to resync the stale module). Also guards get_compression_lock_holder against the same skew. Adds a regression test simulating the version skew (real SessionDB wrapped so only the lock methods raise AttributeError) — asserts _compress_context proceeds and rotates instead of raising.
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a state.db-backed per-session compression lock so two AIAgent instances sharing the same
session_idcan no longer both rotate it concurrently — eliminating the transcript-fork class that produced Damien's "parent → two child sessions, only one visible" incident.The race (reproduced + repro shipped as a regression test)
conversation_loop.py:4571fires_spawn_background_review(messages_snapshot=list(messages)). The fork inheritsagent.session_idverbatim (agent/background_review.py:451:review_agent.session_id = agent.session_id).run_conversation()hits preflight compression (agent/conversation_loop.py:604) and calls_compress_context. Inside,conversation_compression.py:381-389ends the old session and creates a new one rooted at the parent's id.Background process proc_… completednotification. The gateway reads the oldsession_entry.session_id(the fork's rotation lives only inside the fork's AIAgent instance), starts a new turn on the same id, hits its own preflight, rotates again.state.dbnow has TWO orphan children pointing at the same parent. The gateway SessionEntry only caught one rotation. The other child silently accumulates writes.The fix
New table
compression_locks (session_id PK, holder, acquired_at, expires_at)reconciled in via the existing declarative pattern (SCHEMA_VERSION 13 → 14). Three primitives onSessionDB:try_acquire_compression_lock(session_id, holder, ttl_seconds=300)— atomic DELETE-expired + INSERT-or-IGNORE + SELECT-confirm, all in one_execute_write. Reclaims expired locks transparently so a crashed compressor doesn't permanently block the session.release_compression_lock(session_id, holder)— DELETE WHERE matches; idempotent, holder-checked.get_compression_lock_holder(session_id)— diagnostic only.Wired into
agent/conversation_compression.py::compress_contextkeyed on the OLD session_id (before rotation). Loser of the race returnsmessagesunchanged, hits the existinglen(returned) == len(input)short-circuit in auto-compress callers, and emits a one-time⚠ Skipping concurrent compressionwarning. Winner releases the lock after the rotation + all post-rotation bookkeeping (memory manager, context engine, file dedup) completes.Scope (all six compression entry points share the same code path)
conversation_looppreflightcompress_context)conversation_loopmid-turn fallback (4 sites)gateway/run.pyhygiene compressiongateway/run.py/compactcli.pyCLI/compresstui_gateway/server.pyTUI/compressacp_adapter/server.pyACP/compress_session_dbbefore the callbackground_reviewforkcompress_context)Validation
tests/test_hermes_state_compression_locks.py(new)tests/agent/test_compression_concurrent_fork.py(new, Damien repro shape)tests/agent/test_context_compressor*.py,test_compress_focus.py,test_compression_*.pytests/test_hermes_state.py,tests/gateway/test_session_hygiene.py,test_compression_session_id_persistence.pytests/run_agent/test_background_review*.py,tests/cli/test_manual_compress.pytests/acp/test_server.pySchema
SCHEMA_VERSION 13 → 14. Existing DBs reconcile additively — no data migration, no behaviour change for sessions that aren't compressing. The new table is empty for everyone until a compression actually happens.Infographic
Companion to #34310
#34310 made the fallback survivable when summary generation fails. This PR closes the orthogonal fork class — same incident report (Damien, 2026-05-28), different mechanism. The two together fully resolve his Case 1 + Case 2.