fix(compression): fail open when lock subsystem is missing (version skew) #34475
Merged
…kew) A process running mismatched module versions — conversation_compression.py re-imported with the post-#34351 lock code while a long-lived hermes_state.SessionDB stays bound to the pre-#34351 class in memory — has the try_acquire_compression_lock call site but not the method. The AttributeError it raises is NOT a sqlite3.Error, so the method's own fail-open guard never runs; the exception escapes to the outer agent loop, which prints the error and retries. Compression never succeeds, the token count never drops, and the loop re-triggers compaction forever (the 'API call #47/#48/#49 ... has no attribute try_acquire_compression_lock' spin a user hit after an update). Wrap the lock acquire so any unexpected exception fails OPEN: skip locking and proceed with compression. Skipping the lock risks a rare concurrent-compression session fork; an infinite no-progress loop that never compresses at all is strictly worse. The remediation hint in the log points at the real fix (restart / hermes update to resync the stale module). Also guards get_compression_lock_holder against the same skew. Adds a regression test simulating the version skew (real SessionDB wrapped so only the lock methods raise AttributeError) — asserts _compress_context proceeds and rotates instead of raising.
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request on May 31, 2026
Summary
Compression now fails OPEN when the per-session lock subsystem is missing or broken, instead of spinning the agent loop forever.
Root cause: a process running mismatched module versions (conversation_compression.py re-imported with the post-#34351 lock code while a long-lived hermes_state.SessionDB stays bound to the pre-#34351 class in memory) has the try_acquire_compression_lock call site but not the method. The AttributeError it raises is not a sqlite3.Error, so the method's own fail-open guard never runs. The exception escapes to the outer agent loop, which prints ❌ Error during ... API call #N and retries. Compression never succeeds → token count never drops → the loop re-triggers compaction forever (the #47/#48/#49 ... has no attribute try_acquire_compression_lock spin a user hit after an update).
Changes
agent/conversation_compression.py: wrap the lock acquire in try/except. Any unexpected exception logs once per session (with a "restart / hermes update to resync" hint) and proceeds without the lock. Also guards get_compression_lock_holder against the same skew.
tests/agent/test_compression_concurrent_fork.py: regression test simulating the version skew (real SessionDB wrapped so only the lock methods raise AttributeError).
Why fail open
Skipping the lock risks a rare concurrent-compression session fork (the thing #34351 prevents). An infinite no-progress loop that never compresses at all is strictly worse — and it's the symptom users actually see.
Validation
AttributeError → outer loop retries → infinite compaction spin
AttributeError: ... try_acquire_compression_lock)
Negative-control verified: reverting the guard makes the new test fail with the exact production AttributeError; restoring it passes.
tests/agent/test_compression_concurrent_fork.py + tests/test_hermes_state_compression_locks.py → 15/15 green.
Note: this is a robustness guard. The operational fix for an instance already stuck is a process restart (hermes update) to resync the stale in-memory module; the source on main is already correct.
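One way to simulate the version skew in a test, in the spirit of the regression test and negative control described under Validation, is a proxy that forwards everything to a real object except the lock methods. The sketch below is an assumption-laden illustration and not the contents of tests/agent/test_compression_concurrent_fork.py: FakeSessionDB, its get_session method, SkewedDB, and the two compress_* functions are hypothetical names.

```python
class FakeSessionDB:
    """Hypothetical stand-in for a current SessionDB with working lock methods."""

    def try_acquire_compression_lock(self, session_id):
        return True

    def get_compression_lock_holder(self, session_id):
        return None

    def get_session(self, session_id):  # hypothetical non-lock method
        return {"id": session_id}


class SkewedDB:
    """Proxy that hides only the lock methods, mimicking a stale class."""

    _MISSING = {"try_acquire_compression_lock", "get_compression_lock_holder"}

    def __init__(self, real):
        self._real = real

    def __getattr__(self, name):
        # Called only when normal lookup fails, so _real and _MISSING
        # resolve normally and never recurse into this method.
        if name in self._MISSING:
            raise AttributeError(f"'SessionDB' object has no attribute '{name}'")
        return getattr(self._real, name)


def compress_unguarded(db, session_id):
    """Negative control: the lock call is not wrapped, so skew raises."""
    if not db.try_acquire_compression_lock(session_id):
        return "skipped"
    return "compressed"


def compress_guarded(db, session_id):
    """Fail open: any unexpected error means proceed without the lock."""
    try:
        locked = db.try_acquire_compression_lock(session_id)
    except Exception:
        locked = True
    return "compressed" if locked else "skipped"
```

Against SkewedDB(FakeSessionDB()), compress_unguarded raises AttributeError while compress_guarded returns "compressed"; non-lock methods such as get_session still reach the real object.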