
fix(compression): fail open when lock subsystem is missing (version skew)#34475

Merged
teknium1 merged 1 commit into main from hermes/hermes-6338475d
May 29, 2026


@teknium1

Contributor

Summary

Compression now fails OPEN when the per-session lock subsystem is missing or broken, instead of spinning the agent loop forever.

Root cause: a process running mismatched module versions. conversation_compression.py has been re-imported with the post-#34351 lock code while a long-lived hermes_state.SessionDB instance stays bound to the pre-#34351 class in memory, so the process has the try_acquire_compression_lock call site but not the method. The resulting AttributeError is not a sqlite3.Error, and the method's own fail-open guard never runs because the method itself is missing. The exception escapes to the outer agent loop, which prints ❌ Error during ... API call #N and retries. Compression never succeeds, the token count never drops, and the loop re-triggers compaction forever. This is the "API call #47/#48/#49 ... has no attribute try_acquire_compression_lock" spin a user hit after an update.
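The failure mode above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `OldSessionDB`, `NewSessionDB`, `acquire`, and `classify` are hypothetical stand-ins, and the real lock method's signature and return value are assumed. It shows only that a handler for `sqlite3.Error` does not catch the `AttributeError` raised when the method is absent.

```python
import sqlite3


class OldSessionDB:
    """Hypothetical stand-in for the pre-#34351 class: no lock methods at all."""


class NewSessionDB:
    """Hypothetical stand-in for the post-#34351 class."""

    def try_acquire_compression_lock(self, session_id):
        try:
            # Real code would talk to SQLite here; this sketch always succeeds.
            return True
        except sqlite3.Error:
            # The method's own fail-open guard. It can only run if the
            # method exists on the object in the first place.
            return True


def acquire(db, session_id):
    """Simplified call site, as the newer module would invoke it."""
    return db.try_acquire_compression_lock(session_id)


def classify(db):
    """Report which kind of outcome the call site produces for a given db."""
    try:
        acquire(db, "session-1")
        return "ok"
    except sqlite3.Error:
        return "sqlite3.Error"
    except AttributeError:
        return "AttributeError"


if __name__ == "__main__":
    print(classify(NewSessionDB()))
    print(classify(OldSessionDB()))
```

With the stale class, the attribute lookup fails before any method body runs, so no guard written inside the method can help; the protection has to live at the call site.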

Changes

  • agent/conversation_compression.py: wrap the lock acquire in try/except — any unexpected exception logs once per session (with a "restart / hermes update to resync" hint) and proceeds without the lock. Also guards get_compression_lock_holder against the same skew.
  • tests/agent/test_compression_concurrent_fork.py: regression test simulating the version skew (real SessionDB wrapped so only the lock methods raise AttributeError).

Why fail open

Skipping the lock risks a rare concurrent-compression session fork (the thing #34351 prevents). An infinite no-progress loop that never compresses at all is strictly worse — and it's the symptom users actually see.

Validation

| | Before fix | After fix |
|---|---|---|
| Lock method missing | AttributeError → outer loop retries → infinite compaction spin | logs once, proceeds, compresses, rotates |
| New regression test | FAILS (AttributeError: ... try_acquire_compression_lock) | PASSES |
| Existing lock tests (14) | pass | pass |

Negative-control verified: reverting the guard makes the new test fail with the exact production AttributeError; restoring it passes. tests/agent/test_compression_concurrent_fork.py + tests/test_hermes_state_compression_locks.py → 15/15 green.
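One way to simulate the version skew in a test is a thin wrapper that delegates everything to a real object except the lock methods. The sketch below is an assumption about how such a wrapper could be written, not the test that was merged: `SkewedDB` and `FakeDB` are hypothetical names, and `FakeDB` stands in for the real SessionDB.

```python
class SkewedDB:
    """Delegate to a real object but behave as if the lock methods never existed."""

    _HIDDEN = {"try_acquire_compression_lock", "get_compression_lock_holder"}

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # __getattr__ is only consulted when normal lookup fails, so every
        # attribute not defined on the wrapper itself arrives here.
        if name in self._HIDDEN:
            raise AttributeError(
                f"'SessionDB' object has no attribute '{name}'"
            )
        return getattr(self._inner, name)


class FakeDB:
    """Hypothetical stand-in for a real SessionDB in this sketch."""

    def try_acquire_compression_lock(self, session_id):
        return True

    def get_compression_lock_holder(self, session_id):
        return None

    def ping(self):
        return "pong"
```

Because all other attributes pass through to the wrapped object, the code under test still exercises real behaviour everywhere except the two lock calls, which fail the same way a stale in-memory class would.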

Note: this is a robustness guard. The operational fix for an instance already stuck is a process restart (hermes update) to resync the stale in-memory module — the source on main is already correct.

…kew)

A process running mismatched module versions — conversation_compression.py
re-imported with the post-#34351 lock code while a long-lived
hermes_state.SessionDB stays bound to the pre-#34351 class in memory — has
the try_acquire_compression_lock call site but not the method. The
AttributeError it raises is NOT a sqlite3.Error, so the method's own
fail-open guard never runs; the exception escapes to the outer agent loop,
which prints the error and retries. Compression never succeeds, the token
count never drops, and the loop re-triggers compaction forever (the
'API call #47/#48/#49 ... has no attribute try_acquire_compression_lock'
spin a user hit after an update).

Wrap the lock acquire so any unexpected exception fails OPEN: skip locking
and proceed with compression. Skipping the lock risks a rare
concurrent-compression session fork; an infinite no-progress loop that never
compresses at all is strictly worse. The remediation hint in the log points
at the real fix (restart / hermes update to resync the stale module).

Also guards get_compression_lock_holder against the same skew.

Adds a regression test simulating the version skew (real SessionDB wrapped
so only the lock methods raise AttributeError) — asserts _compress_context
proceeds and rotates instead of raising.
@alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels on May 29, 2026
@teknium1 merged commit db2ce9e into main on May 29, 2026
20 of 24 checks passed
@teknium1 deleted the hermes/hermes-6338475d branch on May 29, 2026 08:32
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
