Skip to content

fix(honcho): thread-safe session cache via RLock#13510

Closed
hekaru-agent wants to merge 1 commit into
NousResearch:mainfrom
hekaru-agent:fix/honcho-session-thread-safety
Closed

fix(honcho): thread-safe session cache via RLock#13510
hekaru-agent wants to merge 1 commit into
NousResearch:mainfrom
hekaru-agent:fix/honcho-session-thread-safety

Conversation

@hekaru-agent

Copy link
Copy Markdown
Contributor

Summary

Adds threading.RLock() to protect HonchoSessionManager._cache access against data races between the main thread and the async writer loop.

Problem

HonchoSessionManager uses an async writer loop (_async_writer_loop) that writes messages to Honcho's persistence layer, while other threads call get_or_create(), _flush_session(), delete(), and new_session() — all of which read/write self._cache without synchronization.

Python's dict is not thread-safe for concurrent read-write. The race manifests as:

  • Dictionary changed size during iteration
  • Lost updates (session created but never cached, or session returned as None after write)
  • In rare conditions: segfault-level Python crash

Fix

self._cache_lock = threading.RLock()

All accesses to self._cache are now wrapped:

Location Lock scope
get_or_create (cache hit) Read-only fast path
get_or_create (cache miss — write) Write-after-I/O outside lock
_flush_session Read-then-conditional-write
flush_all Iterating a snapshot copy
delete Remove
new_session Remove old + write new

Lock design decision: I/O operations (peer lookups, Honcho API calls) are kept outside the lock to avoid serializing expensive operations. Only the cache read/write is locked. Concurrent cache misses will race on Honcho session creation — but Honcho's own persistence is the source of truth, so this is safe.

Testing

  • Syntax validated: python3 -m py_compile plugins/memory/honcho/session.py
  • Confirmed upstream introduced runtime_user_peer_name param after our lock change; merged cleanly
  • No behavioral changes to the public API

Files changed

  • plugins/memory/honcho/session.py — RLock addition + 6 synchronized cache access sites

…ache

- cron/jobs.py: remove_job() now cleans output directory to prevent
  orphaned ~/.hermes/cron/output/{job_id}/ dirs accumulating
- plugins/memory/honcho/session.py: protect _cache with RLock; both
  main thread and async writer loop access it without synchronization
  Lock scope: cache reads/writes only; I/O (peer lookups, Honcho calls)
  stays outside lock to avoid serializing expensive operations
@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins labels Apr 22, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #5102 (Honcho session cache race condition) and #5103 (alternative fix via join sync_turn thread).

1 similar comment
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #5102 (Honcho session cache race condition) and #5103 (alternative fix via join sync_turn thread).

erosika pushed a commit to erosika/hermes-agent that referenced this pull request Apr 24, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
erosika added a commit to erosika/hermes-agent that referenced this pull request Apr 24, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
erosika pushed a commit to erosika/hermes-agent that referenced this pull request Apr 27, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
erosika added a commit to erosika/hermes-agent that referenced this pull request Apr 27, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
kshitijk4poor pushed a commit that referenced this pull request Apr 27, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from #13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
kshitijk4poor pushed a commit that referenced this pull request Apr 27, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to #13510 (@hekaru-agent).
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
donald131 pushed a commit to donald131/hermes-agent that referenced this pull request May 2, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
teknium1 pushed a commit that referenced this pull request May 8, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from #13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit dad0217 so this lands the remaining
cron cleanup hunk on its own.
@teknium1

teknium1 commented May 8, 2026

Copy link
Copy Markdown
Contributor

Thanks @hekaru-agent! Both halves of this PR are now on main:

Both contributions fully absorbed. Appreciate the thread-safety catch.

RationallyPrime pushed a commit to RationallyPrime/hermes-agent that referenced this pull request May 8, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit dad0217 so this lands the remaining
cron cleanup hunk on its own.
JZKK720 pushed a commit to JZKK720/hermes-agent that referenced this pull request May 11, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit dad0217 so this lands the remaining
cron cleanup hunk on its own.
rmulligan pushed a commit to rmulligan/hermes-agent that referenced this pull request May 11, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit dad0217 so this lands the remaining
cron cleanup hunk on its own.
JinyuID pushed a commit to JinyuID/hermes-agent that referenced this pull request May 11, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit 6dcd66d so this lands the remaining
cron cleanup hunk on its own.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
jsboige pushed a commit to jsboige/hermes-agent that referenced this pull request May 14, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit d418f45 so this lands the remaining
cron cleanup hunk on its own.
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
dannyJ848 pushed a commit to dannyJ848/hermes-agent that referenced this pull request May 17, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit dad0217 so this lands the remaining
cron cleanup hunk on its own.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Wraps _session_cache mutations in threading.RLock. Without this, concurrent
gateway sessions (e.g., Telegram + Discord hitting Honcho at the same time)
can race on the cache and silently lose conclusions or memory writes.

Adopted from NousResearch#13510 by @hekaru-agent; the off-topic cron/jobs.py cleanup
hunk from that PR is dropped here for scope isolation. Resolved a small
conflict with the pinPeerName guard (kept both).
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
new_session() was popping the old cached session, releasing the lock,
calling get_or_create, then re-acquiring the lock to insert. A concurrent
caller could observe the empty-cache window and race-create its own
session, producing two divergent session objects for the same key.

_cache_lock is an RLock, so nested reacquisition inside get_or_create is
safe. Hold it across the whole pop/create/insert sequence.

Follow-up to NousResearch#13510 (@hekaru-agent).
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
remove_job() deletes the job from cron/jobs.json but leaves the per-job
output directory at ~/.hermes/cron/output/{job_id}/ behind. Over time
this accumulates orphaned dirs that never get reclaimed.

Adopted from NousResearch#13510 by @hekaru-agent; the honcho RLock half of that PR
was already salvaged in commit e991cf0 so this lands the remaining
cron cleanup hunk on its own.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants