fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103
fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103darkworon wants to merge 2 commits into
Conversation
…n on _cache sync_turn() runs get_or_create() in a background thread while handle_tool_call() reads _cache in the main thread. Both access the same unprotected dict, causing intermittent cache misses where create_conclusion() silently fails. The fix joins the sync thread before processing any tool call, ensuring _cache is in a consistent state. Fixes NousResearch#5102
There was a problem hiding this comment.
Pull request overview
Fixes an intermittent race in the Honcho memory provider by waiting for the sync_turn background thread to finish before servicing Honcho tool calls that read session state.
Changes:
- Join the
sync_turnthread before handling Honcho tool calls to reduce concurrent access to the session cache.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # the same _cache dict. Wait for it to complete before tool calls. | ||
| # See: https://github.com/NousResearch/hermes-agent/issues/5102 | ||
| if self._sync_thread and self._sync_thread.is_alive(): | ||
| self._sync_thread.join(timeout=5.0) |
There was a problem hiding this comment.
join(timeout=5.0) does not guarantee the sync thread has finished; if it’s still alive after 5s, tool handling proceeds and the _cache race this change is meant to prevent can still occur. Consider either joining without a timeout here, or re-checking is_alive() after the join and returning a retryable error / waiting longer, or switching to a shared lock that both sync_turn() and tool-call paths use when touching the session cache.
| self._sync_thread.join(timeout=5.0) | |
| self._sync_thread.join(timeout=5.0) | |
| if self._sync_thread.is_alive(): | |
| return json.dumps( | |
| { | |
| "error": ( | |
| "Honcho is still synchronizing session state. " | |
| "Please retry the tool call." | |
| ) | |
| } | |
| ) |
…eout Address review feedback from Copilot: join(timeout=5.0) doesn't guarantee the thread finished. Now returns a retryable error if the sync thread is still alive after the timeout, preventing the cache race entirely.
|
LGTM as a minimal band-aid — the pattern matches existing code in the file (line 567 uses the same A few observations: Timeout inconsistency Line 599 uses HONCHO_SYNC_JOIN_TIMEOUT = 10.0 # module-leveland using it in both places. Keeps the behavior predictable and makes it easy to tune later. Retry message could be more actionable If sync_turn is genuinely stuck, the user will hit this repeatedly on every retry — each one waits 5s, errors, user retries, waits another 5s, etc. Consider either:
The real fix You're already aware — a Ship it, though. The race was real and users were hitting it. |
|
Related to #5102. |
Summary
Fixes a race condition in the Honcho memory plugin where
create_conclusion()intermittently fails due to concurrent access toHonchoSessionManager._cachefrom thesync_turnbackground thread and the main thread.Problem
sync_turn()spawns a background thread that callsget_or_create()on_cache(a plaindict). Meanwhile,handle_tool_call()reads_cacheviacreate_conclusion()in the main thread. Sinceget_or_create()is a multi-step read-modify-write operation, the main thread can observe_cachein an inconsistent state, causing_cache.get(session_key)to returnNoneand the conclusion to silently fail.Fix
Join the
sync_turnthread before processing any tool call inhandle_tool_call(). This ensures_cacheis in a consistent state when tool handlers access it.This is the minimal fix. A more comprehensive solution would add a
threading.Lockto all_cacheaccess inHonchoSessionManager, similar to the existing_prefetch_cache_lock.Testing
honcho_concludecalls — 2 random failuressync_turnthread typically completes in <1s, and the join is a no-op when it's not runningFixes #5102