fix: join sync_turn thread before tool calls to prevent race condition on _cache by darkworon · Pull Request #5103 · NousResearch/hermes-agent

darkworon · 2026-04-04T19:39:58Z

Summary

Fixes a race condition in the Honcho memory plugin where create_conclusion() intermittently fails due to concurrent access to HonchoSessionManager._cache from the sync_turn background thread and the main thread.

Problem

sync_turn() spawns a background thread that calls get_or_create() on _cache (a plain dict). Meanwhile, handle_tool_call() reads _cache via create_conclusion() in the main thread. Since get_or_create() is a multi-step read-modify-write operation, the main thread can observe _cache in an inconsistent state, causing _cache.get(session_key) to return None and the conclusion to silently fail.

Fix

Join the sync_turn thread before processing any tool call in handle_tool_call(). This ensures _cache is in a consistent state when tool handlers access it.

This is the minimal fix. A more comprehensive solution would add a threading.Lock to all _cache access in HonchoSessionManager, similar to the existing _prefetch_cache_lock.

Testing

Reproduced with 13 consecutive honcho_conclude calls — 2 random failures
After fix: all calls succeed consistently
No performance impact: sync_turn thread typically completes in <1s, and the join is a no-op when it's not running

Fixes #5102

…n on _cache sync_turn() runs get_or_create() in a background thread while handle_tool_call() reads _cache in the main thread. Both access the same unprotected dict, causing intermittent cache misses where create_conclusion() silently fails. The fix joins the sync thread before processing any tool call, ensuring _cache is in a consistent state. Fixes NousResearch#5102

Copilot

Pull request overview

Fixes an intermittent race in the Honcho memory provider by waiting for the sync_turn background thread to finish before servicing Honcho tool calls that read session state.

Changes:

Join the sync_turn thread before handling Honcho tool calls to reduce concurrent access to the session cache.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-04T19:42:37Z

+        # the same _cache dict. Wait for it to complete before tool calls.
+        # See: https://github.com/NousResearch/hermes-agent/issues/5102
+        if self._sync_thread and self._sync_thread.is_alive():
+            self._sync_thread.join(timeout=5.0)


join(timeout=5.0) does not guarantee the sync thread has finished; if it’s still alive after 5s, tool handling proceeds and the _cache race this change is meant to prevent can still occur. Consider either joining without a timeout here, or re-checking is_alive() after the join and returning a retryable error / waiting longer, or switching to a shared lock that both sync_turn() and tool-call paths use when touching the session cache.

Suggested change

self._sync_thread.join(timeout=5.0)

self._sync_thread.join(timeout=5.0)

if self._sync_thread.is_alive():

return json.dumps(

{

"error": (

"Honcho is still synchronizing session state. "

"Please retry the tool call."

)

}

)

…eout Address review feedback from Copilot: join(timeout=5.0) doesn't guarantee the thread finished. Now returns a retryable error if the sync thread is still alive after the timeout, preventing the cache race entirely.

trevorgordon981 · 2026-04-04T22:04:35Z

LGTM as a minimal band-aid — the pattern matches existing code in the file (line 567 uses the same is_alive() + join(timeout=5.0) guard on sync_turn itself, and line 599 uses 10s in what looks like the conclusion path). Fix is consistent.

A few observations:

Timeout inconsistency

Line 599 uses join(timeout=10.0) for what appears to be the same underlying concern, but this PR uses 5.0 in handle_tool_call. If sync_turn legitimately takes 6-9s on a slow network (Honcho is remote, users on mobile tethering etc.), a tool call will hit this new error while the other path would succeed. Worth aligning to a single timeout constant, e.g.:

HONCHO_SYNC_JOIN_TIMEOUT = 10.0  # module-level

and using it in both places. Keeps the behavior predictable and makes it easy to tune later.

Retry message could be more actionable

"Honcho is still synchronizing session state. Please retry the tool call."

If sync_turn is genuinely stuck, the user will hit this repeatedly on every retry — each one waits 5s, errors, user retries, waits another 5s, etc. Consider either:

Adding an approximate wait hint: "...Please wait ~10s before retrying."
Or logging the elapsed sync_turn time server-side so operators can spot stuck syncs

The real fix

You're already aware — a threading.Lock around _cache access would eliminate this class of bug. The _prefetch_cache_lock pattern already in the codebase is the right template. Whenever someone has the cycles for that refactor, this band-aid can go away.

Ship it, though. The race was real and users were hitting it.

alt-glitch · 2026-05-01T11:26:24Z

Related to #5102 (closed). Note: #17005 already merged a drain-queue fix for Hindsight; this targets the Honcho plugin specifically. P3 per plugin-provider-demotion rule.

alt-glitch · 2026-05-01T11:27:01Z

Related to #5102.

Copilot AI review requested due to automatic review settings April 4, 2026 19:39

Copilot started reviewing on behalf of darkworon April 4, 2026 19:40 View session

Copilot AI reviewed Apr 4, 2026

View reviewed changes

alt-glitch mentioned this pull request Apr 22, 2026

fix(honcho): thread-safe session cache via RLock #13510

Closed

alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103

fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103
darkworon wants to merge 2 commits into
NousResearch:mainfrom
darkworon:fix/honcho-session-cache-race

darkworon commented Apr 4, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 4, 2026

Uh oh!

trevorgordon981 commented Apr 4, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-            self._sync_thread.join(timeout=5.0)
+            self._sync_thread.join(timeout=5.0)
+            if self._sync_thread.is_alive():
+                return json.dumps(
+                    {
+                        "error": (
+                            "Honcho is still synchronizing session state. "
+                            "Please retry the tool call."
+                        )
+                    }
+                )

Conversation

darkworon commented Apr 4, 2026

Summary

Problem

Fix

Testing

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 4, 2026

Choose a reason for hiding this comment

Uh oh!

trevorgordon981 commented Apr 4, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

alt-glitch commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants