Skip to content

fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103

Open
darkworon wants to merge 2 commits into
NousResearch:mainfrom
darkworon:fix/honcho-session-cache-race
Open

fix: join sync_turn thread before tool calls to prevent race condition on _cache#5103
darkworon wants to merge 2 commits into
NousResearch:mainfrom
darkworon:fix/honcho-session-cache-race

Conversation

@darkworon

Copy link
Copy Markdown

Summary

Fixes a race condition in the Honcho memory plugin where create_conclusion() intermittently fails due to concurrent access to HonchoSessionManager._cache from the sync_turn background thread and the main thread.

Problem

sync_turn() spawns a background thread that calls get_or_create() on _cache (a plain dict). Meanwhile, handle_tool_call() reads _cache via create_conclusion() in the main thread. Since get_or_create() is a multi-step read-modify-write operation, the main thread can observe _cache in an inconsistent state, causing _cache.get(session_key) to return None and the conclusion to silently fail.

Fix

Join the sync_turn thread before processing any tool call in handle_tool_call(). This ensures _cache is in a consistent state when tool handlers access it.

This is the minimal fix. A more comprehensive solution would add a threading.Lock to all _cache access in HonchoSessionManager, similar to the existing _prefetch_cache_lock.

Testing

  • Reproduced with 13 consecutive honcho_conclude calls — 2 random failures
  • After fix: all calls succeed consistently
  • No performance impact: sync_turn thread typically completes in <1s, and the join is a no-op when it's not running

Fixes #5102

…n on _cache

sync_turn() runs get_or_create() in a background thread while
handle_tool_call() reads _cache in the main thread. Both access the
same unprotected dict, causing intermittent cache misses where
create_conclusion() silently fails.

The fix joins the sync thread before processing any tool call,
ensuring _cache is in a consistent state.

Fixes NousResearch#5102
Copilot AI review requested due to automatic review settings April 4, 2026 19:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes an intermittent race in the Honcho memory provider by waiting for the sync_turn background thread to finish before servicing Honcho tool calls that read session state.

Changes:

  • Join the sync_turn thread before handling Honcho tool calls to reduce concurrent access to the session cache.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# the same _cache dict. Wait for it to complete before tool calls.
# See: https://github.com/NousResearch/hermes-agent/issues/5102
if self._sync_thread and self._sync_thread.is_alive():
self._sync_thread.join(timeout=5.0)

Copilot AI Apr 4, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

join(timeout=5.0) does not guarantee the sync thread has finished; if it’s still alive after 5s, tool handling proceeds and the _cache race this change is meant to prevent can still occur. Consider either joining without a timeout here, or re-checking is_alive() after the join and returning a retryable error / waiting longer, or switching to a shared lock that both sync_turn() and tool-call paths use when touching the session cache.

Suggested change
self._sync_thread.join(timeout=5.0)
self._sync_thread.join(timeout=5.0)
if self._sync_thread.is_alive():
return json.dumps(
{
"error": (
"Honcho is still synchronizing session state. "
"Please retry the tool call."
)
}
)

Copilot uses AI. Check for mistakes.
…eout

Address review feedback from Copilot: join(timeout=5.0) doesn't guarantee
the thread finished. Now returns a retryable error if the sync thread is
still alive after the timeout, preventing the cache race entirely.
@trevorgordon981

Copy link
Copy Markdown

LGTM as a minimal band-aid — the pattern matches existing code in the file (line 567 uses the same is_alive() + join(timeout=5.0) guard on sync_turn itself, and line 599 uses 10s in what looks like the conclusion path). Fix is consistent.

A few observations:

Timeout inconsistency

Line 599 uses join(timeout=10.0) for what appears to be the same underlying concern, but this PR uses 5.0 in handle_tool_call. If sync_turn legitimately takes 6-9s on a slow network (Honcho is remote, users on mobile tethering etc.), a tool call will hit this new error while the other path would succeed. Worth aligning to a single timeout constant, e.g.:

HONCHO_SYNC_JOIN_TIMEOUT = 10.0  # module-level

and using it in both places. Keeps the behavior predictable and makes it easy to tune later.

Retry message could be more actionable

"Honcho is still synchronizing session state. Please retry the tool call."

If sync_turn is genuinely stuck, the user will hit this repeatedly on every retry — each one waits 5s, errors, user retries, waits another 5s, etc. Consider either:

  • Adding an approximate wait hint: "...Please wait ~10s before retrying."
  • Or logging the elapsed sync_turn time server-side so operators can spot stuck syncs

The real fix

You're already aware — a threading.Lock around _cache access would eliminate this class of bug. The _prefetch_cache_lock pattern already in the codebase is the right template. Whenever someone has the cycles for that refactor, this band-aid can go away.

Ship it, though. The race was real and users were hitting it.

@alt-glitch alt-glitch added type/bug Something isn't working comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have labels May 1, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #5102 (closed). Note: #17005 already merged a drain-queue fix for Hindsight; this targets the Honcho plugin specifically. P3 per plugin-provider-demotion rule.

@alt-glitch

Copy link
Copy Markdown
Collaborator

Related to #5102.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Honcho plugin: race condition in session cache causes silent conclusion/memory loss

4 participants