fix(memory): run end-of-turn sync off the turn thread (agent stuck 'running') #41945

Merged
teknium1 merged 1 commit into main from fix/async-memory-sync-running-state
Jun 8, 2026
teknium1 commented Jun 8, 2026

Contributor

Summary

A slow or misconfigured external memory provider no longer holds the agent in the "running" state after it has already delivered its final response.

Root cause: MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch inline on the turn-completion path. A provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Every interface (CLI, TUI, gateway) marks the agent "running" until run_conversation returns — so the agent stayed busy for the full block, and any follow-up message triggered an aggressive interrupt that dropped the message.
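
The blocking behaviour described above can be reproduced in miniature. This is a minimal sketch, not the project's code: `SlowProvider`, `sync_all_inline`, and `run_conversation_inline` are hypothetical names, and the two-argument `sync_turn` signature is an assumption made for illustration.

```python
import time


class SlowProvider:
    """Hypothetical provider whose sync_turn blocks, standing in for a wedged daemon."""

    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s
        self.synced = []

    def sync_turn(self, user_msg: str, assistant_msg: str) -> None:
        time.sleep(self.delay_s)  # simulated blocking network/daemon call
        self.synced.append((user_msg, assistant_msg))


def sync_all_inline(providers, user_msg, assistant_msg):
    """Pre-fix shape: every provider is called on the caller's thread."""
    for provider in providers:
        try:
            provider.sync_turn(user_msg, assistant_msg)
        except Exception as exc:  # a failing provider should not break the turn
            print(f"sync failed: {exc}")


def run_conversation_inline(providers):
    """Stand-in for the turn path: the response is ready, then sync runs inline."""
    response = "final response"
    sync_all_inline(providers, "hi", response)  # the caller is held here
    return response


provider = SlowProvider(delay_s=0.2)
start = time.monotonic()
run_conversation_inline([provider])
elapsed = time.monotonic() - start
print(f"run_conversation returned after {elapsed:.2f}s")
```

The response exists before the sync starts, yet the caller only gets it back once the provider finishes, which is the window in which the interfaces still show "running".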

Changes

  • agent/memory_manager.py: dispatch sync_all / queue_prefetch_all provider work to a lazily-created single-worker background executor. Both return immediately; work completes (or fails, logged) in the background.
    • Single worker serializes writes so turn N lands before turn N+1.
    • flush_pending(timeout) — barrier for session boundaries / deterministic tests.
    • shutdown_all() drains the executor with a bounded timeout (_SYNC_DRAIN_TIMEOUT_S = 5s) so a wedged provider can't hang teardown.
    • Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).
  • tests/agent/test_memory_async_sync.py: new regression suite (non-blocking dispatch, background completion, bounded teardown, write ordering, no-executor path).
  • tests/agent/test_memory_provider.py, tests/agent/test_memory_session_switch.py: add flush_pending() barrier before asserting provider state (sync is now async).
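
The dispatch pattern in the list above might look roughly like the following. This is a sketch under assumptions, not the code in agent/memory_manager.py: `SketchMemoryManager`, `RecordingProvider`, the attribute names, and the `sync_turn` signature are all hypothetical, and `queue_prefetch_all` and `shutdown_all` are omitted.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeoutError

logger = logging.getLogger("memory_sketch")


class SketchMemoryManager:
    """Illustrative stand-in; not the real MemoryManager."""

    def __init__(self, providers):
        self._providers = list(providers)
        self._executor = None
        self._lock = threading.Lock()

    def _get_executor(self):
        # Lazy creation with double-checked locking: the pool (and its
        # thread) only exists once there is work to dispatch.
        if self._executor is None:
            with self._lock:
                if self._executor is None:
                    self._executor = ThreadPoolExecutor(
                        max_workers=1, thread_name_prefix="memory-sync"
                    )
        return self._executor

    def _sync_providers(self, user_msg, assistant_msg):
        for provider in self._providers:
            try:
                provider.sync_turn(user_msg, assistant_msg)
            except Exception:
                logger.exception("memory provider sync failed")

    def sync_all(self, user_msg, assistant_msg):
        if not self._providers:
            return  # no external providers: no executor, no extra thread
        # One worker plus a FIFO queue means turn N runs before turn N+1.
        self._get_executor().submit(self._sync_providers, user_msg, assistant_msg)

    def flush_pending(self, timeout=None):
        """Barrier: wait until everything queued so far has run."""
        executor = self._executor
        if executor is None:
            return True
        try:
            # The sentinel runs only after all earlier tasks on the single worker.
            executor.submit(lambda: None).result(timeout=timeout)
            return True
        except FutureTimeoutError:
            return False


class RecordingProvider:
    """Hypothetical slow provider that records the order of writes."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.synced = []

    def sync_turn(self, user_msg, assistant_msg):
        time.sleep(self.delay_s)
        self.synced.append((user_msg, assistant_msg))


provider = RecordingProvider(delay_s=0.3)
manager = SketchMemoryManager([provider])

start = time.monotonic()
manager.sync_all("q1", "a1")
manager.sync_all("q2", "a2")
dispatch_elapsed = time.monotonic() - start  # time spent on the caller's thread

flushed = manager.flush_pending(timeout=5)

builtin_only = SketchMemoryManager([])
builtin_only.sync_all("q", "a")  # returns without creating an executor
```

Tests that assert on provider state right after `sync_all` would need the `flush_pending` barrier first, which matches the adjustments made to the existing test files.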

Validation

| | Before | After |
|---|---|---|
| sync_all with provider blocking 3s | caller blocks 3s (turn held "running") | returns in 0.000s; work completes in background |
| shutdown_all with wedged (30s) provider | could hang teardown | returns in 5.00s (bounded) |
| Targeted memory tests | 120 passed | 120 passed (+ 6 new) |

Fixes the primary symptom in the community report (agent stuck "running" 3+ min across CLI / Desktop / Telegram even on a clean turn). The cascading-interrupt / dropped-message symptoms are a separate code path (see #6600).

Infographic

[Image: async-memory-sync]

A misconfigured/slow external memory provider could hold the agent in
the 'running' state for minutes after the final response was delivered.
MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn /
queue_prefetch INLINE on the turn-completion path; a provider making a
blocking network/daemon call (a broken Hindsight daemon was observed
blocking ~298s before failing) blocked run_conversation from returning.
Because every interface (CLI, TUI, gateway) marks the agent 'running'
until run_conversation returns, the agent stayed busy for the full block
and any follow-up message triggered an aggressive interrupt that dropped
the message.

Dispatch provider sync/prefetch to a lazily-created single-worker
background executor. sync_all / queue_prefetch_all return immediately;
work completes (or fails, logged) in the background. A single worker
serializes writes so turn N lands before turn N+1. flush_pending()
provides a barrier for session boundaries and deterministic tests.
shutdown_all() drains the executor with a bounded timeout so a wedged
provider can never hang teardown.

Builtin-only / no-provider sessions spawn no executor (zero new threads
in the common case).

alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), and tool/memory (Memory tool and memory providers) on Jun 8, 2026

@liuhao1024

Contributor

Code Review: Clean — no issues found.

Reviewed the full diff (agent/memory_manager.py + 3 test files, 138-line new test suite). The async memory sync design is solid:

  1. Single-worker ThreadPoolExecutor — max_workers=1 serializes writes so turn N lands before turn N+1. Lazy creation avoids spawning threads for builtin-only sessions (no providers means no executor).

  2. _submit_background fallback — if the executor is unavailable (already drained / creation failed), work runs inline. A RuntimeError catch handles the get-submit race during teardown. Writes are never silently dropped.

  3. _drain_sync_executor — stops accepting new work and cancels queued tasks. A daemon watcher thread does a bounded wait (5s) via join. Wedged providers cannot hang teardown — they die with the interpreter.

  4. flush_pending — submits a sentinel lambda and waits on its future. Single-worker semantics guarantee all prior tasks have completed. Used by tests for deterministic assertions and by session boundaries.

  5. Thread safety — _sync_executor_lock guards lazy init and teardown. The double-checked locking in _get_sync_executor is correct (check outside lock, acquire, check inside).

  6. Existing test updates — all existing test_memory_provider.py and test_memory_session_switch.py tests now call flush_pending(timeout=5) after sync/prefetch calls, preserving their assertions while adapting to async dispatch.
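
Points 2 and 3 of the review can be sketched with the standard library alone. The names `submit_background` and `drain_executor` below are hypothetical stand-ins for `_submit_background` and `_drain_sync_executor`, whose real bodies are not shown in this thread. The demo uses a 0.2s bound instead of 5s, and releases the stuck call at the end so the script can exit, because a regular `ThreadPoolExecutor` worker is joined at interpreter exit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def submit_background(executor, fn, *args):
    """Run fn on the executor, or inline if the executor is missing or shut down."""
    if executor is not None:
        try:
            executor.submit(fn, *args)
            return "background"
        except RuntimeError:
            pass  # executor was shut down between lookup and submit
    fn(*args)  # fall back to inline so the write is not silently dropped
    return "inline"


def drain_executor(executor, timeout_s):
    """Stop accepting work, cancel queued tasks, and wait at most timeout_s.

    Returns True if the in-flight task finished within the bound.
    """
    def _shutdown():
        # cancel_futures requires Python 3.9+; wait=True blocks this
        # watcher thread, not the caller.
        executor.shutdown(wait=True, cancel_futures=True)

    watcher = threading.Thread(target=_shutdown, name="memory-sync-drain", daemon=True)
    watcher.start()
    watcher.join(timeout_s)  # bounded wait on the caller's thread
    return not watcher.is_alive()


started = threading.Event()
release = threading.Event()


def wedged_call():
    """Stands in for a provider call that never returns on its own."""
    started.set()
    release.wait()


executor = ThreadPoolExecutor(max_workers=1)
executor.submit(wedged_call)
started.wait(timeout=5)  # make sure the worker is busy before queueing more
queued = executor.submit(lambda: "never runs")

start = time.monotonic()
drained = drain_executor(executor, timeout_s=0.2)
elapsed = time.monotonic() - start

late_writes = []
mode = submit_background(executor, late_writes.append, "turn-3")

release.set()  # demo only: unblock the worker so the script can exit
```

In this run the drain gives up after the bound with the wedged call still in flight, the queued task is cancelled, and the late write runs inline because the executor no longer accepts submissions. The inline path trades the non-blocking guarantee for not losing the write.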

@teknium1 teknium1 merged commit aa6f277 into main Jun 8, 2026
23 checks passed
@teknium1 teknium1 deleted the fix/async-memory-sync-running-state branch June 8, 2026 09:19
a249169329-cpu pushed a commit to a249169329-cpu/hermes-agent that referenced this pull request Jun 9, 2026
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026