fix(memory): run end-of-turn sync off the turn thread (agent stuck 'running') by teknium1 · Pull Request #41945 · NousResearch/hermes-agent

teknium1 · 2026-06-08T08:50:08Z

Summary

A slow or misconfigured external memory provider no longer holds the agent in the "running" state after it has already delivered its final response.

Root cause: MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch inline on the turn-completion path. A provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Every interface (CLI, TUI, gateway) marks the agent "running" until run_conversation returns — so the agent stayed busy for the full block, and any follow-up message triggered an aggressive interrupt that dropped the message.
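The blocking shape described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: `SlowProvider`, `sync_all_inline`, `run_conversation_inline`, and the one-argument `sync_turn(turn)` signature are hypothetical stand-ins, and the delay is a short `time.sleep` rather than a real daemon call.

```python
import time


class SlowProvider:
    """Hypothetical provider whose sync_turn blocks, standing in for a stalled daemon call."""

    def __init__(self, delay_s):
        self.delay_s = delay_s
        self.turns = []

    def sync_turn(self, turn):
        time.sleep(self.delay_s)  # simulated blocking network/daemon call
        self.turns.append(turn)


def sync_all_inline(providers, turn):
    """Pre-fix shape (assumed): loop over providers on the caller's thread."""
    for provider in providers:
        try:
            provider.sync_turn(turn)
        except Exception:
            pass  # a failing provider must not break the turn


def run_conversation_inline(providers, user_msg):
    """The reply is ready early, but the function cannot return until sync finishes."""
    reply = f"echo: {user_msg}"
    sync_all_inline(providers, (user_msg, reply))
    return reply


def timed_turn(delay_s):
    """Return the reply and how long the caller was held."""
    provider = SlowProvider(delay_s)
    start = time.monotonic()
    reply = run_conversation_inline([provider], "hi")
    return reply, time.monotonic() - start
```

With this shape, any interface that treats "has `run_conversation` returned?" as its busy signal stays busy for at least the provider's delay, even though the reply was computed before the sync started.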

Changes

- agent/memory_manager.py: dispatch sync_all / queue_prefetch_all provider work to a lazily-created single-worker background executor. Both return immediately; work completes (or fails, logged) in the background.
  - Single worker serializes writes so turn N lands before turn N+1.
  - flush_pending(timeout): barrier for session boundaries / deterministic tests.
  - shutdown_all() drains the executor with a bounded timeout (_SYNC_DRAIN_TIMEOUT_S = 5s) so a wedged provider can't hang teardown.
  - Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).
- tests/agent/test_memory_async_sync.py: new regression suite (non-blocking dispatch, background completion, bounded teardown, write ordering, no-executor path).
- tests/agent/test_memory_provider.py, tests/agent/test_memory_session_switch.py: add flush_pending() barrier before asserting provider state (sync is now async).
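The dispatch side of these changes might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in agent/memory_manager.py: `BackgroundSync`, `RecordingProvider`, `_get_executor`, `_sync_providers`, and the one-argument `sync_turn(turn)` signature are hypothetical names, and only the ideas (lazy single-worker executor, immediate return, sentinel-based `flush_pending`) come from the description above.

```python
import logging
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

logger = logging.getLogger(__name__)


class BackgroundSync:
    """Hypothetical stand-in for the dispatch half of MemoryManager."""

    def __init__(self, providers):
        self._providers = list(providers)
        self._executor = None
        self._lock = threading.Lock()

    def _get_executor(self):
        if not self._providers:
            return None  # no external providers: never spawn a thread
        if self._executor is None:  # cheap check without the lock
            with self._lock:
                if self._executor is None:  # re-check while holding the lock
                    self._executor = ThreadPoolExecutor(
                        max_workers=1, thread_name_prefix="memory-sync"
                    )
        return self._executor

    def sync_all(self, turn):
        """Queue the provider loop and return immediately."""
        executor = self._get_executor()
        if executor is not None:
            executor.submit(self._sync_providers, turn)

    def _sync_providers(self, turn):
        for provider in self._providers:
            try:
                provider.sync_turn(turn)
            except Exception:
                logger.warning("memory sync failed", exc_info=True)

    def flush_pending(self, timeout=None):
        """Barrier: with one worker, a finished sentinel implies all earlier tasks ran."""
        executor = self._executor
        if executor is None:
            return True
        try:
            executor.submit(lambda: None).result(timeout=timeout)
        except FutureTimeoutError:
            return False
        return True


class RecordingProvider:
    """Hypothetical provider that records turns after a short simulated delay."""

    def __init__(self, delay_s=0.05):
        self.delay_s = delay_s
        self.turns = []

    def sync_turn(self, turn):
        time.sleep(self.delay_s)
        self.turns.append(turn)
```

Because the pool has exactly one worker, queued turns run in submission order, which is what lets a single sentinel task act as a barrier. A test in the style of the updated suites would call `sync_all(...)`, then `flush_pending(timeout=5)`, and only then assert on provider state.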

Validation

| Scenario | Before | After |
| --- | --- | --- |
| `sync_all` with provider blocking 3s | caller blocks 3s (turn held "running") | returns in 0.000s; work completes in background |
| `shutdown_all` with wedged (30s) provider | could hang teardown | returns in 5.00s (bounded) |
| Targeted memory tests | 120 passed | 120 passed (+ 6 new) |

Fixes the primary symptom in the community report (agent stuck "running" 3+ min across CLI / Desktop / Telegram even on a clean turn). The cascading-interrupt / dropped-message symptoms are a separate code path (see #6600).

Infographic

A misconfigured/slow external memory provider could hold the agent in the 'running' state for minutes after the final response was delivered. MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch INLINE on the turn-completion path; a provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Because every interface (CLI, TUI, gateway) marks the agent 'running' until run_conversation returns, the agent stayed busy for the full block and any follow-up message triggered an aggressive interrupt that dropped the message.

Dispatch provider sync/prefetch to a lazily-created single-worker background executor. sync_all / queue_prefetch_all return immediately; work completes (or fails, logged) in the background. A single worker serializes writes so turn N lands before turn N+1. flush_pending() provides a barrier for session boundaries and deterministic tests. shutdown_all() drains the executor with a bounded timeout so a wedged provider can never hang teardown. Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).

liuhao1024 · 2026-06-08T09:17:03Z

Code Review: Clean — no issues found.

Reviewed the full diff (agent/memory_manager.py + 3 test files, 138-line new test suite). The async memory sync design is solid:

Single-worker ThreadPoolExecutor — max_workers=1 serializes writes so turn N lands before turn N+1. Lazy creation avoids spawning threads for builtin-only sessions (no providers means no executor).
_submit_background fallback — if the executor is unavailable (already drained / creation failed), work runs inline. A RuntimeError catch handles the get-submit race during teardown. Writes are never silently dropped.
_drain_sync_executor — stops accepting new work and cancels queued tasks. A daemon watcher thread does a bounded wait (5s) via join. Wedged providers cannot hang teardown — they die with the interpreter.
flush_pending — submits a sentinel lambda and waits on its future. Single-worker semantics guarantee all prior tasks have completed. Used by tests for deterministic assertions and by session boundaries.
Thread safety — _sync_executor_lock guards lazy init and teardown. The double-checked locking in _get_sync_executor is correct (check outside lock, acquire, check inside).
Existing test updates — all existing test_memory_provider.py and test_memory_session_switch.py tests now call flush_pending(timeout=5) after sync/prefetch calls, preserving their assertions while adapting to async dispatch.
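Two of the reviewed mechanisms, the inline fallback and the bounded drain, can be sketched as follows. The helper names `submit_or_run_inline` and `drain_bounded` are hypothetical and the bodies are an assumption about the general technique, not a copy of `_submit_background` or `_drain_sync_executor`. The sketch relies on `ThreadPoolExecutor.shutdown(cancel_futures=True)`, which requires Python 3.9+.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def submit_or_run_inline(executor, fn, *args):
    """Prefer the background executor; fall back to the caller's thread.

    submit() raises RuntimeError once the executor has been shut down,
    so work that races with teardown still runs instead of being dropped.
    """
    if executor is not None:
        try:
            executor.submit(fn, *args)
            return "background"
        except RuntimeError:
            pass  # executor already shut down: fall through to inline
    fn(*args)
    return "inline"


def drain_bounded(executor, timeout_s):
    """Stop accepting work, cancel queued tasks, wait at most timeout_s.

    Returns True if the worker finished within the timeout, False if a
    task is still running (for example, a wedged provider call).
    """
    # Non-blocking: refuse new submissions and cancel tasks not yet started.
    executor.shutdown(wait=False, cancel_futures=True)
    # The blocking wait happens on a daemon thread so the caller can give up.
    watcher = threading.Thread(
        target=executor.shutdown, kwargs={"wait": True}, daemon=True
    )
    watcher.start()
    watcher.join(timeout_s)
    return not watcher.is_alive()
```

One caveat about this sketch specifically: on Python 3.9+ the standard library joins `ThreadPoolExecutor` worker threads at interpreter exit, so `drain_bounded` bounds only the drain call itself. How the merged code handles a worker that never returns is not shown here.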

teknium1 mentioned this pull request Jun 8, 2026

fix(agent): don't retry interrupt-induced transport errors (salvage #6600) #41952

Merged

alt-glitch added the type/bug, P2, comp/agent, and tool/memory labels Jun 8, 2026

teknium1 merged commit aa6f277 into main Jun 8, 2026
23 checks passed

teknium1 deleted the fix/async-memory-sync-running-state branch June 8, 2026 09:19

teknium1 mentioned this pull request Jun 11, 2026

fix(agent): dispatch external memory sync to daemon thread #24486

Closed
