fix(memory): run end-of-turn sync off the turn thread (agent stuck 'running') #41945
A misconfigured/slow external memory provider could hold the agent in the 'running' state for minutes after the final response was delivered. MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch INLINE on the turn-completion path; a provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Because every interface (CLI, TUI, gateway) marks the agent 'running' until run_conversation returns, the agent stayed busy for the full block and any follow-up message triggered an aggressive interrupt that dropped the message.
Dispatch provider sync/prefetch to a lazily-created single-worker background executor. sync_all / queue_prefetch_all return immediately; work completes (or fails, logged) in the background. A single worker serializes writes so turn N lands before turn N+1.
flush_pending() provides a barrier for session boundaries and deterministic tests. shutdown_all() drains the executor with a bounded timeout so a wedged provider can never hang teardown. Builtin-only / no-provider sessions spawn no executor (zero new threads in the common case).
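The dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name SketchMemoryManager, the SlowProvider stand-in, the private helpers, and the two-argument sync_turn signature are all assumptions made for the example. Only sync_all, sync_turn and flush_pending are names taken from the description.

```python
import logging
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait
from typing import List, Optional

logger = logging.getLogger(__name__)


class SlowProvider:
    """Hypothetical provider whose sync_turn blocks, standing in for a slow daemon."""

    def __init__(self, delay_s: float) -> None:
        self.delay_s = delay_s
        self.turns: List[tuple] = []

    def sync_turn(self, user_msg: str, assistant_msg: str) -> None:
        time.sleep(self.delay_s)  # simulated blocking network/daemon call
        self.turns.append((user_msg, assistant_msg))


class SketchMemoryManager:
    """Illustrative stand-in for MemoryManager (assumed shape, not the real class)."""

    def __init__(self, providers) -> None:
        self._providers = list(providers)
        self._executor: Optional[ThreadPoolExecutor] = None
        self._pending: List[Future] = []
        self._lock = threading.Lock()

    @staticmethod
    def _run(fn, *args) -> None:
        # Failures are logged in the background instead of reaching the turn thread.
        try:
            fn(*args)
        except Exception:
            logger.exception("memory provider call failed")

    def _submit(self, fn, *args) -> None:
        with self._lock:
            if self._executor is None:
                # Created on first use, so sessions without providers start no thread.
                # One worker means submitted jobs run strictly in submission order.
                self._executor = ThreadPoolExecutor(max_workers=1)
            self._pending = [f for f in self._pending if not f.done()]
            self._pending.append(self._executor.submit(self._run, fn, *args))

    def sync_all(self, user_msg: str, assistant_msg: str) -> None:
        # Returns as soon as the work is queued; nothing blocks the caller.
        for provider in self._providers:
            self._submit(provider.sync_turn, user_msg, assistant_msg)

    def flush_pending(self, timeout: Optional[float] = None) -> bool:
        # Barrier: True if all queued work finished within the timeout.
        with self._lock:
            pending = list(self._pending)
        _, not_done = wait(pending, timeout=timeout)
        return not not_done
```

With a SlowProvider(0.2), two consecutive sync_all calls return almost immediately, and after flush_pending() the provider has recorded the turns in the order they were submitted.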
Code Review: Clean — no issues found. Reviewed the full diff (agent/memory_manager.py + 3 test files, 138-line new test suite). The async memory sync design is solid:
Summary
A slow or misconfigured external memory provider no longer holds the agent in the "running" state after it has already delivered its final response.
Root cause:
MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn / queue_prefetch inline on the turn-completion path. A provider making a blocking network/daemon call (a broken Hindsight daemon was observed blocking ~298s before failing) blocked run_conversation from returning. Every interface (CLI, TUI, gateway) marks the agent "running" until run_conversation returns, so the agent stayed busy for the full block, and any follow-up message triggered an aggressive interrupt that dropped the message.
Changes
- agent/memory_manager.py: dispatch sync_all / queue_prefetch_all provider work to a lazily-created single-worker background executor. Both return immediately; work completes (or fails, logged) in the background.
  - flush_pending(timeout): barrier for session boundaries / deterministic tests.
  - shutdown_all() drains the executor with a bounded timeout (_SYNC_DRAIN_TIMEOUT_S = 5s) so a wedged provider can't hang teardown.
- tests/agent/test_memory_async_sync.py: new regression suite (non-blocking dispatch, background completion, bounded teardown, write ordering, no-executor path).
- tests/agent/test_memory_provider.py, tests/agent/test_memory_session_switch.py: add flush_pending() barrier before asserting provider state (sync is now async).
Validation
- sync_all with provider blocking 3s
- shutdown_all with wedged (30s) provider
Fixes the primary symptom in the community report (agent stuck "running" 3+ min across CLI / Desktop / Telegram even on a clean turn). The cascading-interrupt / dropped-message symptoms are a separate code path (see #6600).
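The bounded teardown listed under Changes could look roughly like the sketch below. It assumes a standard-library ThreadPoolExecutor and Python 3.9+ (for cancel_futures); the helper name drain_with_timeout is hypothetical, and only the _SYNC_DRAIN_TIMEOUT_S value comes from the description.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait

_SYNC_DRAIN_TIMEOUT_S = 5.0  # value quoted in the PR description


def drain_with_timeout(executor, pending, timeout=_SYNC_DRAIN_TIMEOUT_S) -> bool:
    """Hypothetical helper: wait at most `timeout` seconds, then let go of the executor.

    Returns True if every pending future finished within the timeout.
    """
    _, not_done = wait(pending, timeout=timeout)
    # wait=False returns immediately instead of joining the worker thread;
    # cancel_futures=True drops queued jobs that have not started yet.
    executor.shutdown(wait=False, cancel_futures=True)
    return not not_done
```

In this sketch only the caller's wait is bounded. A provider call that is already running is not interrupted and keeps its worker thread until it returns.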