Skip to content

fix(hindsight): drain retain queue cleanly on shutdown#17005

Merged
teknium1 merged 1 commit into
NousResearch:mainfrom
nicoloboschi:fix/hindsight-shutdown-race
Apr 29, 2026
Merged

fix(hindsight): drain retain queue cleanly on shutdown#17005
teknium1 merged 1 commit into
NousResearch:mainfrom
nicoloboschi:fix/hindsight-shutdown-race

Conversation

@nicoloboschi

@nicoloboschi nicoloboschi commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Hindsight memory works during normal Hermes runtime, but on CLI / session exit the plugin's background retain path can race interpreter shutdown:

WARNING plugins.memory.hindsight: Hindsight sync failed: cannot schedule new futures after interpreter shutdown
RuntimeError: cannot schedule new futures after interpreter shutdown
ERROR asyncio: Unclosed client session
        client_session: <aiohttp.client.ClientSession object at 0x...>

Mostly exit-time log noise, but it can also silently drop the last unsaved turn — typically the most important one.

Root cause

sync_turn() previously spawned one ad-hoc daemon thread per call to do the aretain_batch network write. When that thread is still mid-flight while interpreter shutdown begins, asyncio's "cannot schedule new futures" guard fires and aiohttp's session is never cleanly closed.

Fix

Per-provider single-writer model:

  • One long-lived writer thread + queue.Queue per provider.
  • sync_turn() snapshots state and enqueues a callable; no thread spawn per call.
  • queue_prefetch() and sync_turn() early-return once shutdown() has fired — "once shutdown starts, stop accepting new writes."
  • shutdown() sets the gate, sends a sentinel, joins the writer (10s) so in-flight work drains, then closes the client.
  • Idempotent atexit hook registered from the first sync_turn() so exit paths that don't go through MemoryManager.shutdown_all() (Ctrl-C, abrupt exit) still get a chance to drain.

_sync_thread is kept as a legacy alias pointing at the writer for any external callers; tests use _retain_queue.join() (the canonical wait-for-drain idiom) instead.

Test plan

  • uv run pytest tests/plugins/memory/test_hindsight_provider.py86/86 passed
  • uv run pytest tests/agent/test_memory_provider.py tests/plugins/memory/194/194 passed
  • New TestShutdownRace suite covers:
    • Single writer thread reused across sync_turn() calls (no per-call spawn)
    • sync_turn after shutdown() is dropped (no enqueue, no client call)
    • queue_prefetch after shutdown() is dropped
    • shutdown() drains pending retains before returning
    • shutdown() is idempotent

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins tool/memory Memory tool and memory providers labels Apr 28, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Fixes #15497 and #15073. Competing PRs: #15507, #15512 — all address the same Hindsight shutdown race. This PR uses a single-writer queue model which is the most comprehensive approach.

@nicoloboschi nicoloboschi changed the title fix(hindsight): drain retain queue cleanly on shutdown fix(hindsight): retain shape + clean shutdown drain Apr 28, 2026
nicoloboschi added a commit to vectorize-io/hindsight that referenced this pull request Apr 28, 2026
Two improvements to make the smoke test reliable end-to-end:

1. Default model: gpt-4o-mini -> gpt-4o. The smaller model is unreliable
   on Hindsight's fact-extraction structured-output schema — the LLM
   returns ~163 output tokens but an empty `facts` array, so retain
   succeeds (200 OK) yet the bank stays empty and recall surfaces
   nothing. gpt-4o produces consistent extractions for this content.

2. Realistic content: replace synthetic Q&A ("What's my favorite
   language?" / "The user's favorite language is Rust") with a
   first-person user statement, which is what the fact extractor is
   tuned to mine.

Also switch the post-retain wait from `_sync_thread.join(60s)` to
`_retain_queue.join()` — the canonical wait-for-drain idiom under the
plugin's new single-writer model (NousResearch/hermes-agent#17005).
@nicoloboschi nicoloboschi force-pushed the fix/hindsight-shutdown-race branch from ed8965e to 93ca4ef Compare April 28, 2026 16:33
@nicoloboschi nicoloboschi changed the title fix(hindsight): retain shape + clean shutdown drain fix(hindsight): drain retain queue cleanly on shutdown Apr 28, 2026
The plugin used to spawn one daemon thread per sync_turn() to do the
aretain_batch network write. On CLI exit, that pattern raced interpreter
shutdown — the last retain could reach aiohttp after asyncio's
"cannot schedule new futures" guard had fired, producing noisy logs and
silently losing the final unsaved turn:

    WARNING ... Hindsight sync failed: cannot schedule new futures after
            interpreter shutdown
    ERROR asyncio: Unclosed client session
            client_session: <aiohttp.client.ClientSession object at 0x...>

Switch to a single-writer model: each provider owns one long-lived
writer thread plus a queue. sync_turn() snapshots state and enqueues a
job; the writer drains sequentially. Once shutdown() is called:

  - new sync_turn() / queue_prefetch() calls are dropped, not enqueued
  - a sentinel wakes the writer so it finishes in-flight work
  - shutdown joins the writer (10s) before nulling the client

Also register an idempotent atexit hook from the first sync_turn(), so
exit paths that don't go through MemoryManager.shutdown_all() (Ctrl-C,
abrupt exit) still get a chance to drain.

Tests: keep _sync_thread as a legacy alias to the writer, swap join()
calls to _retain_queue.join() (canonical wait-for-drain), add a new
TestShutdownRace suite covering single-writer reuse, post-shutdown drop,
queue draining, and shutdown idempotency.
@nicoloboschi nicoloboschi force-pushed the fix/hindsight-shutdown-race branch from 93ca4ef to 8b9e948 Compare April 29, 2026 12:05
@teknium1 teknium1 merged commit 0565497 into NousResearch:main Apr 29, 2026
6 of 7 checks passed
@nicoloboschi

Copy link
Copy Markdown
Contributor Author

Consolidated into #17447 (same fix + the session-switch buffer-flush + prefetch-clear cousins, all sharing the writer queue surface). Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have tool/memory Memory tool and memory providers type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants