fix(hindsight): drain retain queue cleanly on shutdown by nicoloboschi · Pull Request #17005 · NousResearch/hermes-agent

nicoloboschi · 2026-04-28T12:49:38Z

Summary

Hindsight memory works during normal Hermes runtime, but on CLI / session exit the plugin's background retain path can race interpreter shutdown:

WARNING plugins.memory.hindsight: Hindsight sync failed: cannot schedule new futures after interpreter shutdown
RuntimeError: cannot schedule new futures after interpreter shutdown
ERROR asyncio: Unclosed client session
        client_session: <aiohttp.client.ClientSession object at 0x...>

Mostly exit-time log noise, but it can also silently drop the last unsaved turn — typically the most important one.

Root cause

sync_turn() previously spawned one ad-hoc daemon thread per call to do the aretain_batch network write. When that thread is still mid-flight while interpreter shutdown begins, asyncio's "cannot schedule new futures" guard fires and aiohttp's session is never cleanly closed.

Fix

Per-provider single-writer model:

One long-lived writer thread + queue.Queue per provider.
sync_turn() snapshots state and enqueues a callable; no thread spawn per call.
queue_prefetch() and sync_turn() early-return once shutdown() has fired — "once shutdown starts, stop accepting new writes."
shutdown() sets the gate, sends a sentinel, joins the writer (10s) so in-flight work drains, then closes the client.
Idempotent atexit hook registered from the first sync_turn() so exit paths that don't go through MemoryManager.shutdown_all() (Ctrl-C, abrupt exit) still get a chance to drain.

_sync_thread is kept as a legacy alias pointing at the writer for any external callers; tests use _retain_queue.join() (the canonical wait-for-drain idiom) instead.

Test plan

uv run pytest tests/plugins/memory/test_hindsight_provider.py — 86/86 passed
uv run pytest tests/agent/test_memory_provider.py tests/plugins/memory/ — 194/194 passed
New TestShutdownRace suite covers:
- Single writer thread reused across sync_turn() calls (no per-call spawn)
- sync_turn after shutdown() is dropped (no enqueue, no client call)
- queue_prefetch after shutdown() is dropped
- shutdown() drains pending retains before returning
- shutdown() is idempotent

alt-glitch · 2026-04-28T13:07:25Z

Fixes #15497 and #15073. Competing PRs: #15507, #15512 — all address the same Hindsight shutdown race. This PR uses a single-writer queue model which is the most comprehensive approach.

Two improvements to make the smoke test reliable end-to-end: 1. Default model: gpt-4o-mini -> gpt-4o. The smaller model is unreliable on Hindsight's fact-extraction structured-output schema — the LLM returns ~163 output tokens but an empty `facts` array, so retain succeeds (200 OK) yet the bank stays empty and recall surfaces nothing. gpt-4o produces consistent extractions for this content. 2. Realistic content: replace synthetic Q&A ("What's my favorite language?" / "The user's favorite language is Rust") with a first-person user statement, which is what the fact extractor is tuned to mine. Also switch the post-retain wait from `_sync_thread.join(60s)` to `_retain_queue.join()` — the canonical wait-for-drain idiom under the plugin's new single-writer model (NousResearch/hermes-agent#17005).

The plugin used to spawn one daemon thread per sync_turn() to do the aretain_batch network write. On CLI exit, that pattern raced interpreter shutdown — the last retain could reach aiohttp after asyncio's "cannot schedule new futures" guard had fired, producing noisy logs and silently losing the final unsaved turn: WARNING ... Hindsight sync failed: cannot schedule new futures after interpreter shutdown ERROR asyncio: Unclosed client session client_session: <aiohttp.client.ClientSession object at 0x...> Switch to a single-writer model: each provider owns one long-lived writer thread plus a queue. sync_turn() snapshots state and enqueues a job; the writer drains sequentially. Once shutdown() is called: - new sync_turn() / queue_prefetch() calls are dropped, not enqueued - a sentinel wakes the writer so it finishes in-flight work - shutdown joins the writer (10s) before nulling the client Also register an idempotent atexit hook from the first sync_turn(), so exit paths that don't go through MemoryManager.shutdown_all() (Ctrl-C, abrupt exit) still get a chance to drain. Tests: keep _sync_thread as a legacy alias to the writer, swap join() calls to _retain_queue.join() (canonical wait-for-drain), add a new TestShutdownRace suite covering single-writer reuse, post-shutdown drop, queue draining, and shutdown idempotency.

nicoloboschi · 2026-04-29T14:36:37Z

Consolidated into #17447 (same fix + the session-switch buffer-flush + prefetch-clear cousins, all sharing the writer queue surface). Closing.

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins tool/memory Memory tool and memory providers labels Apr 28, 2026

nicoloboschi changed the title ~~fix(hindsight): drain retain queue cleanly on shutdown~~ fix(hindsight): retain shape + clean shutdown drain Apr 28, 2026

nicoloboschi force-pushed the fix/hindsight-shutdown-race branch from ed8965e to 93ca4ef Compare April 28, 2026 16:33

nicoloboschi changed the title ~~fix(hindsight): retain shape + clean shutdown drain~~ fix(hindsight): drain retain queue cleanly on shutdown Apr 28, 2026

dagbs mentioned this pull request Apr 29, 2026

RuntimeError: Timeout context manager should be used inside a task during aretain_batch when background event loop is silently replaced #17226

Open

nicoloboschi force-pushed the fix/hindsight-shutdown-race branch from 93ca4ef to 8b9e948 Compare April 29, 2026 12:05

teknium1 merged commit 0565497 into NousResearch:main Apr 29, 2026
6 of 7 checks passed

nicoloboschi mentioned this pull request Apr 29, 2026

fix(hindsight): flush buffered turns and drop stale prefetch on session switch #17447

Closed

2 tasks

alt-glitch mentioned this pull request May 1, 2026

fix: join sync_turn thread before tool calls to prevent race condition on _cache #5103

Open

nicoloboschi mentioned this pull request May 6, 2026

fix: prevent Hindsight retain work during interpreter shutdown #15512

Open

This was referenced Jun 4, 2026

Hindsight provider can submit retain work during interpreter shutdown #15497

Open

[Bug]: Hindsight sync can race interpreter shutdown after successful one-shot CLI exit #15073

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hindsight): drain retain queue cleanly on shutdown#17005

fix(hindsight): drain retain queue cleanly on shutdown#17005
teknium1 merged 1 commit into
NousResearch:mainfrom
nicoloboschi:fix/hindsight-shutdown-race

nicoloboschi commented Apr 28, 2026 •

edited

Loading

Uh oh!

alt-glitch commented Apr 28, 2026

Uh oh!

Uh oh!

nicoloboschi commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nicoloboschi commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Test plan

Uh oh!

alt-glitch commented Apr 28, 2026

Uh oh!

Uh oh!

nicoloboschi commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nicoloboschi commented Apr 28, 2026 •

edited

Loading