fix(hindsight): flush buffered turns + drop stale prefetch on session switch by teknium1 · Pull Request #17508 · NousResearch/hermes-agent

teknium1 · 2026-04-29T15:08:31Z

Salvage of #17447 (@nicoloboschi) — same fix, same tests, + one follow-up: route the flush through the existing writer queue instead of a raw thread.

Summary

HindsightMemoryProvider.on_session_switch was silently losing partial batches and leaking prior-session recall across /reset, /resume, /branch, /new, and context compression. This PR flushes buffered turns under the OLD document_id with OLD lineage before rotation, drains in-flight prefetch, and clears _prefetch_result.

Bugs fixed (from #17447)

Data loss when retain_every_n_turns > 1 — on_session_switch cleared _session_turns without flushing. Any in-flight batch disappeared at session switch time. Same data-loss class as the shutdown race, different lifecycle event. Reproduced on current main via bare-object repro: _session_turns=['a','b'] → on_session_switch('new') → buffer is [] with zero flush.
Stale prefetch leak across switch — if queue_prefetch ran in the old session and prefetch() hadn't consumed the result, the new session's first prefetch() returned text mined from the prior session's bank.
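The two bugs above can be illustrated with a toy model. This is a minimal sketch, not the real HindsightMemoryProvider: `ToyProvider`, its field names, the `retained` list, and the `doc-<session>` naming scheme are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a batching memory provider (not the real class)."""

    session_id: str
    document_id: str
    session_turns: list = field(default_factory=list)
    prefetch_result: str = ""
    retained: list = field(default_factory=list)  # (document_id, turns) pairs

    def switch_buggy(self, new_session_id: str) -> None:
        # Pre-fix behaviour as described above: the buffer is cleared with no
        # flush, and the cached recall text is left in place.
        self.session_turns = []
        self.session_id = new_session_id
        self.document_id = f"doc-{new_session_id}"  # assumed naming scheme

    def switch_fixed(self, new_session_id: str) -> None:
        # Snapshot the OLD identity before touching any state.
        old_document_id = self.document_id
        old_turns = list(self.session_turns)
        if old_turns:
            # The flush lands under the OLD document id.
            self.retained.append((old_document_id, old_turns))
        # Drop recall that was cached for the old session.
        self.prefetch_result = ""
        # Only now rotate to the new session.
        self.session_turns = []
        self.session_id = new_session_id
        self.document_id = f"doc-{new_session_id}"  # assumed naming scheme


def demo_switch():
    buggy = ToyProvider("old", "doc-old", ["a", "b"], "old recall")
    buggy.switch_buggy("new")
    fixed = ToyProvider("old", "doc-old", ["a", "b"], "old recall")
    fixed.switch_fixed("new")
    return buggy, fixed
```

In the buggy variant the two buffered turns vanish and the old recall text survives the switch; in the fixed variant the turns are recorded against the old document before rotation.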

Follow-up in this salvage (second commit)

Nicolò's original fix spawned a raw threading.Thread for the flush, overwriting self._sync_thread (which is aliased to the long-lived writer thread). Two issues:

No serialization with the writer queue. sync_turn enqueues retains on _retain_queue drained by the long-lived _writer_thread. The raw-thread flush ran concurrently with the writer — two threads could call aretain_batch against the same document_id.
Broken pre-spawn join. The self._sync_thread.join(timeout=5.0) before spawn tried to join the long-lived writer, which never exits on its own, so it always timed out and never actually serialized anything.
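The broken-join issue comes from standard `threading` semantics: `Thread.join(timeout=...)` returns `None` whether or not the thread finished, so joining a consumer that never exits just burns the timeout. The sketch below uses a hypothetical writer loop and queue to show this; it is not code from the PR.

```python
import queue
import threading


def join_long_lived_writer(timeout: float = 0.2) -> bool:
    """Return True if the writer is still alive after join(timeout)."""
    work = queue.Queue()

    def writer_loop():
        # Long-lived consumer: blocks on the queue until it sees the None sentinel.
        while True:
            item = work.get()
            if item is None:
                break

    writer = threading.Thread(target=writer_loop, daemon=True)
    writer.start()

    # join() gives up silently after the timeout; the caller has to check
    # is_alive() to learn whether the thread actually finished.
    writer.join(timeout=timeout)
    still_alive = writer.is_alive()

    work.put(None)  # stop the writer so the demo cleans up after itself
    writer.join()
    return still_alive
```

Calling `join_long_lived_writer()` returns `True`: the join timed out and nothing was serialized, which matches the behaviour described for the pre-spawn join.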

Fix: enqueue the flush closure on _retain_queue via _ensure_writer() + put(), the same path sync_turn uses. Natural FIFO ordering behind any pending retains, no new thread, no broken join. Shutdown-aware (if not self._shutting_down.is_set()) so it can't enqueue during teardown.
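A rough sketch of the queue-based flush follows, assuming a single consumer thread draining a `queue.Queue` of callables. `ToyWriter` and its method names are hypothetical and only loosely mirror `_retain_queue`, `_ensure_writer`, and `_shutting_down`; the real provider's internals may differ.

```python
import queue
import threading


class ToyWriter:
    """Hypothetical single-writer queue; not the real provider."""

    def __init__(self):
        self.retain_queue = queue.Queue()
        self.shutting_down = threading.Event()
        self.calls = []  # records (document_id, turns) in execution order
        self._writer = None

    def ensure_writer(self):
        # Start the long-lived consumer lazily, once.
        if self._writer is None or not self._writer.is_alive():
            self._writer = threading.Thread(target=self._drain, daemon=True)
            self._writer.start()

    def _drain(self):
        while True:
            job = self.retain_queue.get()
            try:
                if job is None:
                    return
                job()
            finally:
                self.retain_queue.task_done()

    def enqueue(self, job) -> bool:
        # Shutdown-aware: refuse new work once teardown has started.
        if self.shutting_down.is_set():
            return False
        self.ensure_writer()
        self.retain_queue.put(job)
        return True

    def flush_on_switch(self, old_document_id, old_turns) -> bool:
        if not old_turns:
            return False  # empty buffer: no spurious retain
        turns = list(old_turns)  # snapshot before the caller rotates state
        return self.enqueue(lambda: self.calls.append((old_document_id, turns)))


def demo_fifo():
    w = ToyWriter()
    w.enqueue(lambda: w.calls.append(("doc-old", ["earlier"])))
    w.flush_on_switch("doc-old", ["a", "b"])
    w.retain_queue.join()  # wait until both jobs have run
    w.shutting_down.set()
    rejected = w.flush_on_switch("doc-old", ["late"])
    return w.calls, rejected
```

Because one thread consumes the queue, the flush cannot overlap an earlier retain, and ordering falls out of the queue rather than from any join.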

Tests

4 tests from Nicolò — buffer flushed under OLD doc/lineage, no spurious retain on empty buffer, _prefetch_result cleared, in-flight prefetch drained.
1 new regression guard — test_flush_serializes_behind_pending_retains_via_writer_queue blocks the writer mid-retain with an Event and proves the flush lands FIFO behind the pending retain rather than racing it.
103/103 passing on tests/plugins/memory/test_hindsight_provider.py + tests/agent/test_memory_session_switch.py.
E2E against the worktree: bare-object repro against patched code → flush enqueued on writer queue with OLD document_id and session:old lineage tag; stale prefetch cleared; rotation completes.
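The idea behind the regression guard can be sketched in isolation: hold the writer inside a pending job with an `Event`, enqueue the flush, and check that nothing has run until the writer is released. The function and job names below are hypothetical; this is not the test from the PR.

```python
import queue
import threading


def flush_waits_behind_pending_retain():
    jobs = queue.Queue()
    order = []
    started = threading.Event()
    release = threading.Event()

    def writer():
        # Single consumer: runs jobs strictly in queue order.
        while True:
            job = jobs.get()
            try:
                if job is None:
                    return
                job()
            finally:
                jobs.task_done()

    def pending_retain():
        started.set()              # tell the test the writer is mid-retain
        release.wait(timeout=5.0)  # block until the test lets it finish
        order.append("pending-retain")

    def flush():
        order.append("switch-flush")

    t = threading.Thread(target=writer, daemon=True)
    t.start()
    jobs.put(pending_retain)
    if not started.wait(timeout=5.0):
        raise RuntimeError("writer never picked up the pending retain")

    jobs.put(flush)                   # flush is queued behind the blocked retain
    seen_while_blocked = list(order)  # nothing may have run yet

    release.set()
    jobs.join()                       # drain via the queue, not a thread join
    jobs.put(None)
    t.join(timeout=5.0)
    return seen_while_blocked, order
```

If the flush ran on its own thread instead of the queue, `seen_while_blocked` could already contain `"switch-flush"`, which is the race the guard is meant to catch.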

Commits

c3bbdc23d — original fix (authored by @nicoloboschi, cherry-picked)
177a6905e — follow-up: route flush through writer queue

Closes #17447.

…on switch

Two data-loss / leak gaps in HindsightMemoryProvider.on_session_switch introduced by #17409.

1. Buffered turns silently lost when retain_every_n_turns > 1.

on_session_switch unconditionally cleared _session_turns without flushing. Users who batched every N>1 turns and switched mid-batch (/reset, /new, /resume, /branch, or context compression) had those buffered turns disappear. Same data-loss class as the shutdown race, different lifecycle event.

Note commit_memory_session() -> on_session_end() runs *before* on_session_switch on /reset, but Hindsight doesn't implement on_session_end so the buffer survives that step and dies at clear time. /resume, /branch, and compression skip commit_memory_session entirely so an on_session_end impl wouldn't help them anyway.

Fix: snapshot the old _session_id, _document_id, _parent_session_id, _turn_index, and _session_turns; spawn one final retain that lands under the OLD document_id; then rotate state. Metadata is built synchronously against the old self._* so session_id / lineage tags on the flushed item all reference the prior session consistently.

2. Stale _prefetch_result leaks across switch.

If queue_prefetch ran in the old session and the result hadn't been consumed by prefetch() yet, on_session_switch left the cached recall text in place. The next session's first prefetch() call would return text mined from the prior session's bank/query.

Fix: join any in-flight _prefetch_thread (3s bounded, matches shutdown()), then clear _prefetch_result under _prefetch_lock before rotating session_id.

Tests
-----

- tests/plugins/memory/test_hindsight_provider.py (TestSessionSwitchBufferFlush):
  - buffered turns flushed under OLD document_id with OLD lineage tags
  - empty buffer => no spurious retain
  - _prefetch_result cleared on switch
  - in-flight prefetch thread is awaited before clear (no race)
- tests/agent/test_memory_session_switch.py: factory extended to seed the attrs the new flush path reads (_retain_source, _platform, _bank_id, prefetch state, etc.) and stub _run_hindsight_operation so existing switch-state assertions keep passing without network setup.
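The prefetch half of the fix (bounded join, then clear under the lock) can be sketched as below. `ToyPrefetch` and its attributes are hypothetical stand-ins for `_prefetch_thread`, `_prefetch_lock`, and `_prefetch_result`; only the 3-second bound is taken from the commit message.

```python
import threading


class ToyPrefetch:
    """Hypothetical prefetch cache; not the real provider."""

    def __init__(self):
        self.lock = threading.Lock()
        self.result = ""
        self.thread = None

    def queue_prefetch(self, text, gate):
        def run():
            gate.wait(timeout=5.0)  # simulate a slow recall call
            with self.lock:
                self.result = text

        self.thread = threading.Thread(target=run, daemon=True)
        self.thread.start()

    def drop_on_switch(self, join_timeout=3.0):
        # Wait (bounded) for any in-flight prefetch so it cannot write its
        # result after we have cleared the cache.
        t = self.thread
        if t is not None and t.is_alive():
            t.join(timeout=join_timeout)
        with self.lock:
            self.result = ""
        self.thread = None


def demo_prefetch_drain():
    p = ToyPrefetch()
    gate = threading.Event()
    p.queue_prefetch("recall from old session", gate)
    threading.Timer(0.05, gate.set).start()  # prefetch finishes shortly after
    p.drop_on_switch()
    return p.result
```

Without the join, the clear could run first and the late prefetch would then repopulate the cache with old-session text. The bound keeps a hung prefetch from stalling the switch indefinitely.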

…hread

Follow-up to the cherry-picked PR #17447. The original flush spawned a bare threading.Thread for the buffer-flush path, overwriting self._sync_thread, which is aliased to the long-lived writer thread. Two consequences:

1. No serialization with the writer queue. If old-session retains were still queued in _retain_queue, the flush ran concurrently with the writer and both threads could call aretain_batch against the same document_id.

2. The pre-spawn 'self._sync_thread.join(timeout=5.0)' tried to join the long-lived writer, which never exits, so the join was a no-op that just timed out and never actually serialized anything.

Fix: enqueue the flush closure on _retain_queue via _ensure_writer + put(). Natural FIFO ordering behind any pending retains, no new thread, no broken join. Shutdown-aware so it doesn't enqueue after teardown.

Tests updated to drain via _retain_queue.join() instead of the stale _sync_thread.join(). Added regression guard test_flush_serializes_behind_pending_retains_via_writer_queue that blocks the writer mid-retain to prove the flush waits in FIFO behind the old retain.

Also seeds _retain_queue / _shutting_down / stubbed _ensure_writer on the bare-object test helper in test_memory_session_switch.py so that path doesn't blow up under the new queue-enqueue.

tests/plugins/memory/test_hindsight_provider.py + tests/agent/test_memory_session_switch.py: 103/103 passing.

nicoloboschi and others added 2 commits April 29, 2026 08:05

teknium1 merged commit 0a5ee01 into main Apr 29, 2026
10 of 11 checks passed

teknium1 deleted the hermes/hermes-b72c17d6 branch April 29, 2026 15:09

teknium1 mentioned this pull request Apr 29, 2026

fix(hindsight): flush buffered turns and drop stale prefetch on session switch #17447

Closed

2 tasks

alt-glitch added the type/bug, P3, comp/plugins, and tool/memory labels Apr 29, 2026

harshitAgr mentioned this pull request May 19, 2026

fix(openviking): implement on_session_switch hook (#28296) #28445

Open

2 tasks
