fix(hindsight): flush buffered turns + drop stale prefetch on session switch#17508
Merged
Conversation
…on switch Two data-loss / leak gaps in HindsightMemoryProvider.on_session_switch introduced by #17409. 1. Buffered turns silently lost when retain_every_n_turns > 1. on_session_switch unconditionally cleared _session_turns without flushing. Users who batched every N>1 turns and switched mid-batch (/reset, /new, /resume, /branch, or context compression) had those buffered turns disappear. Same data-loss class as the shutdown race, different lifecycle event. Note commit_memory_session() -> on_session_end() runs *before* on_session_switch on /reset, but Hindsight doesn't implement on_session_end so the buffer survives that step and dies at clear time. /resume, /branch, and compression skip commit_memory_session entirely so an on_session_end impl wouldn't help them anyway. Fix: snapshot the old _session_id, _document_id, _parent_session_id, _turn_index, and _session_turns; spawn one final retain that lands under the OLD document_id; then rotate state. Metadata is built synchronously against the old self._* so session_id / lineage tags on the flushed item all reference the prior session consistently. 2. Stale _prefetch_result leaks across switch. If queue_prefetch ran in the old session and the result hadn't been consumed by prefetch() yet, on_session_switch left the cached recall text in place. The next session's first prefetch() call would return text mined from the prior session's bank/query. Fix: join any in-flight _prefetch_thread (3s bounded — matches shutdown()), then clear _prefetch_result under _prefetch_lock before rotating session_id. Tests ----- - tests/plugins/memory/test_hindsight_provider.py (TestSessionSwitchBufferFlush): - buffered turns flushed under OLD document_id with OLD lineage tags - empty buffer => no spurious retain - _prefetch_result cleared on switch - in-flight prefetch thread is awaited before clear (no race) - tests/agent/test_memory_session_switch.py: factory extended to seed the attrs the new flush path reads (_retain_source, _platform, _bank_id, prefetch state, etc.) and stub _run_hindsight_operation so existing switch-state assertions keep passing without network setup.
…hread Follow-up to the cherry-picked PR #17447. The original flush spawned a bare threading.Thread for the buffer-flush path, overwriting self._sync_thread — which is aliased to the long-lived writer thread. Two consequences: 1. No serialization with the writer queue. If old-session retains were still queued in _retain_queue, the flush ran concurrently with the writer and both threads could call aretain_batch against the same document_id. 2. The pre-spawn 'self._sync_thread.join(timeout=5.0)' tried to join the long-lived writer, which never exits, so the join was a no-op that just timed out — never actually serialized anything. Fix: enqueue the flush closure on _retain_queue via _ensure_writer + put(). Natural FIFO ordering behind any pending retains, no new thread, no broken join. Shutdown-aware so it doesn't enqueue after teardown. Tests updated to drain via _retain_queue.join() instead of the stale _sync_thread.join(). Added regression guard test_flush_serializes_behind_pending_retains_via_writer_queue that blocks the writer mid-retain to prove the flush waits in FIFO behind the old retain. Also seeds _retain_queue / _shutting_down / stubbed _ensure_writer on the bare-object test helper in test_memory_session_switch.py so that path doesn't blow up under the new queue-enqueue. tests/plugins/memory/test_hindsight_provider.py + tests/agent/test_memory_session_switch.py: 103/103 passing.
2 tasks
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Salvage of #17447 (@nicoloboschi) — same fix, same tests, + one follow-up: route the flush through the existing writer queue instead of a raw thread.
Summary
HindsightMemoryProvider.on_session_switchwas silently losing partial batches and leaking prior-session recall across/reset,/resume,/branch,/new, and context compression. This PR flushes buffered turns under the OLDdocument_idwith OLD lineage before rotation, drains in-flight prefetch, and clears_prefetch_result.Bugs fixed (from #17447)
retain_every_n_turns > 1—on_session_switchcleared_session_turnswithout flushing. Any in-flight batch disappeared at session switch time. Same data-loss class as the shutdown race, different lifecycle event. Reproduced on current main via bare-object repro:_session_turns=['a','b']→on_session_switch('new')→ buffer is[]with zero flush.queue_prefetchran in the old session andprefetch()hadn't consumed the result, the new session's firstprefetch()returned text mined from the prior session's bank.Follow-up in this salvage (second commit)
Nicolò's original fix spawned a raw
threading.Threadfor the flush, overwritingself._sync_thread(which is aliased to the long-lived writer thread). Two issues:sync_turnenqueues retains on_retain_queuedrained by the long-lived_writer_thread. The raw-thread flush ran concurrently with the writer — two threads could callaretain_batchagainst the samedocument_id.self._sync_thread.join(timeout=5.0)before spawn tried to join the long-lived writer, which never exits on its own, so it always timed out and never actually serialized anything.Fix: enqueue the flush closure on
_retain_queuevia_ensure_writer()+put(), the same pathsync_turnuses. Natural FIFO ordering behind any pending retains, no new thread, no broken join. Shutdown-aware (if not self._shutting_down.is_set()) so it can't enqueue during teardown.Tests
_prefetch_resultcleared, in-flight prefetch drained.test_flush_serializes_behind_pending_retains_via_writer_queueblocks the writer mid-retain with an Event and proves the flush lands FIFO behind the pending retain rather than racing it.tests/plugins/memory/test_hindsight_provider.py+tests/agent/test_memory_session_switch.py.document_idandsession:oldlineage tag; stale prefetch cleared; rotation completes.Commits
c3bbdc23d— original fix (authored by @nicoloboschi, cherry-picked)177a6905e— follow-up: route flush through writer queueCloses #17447.