fix(openviking): implement on_session_switch hook (#28296)#28445
fix(openviking): implement on_session_switch hook (#28296)#28445harshitAgr wants to merge 5 commits into
Conversation
OpenVikingMemoryProvider only overrides on_session_end and inherits the base-class no-op for on_session_switch. When the agent rotates session_id (via /new, /branch, /reset, /resume, or context compression), the provider's cached _session_id stays at the value initialize() captured. All subsequent sync_turn writes then land in the already-closed old session, and on_session_end tries to commit it a second time — the new session never accumulates messages and never triggers memory extraction. The fix mirrors the pattern Hindsight uses (NousResearch#17508): 1. Wait for any in-flight sync thread to drain under the OLD _session_id before we mutate it, otherwise the commit below races the last message write. 2. Commit the old session if it accumulated turns — same extraction semantics as on_session_end. Skip if empty (nothing to extract). 3. Drain in-flight prefetch from the old session and clear its cached result so the new session doesn't see stale recall. 4. Rotate _session_id to the new value and reset _turn_count. Commit failures are swallowed (logged at WARN) so a flaky server can't strand the provider on the old session forever — same posture as the existing on_session_end commit.
ehz0ah
left a comment
There was a problem hiding this comment.
Thanks for picking this up. The direction is right: OpenViking does need to handle on_session_switch() so it does not keep writing to the old Hermes session.
I think this needs two small hardening changes before merge:
-
In
sync_turn(), capture the target session id before starting the background thread, e.g.sid = str(session_id or self._session_id).strip(), and have the worker use that fixed value. Otherwise a delayed worker can read_session_idafter rotation and write the previous turn into the new session. -
Make the old-session commit path idempotent with existing
commit_memory_session()calls./newand compression already callon_session_end()beforeon_session_switch(), so after a successful commit the provider should mark the old session clean instead of letting the switch hook commit it again.
With those changes, this stays focused on #28296 while closing the remaining session-boundary race.
…sion_end Two hardening fixes prompted by review on NousResearch#28296: 1. sync_turn() now snapshots the target session id before spawning the worker. The previous code read self._session_id inside the worker, so a worker delayed past on_session_switch's bounded join could read the rotated-in NEW id and write the OLD turn's messages into the wrong session. 2. on_session_end() resets _turn_count to 0 after a successful commit, making the old-session commit path idempotent with the new switch hook. /new and compression call commit_memory_session() (which fires on_session_end) immediately before on_session_switch; without this, the old session would be committed twice. On commit failure we leave _turn_count > 0 so on_session_switch retries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Both addressed in 2ea8d5c:
Tests added: worker-rotation race (worker parked mid-call, |
ehz0ah
left a comment
There was a problem hiding this comment.
Thanks for the update. The two earlier points are addressed: sync_turn() now captures the target sid before spawning the worker, and on_session_end() marks the session clean after a successful commit.
I think there is still one session-boundary race left. on_session_end() / on_session_switch() join _sync_thread for 10s, but the sync worker can spend up to two 30s OpenViking POSTs. If the worker is still alive after the bounded join, the old session can be committed and marked clean while that worker later writes old-session messages. Those late writes would not be extracted unless another commit happens for the old sid.
A focused fix would be to make the commit path conditional on the writer actually finishing for that session: if the writer is still alive after the join, do not mark the session clean, and avoid treating that old session as safely committed.
While touching this area, on_memory_write() has the same late _session_id capture shape, so it should snapshot the target sid before spawning its worker. Prefetch should also ignore results from an old generation after a switch so stale recall cannot repopulate the new session.
Three follow-ups from review on NousResearch#28296: 1. Sync worker outliving the bounded join. Each sync_turn POST has _TIMEOUT=30s and there are two per turn, but on_session_end and on_session_switch only join for 10s. If the worker is still alive after the join, committing the old session orphans the worker's late writes past the commit boundary — they land in an already- committed session and never get extracted. Both hooks now re-check is_alive() after the join and skip the commit when the worker hasn't drained. 2. on_memory_write late session_id capture. Same shape as the pre-fix sync_turn: f-string for the post path read self._session_id inside the worker, so a switch between thread spawn and post call landed the memory note in the new session. Snapshot sid at call time, same pattern as sync_turn. 3. Stale prefetch repopulating the new session. The pre-switch drain+clear only protects against workers that finish before the join completes; one finishing after the clear would write its result into the new generation's slot. Added a monotonic _prefetch_generation; workers capture it at spawn and refuse to write if it has advanced. Tests: existing in-flight-sync test updated to drain (it tested the join-before-commit happy path); four new tests cover hung-writer skip on end + switch, on_memory_write sid capture, and prefetch generation gating. 177/177 memory tests pass.
|
All three addressed in 3791a87:
Tests added: hung-writer skip on |
There was a problem hiding this comment.
This fixes the direct case where the currently tracked _sync_thread outlives the 10s join, but I think the write/commit race is still not fully closed.
sync_turn() can lose track of an older writer before session end/switch: it joins the previous _sync_thread for 5s, but if that thread is still alive, it still replaces self._sync_thread with the next worker. After that, on_session_end() / on_session_switch() only check the latest worker, while the older worker can still write into the same OpenViking session after commit.
The commit path needs to account for all in-flight writes for that session, not only the latest _sync_thread.
Small cleanup while touching this: some of the new comments explain the full review rationale inline. I’d trim them to the invariant the code needs to preserve, e.g. “commit only after session writes drain” and “drop prefetch results from older switch generations.”
sync_turn's bounded join could drop a still-alive previous worker by replacing the single _sync_thread slot. The dropped worker kept POSTing under the old sid but was no longer visible to on_session_end / on_session_switch, so the commit could fire while orphaned writes were still in flight — those writes landed past the commit boundary and were never extracted. Replace the single _sync_thread slot with _inflight_writers: Dict[sid, Set[Thread]]. Writers self-register on spawn (sync_turn, on_memory_write) and self-deregister on exit. The commit path drains _drain_writers(sid, 10.0) and skips the commit if any writer for that sid is still alive after the bounded budget. Also trim inline review-rationale comments to short invariants per reviewer style ask: "commit only after session writes drain" and "drop prefetch results from older switch generations." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed in 7537ee6:
40/40 openviking provider tests pass; 277/277 across memory. |
Resolve conflict in plugins/memory/openviking/__init__.py: keep our
on_session_switch hook (the PR feature) and adopt main's new
content/write-based on_memory_write (URI via _build_memory_uri,
metadata param, bare threading.Thread — no longer session-scoped).
Drop test_on_memory_write_captures_session_id_at_call_time; it
guarded the old /api/v1/sessions/{sid}/messages payload that no
longer exists under main's new memory-write design.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@ehz0ah @alt-glitch Please review. |
Summary
Fixes #28296.
OpenVikingMemoryProvideroverrideson_session_endbut inherits the base-class no-op foron_session_switch. When the agent rotatessession_id(via/new,/branch,/reset,/resume, or context compression),_session_idstays stuck on the valueinitialize()captured. Subsequentsync_turnwrites land in the already-closed old session,on_session_endtries to commit it a second time, and the new session never accumulates messages — so memory extraction never fires for it.The fix mirrors the pattern Hindsight uses (PR #17508):
_sync_threadfrom the old session to finish writing under the OLD_session_idbefore mutating it — otherwise the commit below races the last message write.on_session_end. Skip if empty (nothing to extract) or if the provider was never initialized._prefetch_threadand clear_prefetch_resultso the new session doesn't see recall text from before the switch._session_idto the new value and reset_turn_count = 0.Commit failures on the old session are logged at WARN and swallowed — same posture as the existing
on_session_endcommit — so a flaky server can't strand the provider on the old id forever.Test plan
pytest tests/plugins/memory/test_openviking_provider.py— 28/28 pass (21 existing + 7 new regression tests)pytest tests/plugins/memory/— 168/168 pass, no regressions in Hindsight/mem0/SupermemoryNew tests cover:
_sync_threadjoined before commit