Skip to content

fix(openviking): implement on_session_switch hook (#28296)#28445

Open
harshitAgr wants to merge 5 commits into
NousResearch:mainfrom
harshitAgr:fix/openviking-on-session-switch
Open

fix(openviking): implement on_session_switch hook (#28296)#28445
harshitAgr wants to merge 5 commits into
NousResearch:mainfrom
harshitAgr:fix/openviking-on-session-switch

Conversation

@harshitAgr

Copy link
Copy Markdown

Summary

Fixes #28296. OpenVikingMemoryProvider overrides on_session_end but inherits the base-class no-op for on_session_switch. When the agent rotates session_id (via /new, /branch, /reset, /resume, or context compression), _session_id stays stuck on the value initialize() captured. Subsequent sync_turn writes land in the already-closed old session, on_session_end tries to commit it a second time, and the new session never accumulates messages — so memory extraction never fires for it.

The fix mirrors the pattern Hindsight uses (PR #17508):

  1. Drain in-flight sync. Wait for any _sync_thread from the old session to finish writing under the OLD _session_id before mutating it — otherwise the commit below races the last message write.
  2. Commit the old session if it has turns. Same extraction semantics as on_session_end. Skip if empty (nothing to extract) or if the provider was never initialized.
  3. Drop stale prefetch. Drain any in-flight _prefetch_thread and clear _prefetch_result so the new session doesn't see recall text from before the switch.
  4. Rotate. Update _session_id to the new value and reset _turn_count = 0.

Commit failures on the old session are logged at WARN and swallowed — same posture as the existing on_session_end commit — so a flaky server can't strand the provider on the old id forever.

Test plan

  • pytest tests/plugins/memory/test_openviking_provider.py — 28/28 pass (21 existing + 7 new regression tests)
  • pytest tests/plugins/memory/ — 168/168 pass, no regressions in Hindsight/mem0/Supermemory

New tests cover:

  • happy path: commit + rotate
  • empty old session: no commit call
  • prefetch cache cleared on switch
  • in-flight _sync_thread joined before commit
  • no-op on empty new id
  • no-op when client missing
  • commit failure swallowed and rotate still completes

OpenVikingMemoryProvider only overrides on_session_end and inherits the
base-class no-op for on_session_switch. When the agent rotates session_id
(via /new, /branch, /reset, /resume, or context compression), the
provider's cached _session_id stays at the value initialize() captured.
All subsequent sync_turn writes then land in the already-closed old
session, and on_session_end tries to commit it a second time — the new
session never accumulates messages and never triggers memory extraction.

The fix mirrors the pattern Hindsight uses (NousResearch#17508):

  1. Wait for any in-flight sync thread to drain under the OLD _session_id
     before we mutate it, otherwise the commit below races the last
     message write.
  2. Commit the old session if it accumulated turns — same extraction
     semantics as on_session_end. Skip if empty (nothing to extract).
  3. Drain in-flight prefetch from the old session and clear its cached
     result so the new session doesn't see stale recall.
  4. Rotate _session_id to the new value and reset _turn_count.

Commit failures are swallowed (logged at WARN) so a flaky server can't
strand the provider on the old session forever — same posture as the
existing on_session_end commit.
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins tool/memory Memory tool and memory providers labels May 19, 2026

@ehz0ah ehz0ah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up. The direction is right: OpenViking does need to handle on_session_switch() so it does not keep writing to the old Hermes session.

I think this needs two small hardening changes before merge:

  1. In sync_turn(), capture the target session id before starting the background thread, e.g. sid = str(session_id or self._session_id).strip(), and have the worker use that fixed value. Otherwise a delayed worker can read _session_id after rotation and write the previous turn into the new session.

  2. Make the old-session commit path idempotent with existing commit_memory_session() calls. /new and compression already call on_session_end() before on_session_switch(), so after a successful commit the provider should mark the old session clean instead of letting the switch hook commit it again.

With those changes, this stays focused on #28296 while closing the remaining session-boundary race.

…sion_end

Two hardening fixes prompted by review on NousResearch#28296:

1. sync_turn() now snapshots the target session id before spawning the
   worker. The previous code read self._session_id inside the worker, so
   a worker delayed past on_session_switch's bounded join could read the
   rotated-in NEW id and write the OLD turn's messages into the wrong
   session.

2. on_session_end() resets _turn_count to 0 after a successful commit,
   making the old-session commit path idempotent with the new switch
   hook. /new and compression call commit_memory_session() (which fires
   on_session_end) immediately before on_session_switch; without this,
   the old session would be committed twice. On commit failure we leave
   _turn_count > 0 so on_session_switch retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshitAgr

Copy link
Copy Markdown
Author

Both addressed in 2ea8d5c:

  1. sync_turn() now snapshots sid = str(session_id or self._session_id).strip() before spawning the worker, and the closure uses the captured value. Returns early on blank.

  2. on_session_end() resets _turn_count = 0 after a successful commit. The existing if old_turn_count > 0 guard in on_session_switch then skips the redundant commit on the /new and compression paths. Commit-failure path leaves the count intact so the switch hook still retries.

Tests added: worker-rotation race (worker parked mid-call, _session_id mutated, asserts both writes still target the old sid); on_session_end clean-mark on success and dirty-keep on failure; end→switch sequence asserts exactly one commit call on the old session.

@harshitAgr

Copy link
Copy Markdown
Author

@ehz0ah

@ehz0ah ehz0ah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The two earlier points are addressed: sync_turn() now captures the target sid before spawning the worker, and on_session_end() marks the session clean after a successful commit.

I think there is still one session-boundary race left. on_session_end() / on_session_switch() join _sync_thread for 10s, but the sync worker can spend up to two 30s OpenViking POSTs. If the worker is still alive after the bounded join, the old session can be committed and marked clean while that worker later writes old-session messages. Those late writes would not be extracted unless another commit happens for the old sid.

A focused fix would be to make the commit path conditional on the writer actually finishing for that session: if the writer is still alive after the join, do not mark the session clean, and avoid treating that old session as safely committed.

While touching this area, on_memory_write() has the same late _session_id capture shape, so it should snapshot the target sid before spawning its worker. Prefetch should also ignore results from an old generation after a switch so stale recall cannot repopulate the new session.

Three follow-ups from review on NousResearch#28296:

1. Sync worker outliving the bounded join. Each sync_turn POST has
   _TIMEOUT=30s and there are two per turn, but on_session_end and
   on_session_switch only join for 10s. If the worker is still alive
   after the join, committing the old session orphans the worker's
   late writes past the commit boundary — they land in an already-
   committed session and never get extracted. Both hooks now re-check
   is_alive() after the join and skip the commit when the worker
   hasn't drained.

2. on_memory_write late session_id capture. Same shape as the
   pre-fix sync_turn: f-string for the post path read self._session_id
   inside the worker, so a switch between thread spawn and post call
   landed the memory note in the new session. Snapshot sid at call
   time, same pattern as sync_turn.

3. Stale prefetch repopulating the new session. The pre-switch
   drain+clear only protects against workers that finish before the
   join completes; one finishing after the clear would write its
   result into the new generation's slot. Added a monotonic
   _prefetch_generation; workers capture it at spawn and refuse to
   write if it has advanced.

Tests: existing in-flight-sync test updated to drain (it tested the
join-before-commit happy path); four new tests cover hung-writer skip
on end + switch, on_memory_write sid capture, and prefetch generation
gating. 177/177 memory tests pass.
@harshitAgr

Copy link
Copy Markdown
Author

All three addressed in 3791a87:

  1. Hung sync worker. on_session_end and on_session_switch now re-check _sync_thread.is_alive() after the 10s join. If still alive, skip the commit and log a WARN — committing while the worker is mid-flight would orphan its late writes past the commit boundary (extraction never re-runs for that sid). On on_session_end we leave _turn_count intact so the session stays dirty; on on_session_switch rotation still proceeds with a clear log naming the orphaned sid.

  2. on_memory_write late capture. Snapshots sid at call time before spawning the worker, mirroring the sync_turn fix from the last round. Early-returns on blank sid.

  3. Stale prefetch. Added a monotonic _prefetch_generation. Prefetch workers capture the value at spawn and bail under _prefetch_lock if it has advanced. on_session_switch bumps the generation before joining/clearing — so a worker that finishes after the clear can't repopulate the new session's slot.

Tests added: hung-writer skip on on_session_end and on_session_switch, on_memory_write sid capture (blocking in the stub ctor so rotation deterministically beats f-string evaluation), prefetch generation gating. One existing test (test_on_session_switch_waits_for_inflight_sync_thread) updated so its FakeThread actually drains after join() — it was testing the join-before-commit happy path, not the hung case. 177/177 memory tests pass.

@ehz0ah ehz0ah left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the direct case where the currently tracked _sync_thread outlives the 10s join, but I think the write/commit race is still not fully closed.

sync_turn() can lose track of an older writer before session end/switch: it joins the previous _sync_thread for 5s, but if that thread is still alive, it still replaces self._sync_thread with the next worker. After that, on_session_end() / on_session_switch() only check the latest worker, while the older worker can still write into the same OpenViking session after commit.

The commit path needs to account for all in-flight writes for that session, not only the latest _sync_thread.

Small cleanup while touching this: some of the new comments explain the full review rationale inline. I’d trim them to the invariant the code needs to preserve, e.g. “commit only after session writes drain” and “drop prefetch results from older switch generations.”

sync_turn's bounded join could drop a still-alive previous worker by
replacing the single _sync_thread slot. The dropped worker kept POSTing
under the old sid but was no longer visible to on_session_end /
on_session_switch, so the commit could fire while orphaned writes were
still in flight — those writes landed past the commit boundary and were
never extracted.

Replace the single _sync_thread slot with _inflight_writers:
Dict[sid, Set[Thread]]. Writers self-register on spawn (sync_turn,
on_memory_write) and self-deregister on exit. The commit path drains
_drain_writers(sid, 10.0) and skips the commit if any writer for that
sid is still alive after the bounded budget.

Also trim inline review-rationale comments to short invariants per
reviewer style ask: "commit only after session writes drain" and
"drop prefetch results from older switch generations."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshitAgr

Copy link
Copy Markdown
Author

Addressed in 7537ee6:

  1. Orphaned writer. Replaced the single _sync_thread slot with _inflight_writers: Dict[sid, Set[Thread]]. sync_turn and on_memory_write now register their workers via _spawn_writer(sid, ...) and the workers self-deregister on exit. on_session_end / on_session_switch call _drain_writers(sid, 10.0) which joins every still-alive writer for that sid under a shared budget — including ones that previous sync_turn calls used to silently drop after the 5s join timeout. If any writer outlives the drain, the commit is skipped. Tests cover the case directly (test_on_session_{end,switch}_waits_for_all_writers_not_just_latest, plus test_sync_turn_tracks_writer_under_session_id).

  2. Comment trim. Verbose review-rationale comments collapsed to the invariants — "commit only after session writes drain" and "drop prefetch results from older switch generations."

40/40 openviking provider tests pass; 277/277 across memory.

Resolve conflict in plugins/memory/openviking/__init__.py: keep our
on_session_switch hook (the PR feature) and adopt main's new
content/write-based on_memory_write (URI via _build_memory_uri,
metadata param, bare threading.Thread — no longer session-scoped).

Drop test_on_memory_write_captures_session_id_at_call_time; it
guarded the old /api/v1/sessions/{sid}/messages payload that no
longer exists under main's new memory-write design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@harshitAgr

Copy link
Copy Markdown
Author

@ehz0ah @alt-glitch Please review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have tool/memory Memory tool and memory providers type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OpenVikingMemoryProvider missing on_session_switch — session ID goes stale after /new

3 participants