
fix(gateway): remove expired sessions from disk after proactive memory flush#4362

Closed
insecurejezza wants to merge 1 commit into NousResearch:main from insecurejezza:fix/session-expiry-flush-cleanup

@insecurejezza

Contributor

What does this PR do?

Fixes a bug where the gateway becomes unresponsive after a restart because the proactive memory flush loop re-flushes expired sessions that were already flushed before the restart.

The _session_expiry_watcher (introduced in d80c30c) flushes memories for expired sessions to Honcho, then records them in the in-memory _pre_flushed_sessions set to prevent double-flushing within a single process lifetime. However, this set is never persisted to disk.

On gateway restart, the set resets to empty while the expired session entries remain in sessions.json. The watcher rediscovers and re-flushes the same sessions on every restart, blocking the event loop and making all platforms (Discord, Telegram, etc.) unresponsive for 10-15+ minutes per restart cycle.
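The failure mode can be reproduced with a small simulation. This is a sketch, not the gateway code: BuggyWatcher, simulate_restarts, _flush_to_honcho and the "expired" field in the JSON are hypothetical stand-ins, and only the _pre_flushed_sessions name comes from the PR. Each restart is modeled as a fresh watcher object, so the in-memory guard starts empty and every expired session is flushed again.

```python
import json
import os
import tempfile


class BuggyWatcher:
    """Hypothetical stand-in for _session_expiry_watcher before the fix.

    The dedup guard lives only in memory, so it is empty in every new process.
    """

    def __init__(self, sessions_path):
        self.sessions_path = sessions_path
        self._pre_flushed_sessions = set()  # in-memory only: lost on restart
        self.flush_count = 0

    def _flush_to_honcho(self, session_id):
        # Placeholder for the real (slow, blocking) memory flush.
        self.flush_count += 1

    def run_once(self):
        with open(self.sessions_path) as f:
            entries = json.load(f)  # hypothetical schema: {"id": {"expired": bool}}
        for session_id, entry in entries.items():
            if entry.get("expired") and session_id not in self._pre_flushed_sessions:
                self._flush_to_honcho(session_id)
                self._pre_flushed_sessions.add(session_id)
        # sessions.json is never rewritten, so the expired entries stay on disk.


def simulate_restarts(n_restarts, n_expired):
    """Return the total number of flushes across n_restarts process lifetimes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sessions.json")
        with open(path, "w") as f:
            json.dump({f"s{i}": {"expired": True} for i in range(n_expired)}, f)
        total = 0
        for _ in range(n_restarts):
            watcher = BuggyWatcher(path)  # new process -> empty guard set
            watcher.run_once()
            watcher.run_once()  # second tick in the same process is deduplicated
            total += watcher.flush_count
        return total


print(simulate_restarts(n_restarts=3, n_expired=30))  # 90
```

Within one process the second tick flushes nothing, which is why the bug only shows up across restarts: 30 expired sessions cost 30 flushes per restart instead of 30 in total.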

Related Issue

No existing issue. Discovered in production when a gateway restart triggered a cascade of ~30 expired-session flushes that repeated on every subsequent restart.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py: After a successful proactive memory flush, remove the expired session entry from the session store on disk (_entries.pop() + _save()). This ensures flushed sessions are never re-discovered on restart.
  • tests/gateway/test_async_memory_flush.py: Two regression tests: one verifying that entries are removed from disk after a flush, and one demonstrating the original bug (the in-memory guard is lost on restart).
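A minimal sketch of the fix, assuming a JSON-backed store. SessionStore, flush_expired, the threading lock and the "expired" field are hypothetical stand-ins; only the _entries.pop() + _save() pattern comes from the PR. The entry is removed only after its flush succeeds, so a failed flush leaves it on disk for a later retry, and each removal is persisted before the next session is processed.

```python
import json
import os
import tempfile
import threading


class SessionStore:
    """Hypothetical minimal store exposing _entries and _save(), as named in the PR."""

    def __init__(self, path):
        self.path = path
        self._lock = threading.Lock()
        if os.path.exists(path):
            with open(path) as f:
                self._entries = json.load(f)
        else:
            self._entries = {}

    def _save(self):
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self._entries, f)
        os.replace(tmp_path, self.path)


def flush_expired(store, flush):
    """Flush each expired session, then drop it from disk on success (hypothetical helper)."""
    with store._lock:
        expired = [sid for sid, e in store._entries.items() if e.get("expired")]
    flushed = []
    for sid in expired:
        try:
            flush(sid)
        except Exception:
            continue  # keep the entry so a later watcher tick can retry
        with store._lock:
            store._entries.pop(sid, None)
            store._save()  # persist the removal so a restart cannot rediscover it
        flushed.append(sid)
    return flushed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sessions.json")
    with open(path, "w") as f:
        json.dump({"a": {"expired": True}, "b": {"expired": True}, "c": {"expired": False}}, f)
    calls = []
    flush_expired(SessionStore(path), calls.append)
    restarted = SessionStore(path)  # simulates a gateway restart
    print(sorted(calls), sorted(restarted._entries))  # ['a', 'b'] ['c']
```

Rewriting the whole file on every removal is simple but costs a full write per flushed session; batching the pops and saving once per watcher tick would mean fewer writes at the price of re-flushing a few sessions if the process dies mid-batch.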

How to Test

  1. Start the gateway with Honcho enabled and several expired sessions in sessions.json
  2. Observe the watcher flush them once
  3. Restart the gateway
  4. Confirm the flushed sessions are NOT re-discovered or re-flushed

```bash
source .venv/bin/activate
python -m pytest -q tests/gateway/test_async_memory_flush.py
```

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes

The session expiry watcher (_session_expiry_watcher) flushes memories for
expired sessions to Honcho, then marks them in the in-memory
_pre_flushed_sessions set to prevent double-flushing within a single
process lifetime.

However, _pre_flushed_sessions is never persisted to disk. On gateway
restart, the set resets to empty while the expired session entries remain
in sessions.json. The watcher rediscovers and re-flushes the same
sessions, blocking the event loop and making all platforms (Discord, etc.)
unresponsive for 10-15+ minutes per restart cycle.

Fix: after a successful flush, pop the entry from the session store and
persist the removal to disk. This ensures flushed expired sessions are
never re-discovered on restart.

Adds two regression tests:
- Verifies flushed entries are removed from disk
- Demonstrates the original bug (in-memory guard lost on restart)

@teknium1

teknium1 commented Apr 1, 2026

Contributor

Thanks @insecurejezza! This bug was real: the in-memory _pre_flushed_sessions set was indeed lost on restart, causing redundant re-flushes.

Merged via #4481 with a slightly different approach: instead of removing entries from sessions.json (which loses the session metadata useful for debugging), we persist a memory_flushed flag on the entry itself. Your observation about the event loop blocking and the lock usage informed the final fix. Appreciate the contribution!
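The merged approach can be sketched the same way. This illustrates the idea described in the comment and is not the code from #4481: the memory_flushed flag name comes from the comment, while load, save, flush_expired_with_flag and the "expired"/"platform" fields are hypothetical. Instead of deleting the entry, the watcher sets a persisted flag and skips flagged entries, so a restarted process sees the flag and the session metadata survives for debugging.

```python
import json
import os
import tempfile


def load(path):
    with open(path) as f:
        return json.load(f)


def save(path, entries):
    # Atomic replace so readers never see a partially written sessions file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(entries, f)
    os.replace(tmp_path, path)


def flush_expired_with_flag(path, flush):
    """Flush expired sessions once, recording success in a persisted flag (hypothetical helper)."""
    entries = load(path)
    flushed = []
    for sid, entry in entries.items():
        if not entry.get("expired") or entry.get("memory_flushed"):
            continue  # still active, or already flushed by an earlier process
        flush(sid)
        entry["memory_flushed"] = True  # metadata stays; only the flag is added
        save(path, entries)  # persist immediately so a crash mid-loop loses little
        flushed.append(sid)
    return flushed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sessions.json")
    save(path, {"a": {"expired": True, "platform": "discord"}, "b": {"expired": False}})
    first = flush_expired_with_flag(path, lambda sid: None)
    second = flush_expired_with_flag(path, lambda sid: None)  # e.g. after a restart
    kept = load(path)
    print(first, second)  # ['a'] []
    print(kept["a"]["platform"])  # discord
```

The trade-off versus deletion is that sessions.json keeps growing with flagged entries, so some separate pruning policy would eventually be needed; in exchange, the expired session's metadata remains available when debugging.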

teknium1 closed this Apr 1, 2026