
fix(gateway): cap memory flush retries at 3 to prevent infinite loop #5224

Closed
nibzard wants to merge 1 commit into NousResearch:main from nibzard:fix/flush-infinite-retry-loop

@nibzard nibzard commented Apr 5, 2026

Contributor

Problem

The _session_expiry_watcher in gateway/run.py retries failed memory flushes forever. When _async_flush_memories raises (rate limit, network error), the exception is caught and logged at debug level, but memory_flushed is never set to True, so the same expired session is retried every 5 minutes indefinitely.

This burns API quota and triggers 429 rate limit cascades that block all gateway message processing (Telegram, Discord, etc.), making the bot appear unresponsive.

Observed case: A March 19 session retried 28+ times over ~17 days, causing repeated 429 errors that made Telegram completely unresponsive.

Fix

Add a per-session failure counter (_flush_failures) in the watcher loop. After 3 consecutive failures for the same session:

  1. Log a warning explaining the give-up
  2. Set memory_flushed = True and persist to disk
  3. Remove from failure tracker

This breaks the infinite retry loop while still allowing transient failures to recover (up to 3 attempts at 5-min intervals = ~15 min grace period).
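The capped-retry logic described above can be sketched roughly as follows. This is a simplified, synchronous illustration and not the actual code in gateway/run.py: the `Session` class, `watcher_tick`, the `flush` and `persist` callables, and `MAX_FLUSH_FAILURES` are hypothetical names introduced for the example. Only the `_flush_failures` counter idea, the `memory_flushed` flag, and the limit of 3 come from the description.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, List

logger = logging.getLogger(__name__)

MAX_FLUSH_FAILURES = 3  # cap stated in the PR description


@dataclass
class Session:
    # Hypothetical stand-in for the gateway's session record.
    session_id: str
    memory_flushed: bool = False


def watcher_tick(
    expired: List[Session],
    flush: Callable[[Session], None],
    flush_failures: Dict[str, int],
    persist: Callable[[Session], None] = lambda s: None,
) -> None:
    """Run one pass of a simplified expiry watcher over expired sessions."""
    for session in expired:
        if session.memory_flushed:
            continue  # already flushed or given up on
        try:
            flush(session)
        except Exception as exc:
            # Count the failure for this session only.
            count = flush_failures.get(session.session_id, 0) + 1
            flush_failures[session.session_id] = count
            if count >= MAX_FLUSH_FAILURES:
                logger.warning(
                    "Giving up on memory flush for %s after %d failures: %s",
                    session.session_id, count, exc,
                )
                session.memory_flushed = True  # breaks the retry loop
                persist(session)
                flush_failures.pop(session.session_id, None)
            continue
        # Success: mark as flushed and clear any earlier failure count.
        session.memory_flushed = True
        persist(session)
        flush_failures.pop(session.session_id, None)
```

In this sketch a successful flush clears the counter, so only consecutive failures count toward the limit, and the tracker entry is removed in both the success and give-up paths so the dictionary does not grow without bound.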

Testing

  • All 9 existing flush tests pass (test_flush_memory_stale_guard.py)
  • All 19 session hygiene tests pass (test_session_hygiene.py)
  • Patch is scoped to the watcher loop only, no dataclass changes needed

The _session_expiry_watcher retried failed memory flushes forever
because exceptions were caught at debug level without setting
memory_flushed=True. Expired sessions with transient failures
(rate limits, network errors) would retry every 5 minutes
indefinitely, burning API quota and blocking gateway message
processing via 429 rate limit cascades.

Observed case: a March 19 session retried 28+ times over ~17 days,
causing repeated 429 errors that made Telegram unresponsive.

Add a per-session failure counter (_flush_failures) that gives up
after 3 consecutive attempts and marks the session as flushed to
break the loop.

teknium1 commented Apr 5, 2026

Contributor

Merged via PR #5288 (consolidated bugfix salvage). Your commit(s) were cherry-picked onto current main with your authorship preserved in git log. Thanks @nibzard for the fix!

@teknium1 teknium1 closed this Apr 5, 2026