Skip to content

fix(gateway): retry startup auto-resume when a failed platform reconnects#37669

Closed
Frowtek wants to merge 1 commit into
NousResearch:mainfrom
Frowtek:fix/late-reconnect-auto-resume
Closed

fix(gateway): retry startup auto-resume when a failed platform reconnects#37669
Frowtek wants to merge 1 commit into
NousResearch:mainfrom
Frowtek:fix/late-reconnect-auto-resume

Conversation

@Frowtek

@Frowtek Frowtek commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

The gateway's documented startup auto-resume skips sessions whose platform
adapter isn't connected yet — and never retries them. A platform that connects
shortly after startup left its interrupted sessions behind; they only recovered
if the user sent a new message. This wires the auto-resume into the reconnect
path so a recovered platform retries the continuation for its own sessions.

Related Issue


Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py_schedule_resume_pending_sessions() gains an optional
    platform filter + an in-flight guard against double resumes.
  • gateway/run.py — the reconnect-watcher success path re-runs the auto-resume
    scoped to the reconnected platform.
  • tests/gateway/test_restart_resume_pending.py,
    tests/gateway/test_platform_reconnect.py — regression coverage.

How to Test

  1. Mark a session resume_pending while its adapter is offline at startup → skipped.
  2. Bring the platform online → reconnect schedules the resume for that session.
  3. scripts/run_tests.sh tests/gateway/test_restart_resume_pending.py tests/gateway/test_platform_reconnect.py -q — all pass.

Checklist

  • Read the Contributing Guide
  • Commit messages follow Conventional Commits (fix(gateway): ...)
  • Searched for duplicate PRs
  • Only changes related to this fix
  • Tests pass and I've added regression tests
  • Cross-platform impact — N/A (pure async control-flow)

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Jun 2, 2026
@mohamedorigami-jpg

Copy link
Copy Markdown
Contributor

Clean separation of startup resume vs reconnect resume. The platform scoping is the right approach — reconnects should only retry their own sessions.

One thing worth checking: the entry.session_key in self._running_agents guard on the reconnect path assumes the original agent has already been launched and stored in _running_agents before the reconnect fires. If a platform is flapping (disconnect→reconnect in rapid succession), there's a timing window where the first reconnect triggers resume, the agent fires, the platform drops again, the second reconnect fires before the first agent is stored in _running_agents — you'd get duplicate resumes for the same session. A transition counter (disconnect_count incremented on each drop, matched on reconnect) would close that gap if flappy platforms are a real concern for your deployment.

@teknium1

teknium1 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Merged via PR #39018 — your commit was cherry-picked onto current main with your authorship preserved in git log (rebase-merge, commit 71a9f44). Thanks @Frowtek!

@xiaoyaner0201

Copy link
Copy Markdown

Cherry-picked this onto a v0.15.2-based fork (Discord gateway) alongside #30030. Resolves a restart-resume amnesia we were hitting — a session whose platform reconnected slightly after startup never got its interrupted turn resumed until a fresh user message arrived. The test_restart_resume_pending + test_platform_reconnect suites pass. Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants