fix(gateway): auto-resume interrupted restart sessions #18896

Closed
juanfradb wants to merge 1 commit into NousResearch:main from juanfradb:codex/gateway-auto-resume-restart

Conversation

@juanfradb juanfradb commented May 2, 2026

Summary

  • schedule an internal continuation turn for fresh sessions explicitly marked resume_pending by a restart/shutdown drain timeout
  • keep auto-resume scoped to explicit resume_pending markers so generic old sessions are not revived
  • preserve resume_pending while the gateway is draining, so an interrupt acknowledgement during shutdown cannot erase the recovery marker
  • stop streaming retry/status reconnect paths once an interrupt has already been requested
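The first three points can be sketched as follows. This is a minimal illustration and not the code in this PR: the `Session` record, `sessions_to_resume`, `acknowledge_interrupt`, and the `max_age_s` freshness window are hypothetical names and values. Only the `resume_pending` marker comes from the description above, and the sketch assumes an interrupt acknowledgement would normally clear that marker.

```python
from dataclasses import dataclass


# Hypothetical session record. Only resume_pending is named in the PR;
# the other fields are assumptions made for this sketch.
@dataclass
class Session:
    key: str
    resume_pending: bool = False
    updated_at: float = 0.0


def sessions_to_resume(sessions, now, max_age_s=300.0):
    """Select fresh sessions that carry an explicit resume_pending marker.

    Sessions without the marker are never revived, however recent they are.
    """
    return [
        s for s in sessions
        if s.resume_pending and (now - s.updated_at) <= max_age_s
    ]


def acknowledge_interrupt(session, draining):
    """Handle an interrupt acknowledgement for a session.

    While the gateway is draining, the marker is left in place so that the
    next startup can still find the session and resume it.
    """
    if not draining:
        session.resume_pending = False
    return session
```

With this shape, a stale session or one without the marker is skipped at startup, and an acknowledgement that arrives during shutdown leaves the marker untouched.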

Related

Tests

  • /home/juan/.hermes/hermes-agent/venv/bin/python -m pytest tests/gateway/test_restart_drain.py tests/run_agent/test_stream_interrupt_retry.py -q
  • /home/juan/.hermes/hermes-agent/venv/bin/python -m py_compile gateway/run.py run_agent.py tests/gateway/test_restart_drain.py

@juanfradb juanfradb force-pushed the codex/gateway-auto-resume-restart branch from 0e80abd to ce247d9 Compare May 2, 2026 16:42
@alt-glitch alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), and comp/agent (Core agent loop, run_agent.py, prompt builder) labels May 2, 2026
@juanfradb juanfradb force-pushed the codex/gateway-auto-resume-restart branch from ce247d9 to 87c8a4f Compare May 3, 2026 14:10
@teknium1

Contributor

This looks implemented on current main by the later restart-resume work. This is an automated hermes-sweeper review.

Evidence:

  • gateway/run.py:4390 defines _schedule_resume_pending_sessions(), scoped to explicit resume_pending entries that are not suspended, have an origin, and whose resume_reason is one of the restart/shutdown/crash interruption reasons.
  • gateway/run.py:4454 synthesizes the internal continuation MessageEvent through the normal adapter pipeline, and gateway/run.py:4962 calls the scheduler during gateway startup.
  • gateway/run.py:1789 and gateway/run.py:8470 preserve resume_pending unless the turn really completed successfully, so interrupted/failed/partial turns do not erase the recovery marker.
  • agent/chat_completion_helpers.py:2160 and agent/chat_completion_helpers.py:2177 stop streaming retry/reconnect paths once an interrupt has been requested or a request was force-cancelled.
  • Coverage exists in tests/gateway/test_restart_resume_pending.py:862 and following tests for fresh scheduling, stale skips, adapter-unavailable skips, platform-scoped reconnect retry, and duplicate-running-agent suppression.
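For readers unfamiliar with the streaming behavior mentioned in the evidence, a rough sketch of a retry loop that stops once an interrupt has been requested might look like this. It is not the code in agent/chat_completion_helpers.py. The function name `stream_with_retry`, the `open_stream` callable, and the use of `threading.Event` as the interrupt flag are all assumptions for illustration.

```python
import threading


def stream_with_retry(open_stream, interrupt, max_attempts=3):
    """Call open_stream(attempt) with retries, stopping once interrupted.

    open_stream is a hypothetical callable that raises ConnectionError when
    the stream drops. interrupt is a threading.Event set by whoever requests
    the interrupt. Returns None if the interrupt stopped the retries.
    """
    last_error = None
    for attempt in range(max_attempts):
        if interrupt.is_set():
            break  # an interrupt was requested: do not reconnect
        try:
            return open_stream(attempt)
        except ConnectionError as exc:
            last_error = exc
    if interrupt.is_set():
        return None
    if last_error is None:
        raise RuntimeError("no attempts were made")
    raise last_error
```

The flag is checked before every attempt, so a drop that happens after the interrupt request ends the loop instead of triggering another reconnect.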

Mainline commits carrying the implementation include fad684b1f35baa20b2b01556e50bec24ce6ffccd, df11f53190cfb188f3878876fc171b9792d41043, 38b1c7dce558f7ad1077b89e1efd3217bf8d6c69, and the streaming interrupt retry fix dd0d1222a247c4e815f2dbee3b88736ca5440976.

@teknium1 teknium1 closed this Jun 11, 2026
@teknium1 teknium1 added the sweeper:implemented-on-main (Sweeper: behavior already present on current main) label Jun 11, 2026