fix(gateway): pre-mark sessions resume_pending before drain (#28217)#28576

Merged: teknium1 merged 1 commit into main from hermes/hermes-3ad7d98a on May 19, 2026

@teknium1

Salvage of #28217 by @LifeJiggy. Extends just-merged #27831 to cover the SIGKILL-during-drain edge case.

What: mark_resume_pending was previously called only AFTER the drain timeout expired. If the service manager hard-kills the process mid-drain (e.g. systemd TimeoutStopSec is exceeded before the drain completes), the post-drain marker is never written and in-flight sessions lose their durable resume hint.

How:

  • Pre-mark every running session as resume_pending BEFORE entering the drain wait, using restart_timeout / shutdown_timeout as the reason depending on the stop cause.
  • After drain completes gracefully (not timed out), call clear_resume_pending on sessions that finished during the window, so they don't carry a stale flag into the next boot.
  • Sentinels (pending agents that never started) are skipped — same logic as _interrupt_running_agents().
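The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `FakeSessionStore`, `PENDING_SENTINEL`, `stop_with_premark`, and the polling drain loop are hypothetical stand-ins, and the mapping of a restart to `restart_timeout` (and any other stop to `shutdown_timeout`) is an assumption based on the description. Only the names `mark_resume_pending` and `clear_resume_pending` come from the PR.

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical placeholder for a pending agent that never started.
PENDING_SENTINEL = object()


@dataclass
class FakeSessionStore:
    """Hypothetical stand-in for the gateway's durable session store."""

    resume_pending: dict = field(default_factory=dict)

    def mark_resume_pending(self, session_key, reason):
        self.resume_pending[session_key] = reason

    def clear_resume_pending(self, session_key):
        self.resume_pending.pop(session_key, None)


async def stop_with_premark(store, running_agents, drain_timeout, restarting):
    """Pre-mark, drain, then clear markers only if the drain was graceful."""
    # Assumed mapping of stop cause to reason string.
    reason = "restart_timeout" if restarting else "shutdown_timeout"

    # 1. Write the durable marker BEFORE waiting, skipping sentinels.
    premarked = [
        key for key, agent in running_agents.items()
        if agent is not PENDING_SENTINEL
    ]
    for key in premarked:
        store.mark_resume_pending(key, reason)

    # 2. Drain: wait until the pre-marked sessions finish or the timeout hits.
    #    A hard kill anywhere in this loop still leaves the markers on disk.
    loop = asyncio.get_running_loop()
    deadline = loop.time() + drain_timeout
    while any(k in running_agents for k in premarked) and loop.time() < deadline:
        await asyncio.sleep(0.01)
    timed_out = any(k in running_agents for k in premarked)

    # 3. Graceful completion: drop markers for sessions that finished,
    #    so no stale flag is carried into the next boot.
    if not timed_out:
        for key in premarked:
            if key not in running_agents:
                store.clear_resume_pending(key)
    return timed_out
```

The point of the ordering is that step 1 no longer depends on step 2 ever returning: if the process is killed during the wait, the next boot still finds the markers, and the only cost on the graceful path is the extra clear in step 3.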

Original PR: #28217

…vent data loss (#27856)

Pre-mark all running agent sessions as resume_pending BEFORE the drain
wait begins. If the service manager kills the process during the drain
window, the durable marker is already written so the next gateway boot
can recover in-flight sessions. On graceful drain completion, clear the
early markers for sessions that finished successfully.
@teknium1 merged commit e2a1a2b into main May 19, 2026
4 checks passed
@teknium1 deleted the hermes/hermes-3ad7d98a branch May 19, 2026 07:01
@github-actions

🔎 Lint report: hermes/hermes-3ad7d98a vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 8952 on HEAD, 8952 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4702 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch added labels May 19, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery)