
fix(gateway): write clean-shutdown marker before drain to preserve session context#11099

Closed
brantzh6 wants to merge 2 commits into
NousResearch:mainfrom
brantzh6:fix/early-clean-shutdown-marker

Conversation

@brantzh6

Contributor

Problem

When the gateway stops (hermes update, hermes gateway restart, /restart), it writes a .clean_shutdown marker after draining active agents. If the drain exceeds systemd's TimeoutStopSec (default 60s), systemd sends SIGKILL and the marker is never written.

On the next startup, suspend_recently_active() sees no marker, concludes it was a crash, and resets all recently-active sessions — wiping conversation context.
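The startup decision described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the `Session` record, the `startup_check` function, and the choice to delete the marker once it has been read are all assumptions made for the example. Only the `.clean_shutdown` file name and the "no marker means crash" rule come from the description.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

MARKER_NAME = ".clean_shutdown"  # marker file name taken from the PR text


@dataclass
class Session:
    # Hypothetical, simplified session record used only for this sketch.
    session_id: str
    recently_active: bool
    suspended: bool = False


def startup_check(state_dir: Path, sessions: list[Session]) -> bool:
    """Return True if the previous shutdown is judged clean.

    Hypothetical stand-in for the startup path that ends up calling
    suspend_recently_active() when no marker is found.
    """
    marker = state_dir / MARKER_NAME
    if marker.exists():
        # Assumption: the marker is consumed so it vouches for one shutdown only.
        marker.unlink()
        return True
    # No marker: treat the last exit as a crash and reset active sessions.
    for session in sessions:
        if session.recently_active:
            session.suspended = True
    return False


if __name__ == "__main__":
    state_dir = Path(tempfile.mkdtemp())
    sessions = [Session("chat-1", recently_active=True)]
    # The marker was never written (e.g. SIGKILL during drain), so this
    # reports an unclean shutdown and suspends the active session.
    print(startup_check(state_dir, sessions), sessions[0].suspended)
```

Under these assumptions, a SIGKILL that lands before the marker write is indistinguishable from a crash, which is the failure mode reported here.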

Impact on messaging users

For users on Telegram, Discord, Feishu, WeChat, and similar platforms, conversation continuity is critical. They expect the agent to remember what was just discussed. Losing context after a gateway restart is jarring and breaks trust:

  • User is mid-discussion about a complex task
  • Admin runs hermes update → gateway restarts
  • Drain takes >60s (common with long-running tool calls, multiple active agents)
  • systemd SIGKILLs the process
  • User returns, sends a follow-up message → blank slate, agent has no memory of the prior conversation

Evidence

In production over 3 days, we observed 6 SIGKILL events from systemd due to drain timeout:

Apr 14 14:05:39 systemd: Killing process 263 (python) with signal SIGKILL.
Apr 14 22:11:21 systemd: Killing process 231 (python) with signal SIGKILL.
Apr 15 08:03:03 python: Gateway drain timed out after 60.0s with 2 active agent(s)
Apr 15 08:03:04 systemd: Killing process 1412 (python) with signal SIGKILL.
Apr 15 09:00:58 systemd: Killing process 6263 (python) with signal SIGKILL.
Apr 16 12:31:19 python: Gateway drain timed out after 60.0s with 1 active agent(s)
Apr 16 12:31:29 systemd: hermes-gateway.service: Failed with result exit-code.

Without this fix, each of these events would cause session context loss for active users.

Solution

Move the .clean_shutdown marker write to before drain begins. Two-line change:

  1. Early write (top of _stop_impl()): Write marker immediately when stop begins, before any drain logic
  2. Late re-touch (after drain): Keep the existing write as a no-op for completeness

This guarantees the marker survives even if SIGKILL arrives during drain.
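A rough sketch of the reordered stop sequence, assuming an asyncio-based stop path. The function `stop_gateway`, the `drain` callable, and the `drain_timeout` parameter are hypothetical stand-ins for `_stop_impl()` and its drain logic, and the drain timeout here only simulates the window in which systemd could deliver SIGKILL.

```python
import asyncio
import tempfile
from pathlib import Path

MARKER_NAME = ".clean_shutdown"  # marker file name taken from the PR text


def write_marker(state_dir: Path) -> None:
    # touch() creates the file, or just updates its mtime if it already exists.
    (state_dir / MARKER_NAME).touch()


async def stop_gateway(state_dir: Path, drain, drain_timeout: float) -> bool:
    """Hypothetical stand-in for _stop_impl(); returns True if drain finished."""
    # Early write: the marker is on disk before any drain work starts,
    # so a SIGKILL during drain can no longer prevent it from existing.
    write_marker(state_dir)
    try:
        await asyncio.wait_for(drain(), timeout=drain_timeout)
        drained = True
    except asyncio.TimeoutError:
        drained = False
    if drained:
        # Late re-touch: same condition as the original write, now a no-op.
        write_marker(state_dir)
    return drained


if __name__ == "__main__":
    async def slow_drain() -> None:
        await asyncio.sleep(1.0)  # simulates agents that outlive the timeout

    state_dir = Path(tempfile.mkdtemp())
    finished = asyncio.run(stop_gateway(state_dir, slow_drain, drain_timeout=0.05))
    print(finished, (state_dir / MARKER_NAME).exists())
```

In this sketch the marker exists whether or not the drain completes, which is the property the change relies on.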

Tradeoff

The old code intentionally skipped the marker when drain timed out, reasoning that force-interrupted sessions might be in an inconsistent state (trailing tool response, no final assistant message). That's a valid concern, but:

  1. Losing the entire conversation is worse than resuming a slightly stale state
  2. The agent can handle a trailing tool result gracefully — it just continues
  3. The stuck-loop detector (_suspend_stuck_loop_sessions) already catches genuinely stuck sessions across 3+ restarts
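To illustrate the third point, here is one way a detector of this kind could work. This is an assumption-laden sketch and not the implementation of `_suspend_stuck_loop_sessions`: the function name `update_restart_counts`, the streak-counting rule, and the data structures are invented for the example, and only the "3+ restarts" threshold comes from the text above.

```python
from collections import Counter

STUCK_THRESHOLD = 3  # "3+ restarts", per the PR text


def update_restart_counts(counts: Counter, active_at_restart: set[str]) -> set[str]:
    """Record sessions active at this restart; return the ones to suspend.

    Hypothetical rule: a session is considered stuck once it has been
    active at STUCK_THRESHOLD consecutive restarts.
    """
    # A session that was not active this time is no longer on a streak.
    for session_id in list(counts):
        if session_id not in active_at_restart:
            del counts[session_id]
    for session_id in active_at_restart:
        counts[session_id] += 1
    return {sid for sid, n in counts.items() if n >= STUCK_THRESHOLD}


if __name__ == "__main__":
    counts = Counter()
    update_restart_counts(counts, {"a", "b"})
    update_restart_counts(counts, {"a"})
    # "a" has now been active at three restarts in a row; "b" broke its streak.
    print(update_restart_counts(counts, {"a", "b"}))
```

The point of the sketch is that per-session streak tracking can catch a genuinely stuck session even when the marker no longer signals a drain timeout.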

Testing

  • Deployed in production with this patch for 3 days
  • Gateway restarted multiple times (both planned and SIGKILL)
  • All active sessions resumed seamlessly after restart
  • No false-positive session resets observed

brantzh6 added 2 commits April 16, 2026 23:32
…ssion context

Problem:
When the gateway stops (restart/update/shutdown), it writes a
`.clean_shutdown` marker AFTER draining active agents. If the drain
takes longer than systemd's TimeoutStopSec (default 60s), systemd
sends SIGKILL. The marker is never written, so the next startup calls
suspend_recently_active(), which resets all recently-active sessions.

Impact:
Users lose their entire conversation context after a gateway restart.
The agent has no memory of what was just discussed — this is a poor
experience for messaging platform users who expect continuity.

For example, a user discussing a complex task via Telegram or Feishu
triggers `hermes update`. The gateway restarts, and if drain exceeds
60s (common with long-running tool calls), the conversation is wiped.
The user returns to a blank slate with no idea what happened.

Solution:
Move the clean-shutdown marker write to BEFORE drain begins. This
guarantees the marker exists even if SIGKILL arrives during drain.
The marker is still re-touched after successful drain for
completeness, but the early write is the real safeguard.

Evidence:
In production over 3 days, we observed 6 SIGKILL events from systemd
due to drain timeout. Without this fix, each would have caused session
context loss.
@alt-glitch added labels on Apr 25, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery)
@alt-glitch

Collaborator

Likely duplicate of #9128 — same fix: write clean-shutdown marker before drain begins. Also related to #11806 which addresses the same session continuity issue.

@teknium1

Contributor

Thanks for the detailed bug report and production evidence — the SIGKILL logs and session-loss observations are exactly the kind of real-world signal that matters.

This is an automated hermes-sweeper review.

The session continuity problem this PR addresses was independently fixed on main by commit cb4adda (PR #11852/#12301, merged Apr 18 2026: "fix(gateway): auto-resume sessions after drain-timeout restart"). Rather than moving the .clean_shutdown marker write earlier, that fix introduces a dedicated resume_pending state on SessionEntry that:

  • Sets resume_pending=True on all forcibly-interrupted sessions before interrupting them (gateway/run.py:2527–2558)
  • suspend_recently_active() at startup now skips resume_pending sessions instead of resetting them (gateway/session.py:1079)
  • get_or_create_session() branches on resume_pending to reuse the existing session_id+transcript rather than spawning a fresh session (gateway/session.py:851–855)

This approach is more precise than the early-marker strategy: it flags exactly the affected sessions rather than suppressing suspension for every session at startup, which leaves the stuck-loop detection behavior intact.
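The three steps above can be sketched as follows. This is a simplified illustration, not the code at the cited locations: the `SessionEntry` fields other than `resume_pending`, the `mark_interrupted` helper, the dictionary-based store, and the choice to clear the flag on reuse are all assumptions made for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SessionEntry:
    # Simplified; only resume_pending is named in the review above.
    session_id: str
    transcript: list[str] = field(default_factory=list)
    recently_active: bool = False
    resume_pending: bool = False
    suspended: bool = False


def mark_interrupted(entries: list[SessionEntry]) -> None:
    # Hypothetical helper: flag sessions just before force-interrupting them.
    for entry in entries:
        entry.resume_pending = True


def suspend_recently_active(entries: list[SessionEntry]) -> int:
    """Suspend recently-active sessions, skipping those flagged for resume."""
    suspended = 0
    for entry in entries:
        if entry.resume_pending:
            continue
        if entry.recently_active:
            entry.suspended = True
            suspended += 1
    return suspended


def get_or_create_session(store: dict[str, SessionEntry], key: str) -> SessionEntry:
    entry = store.get(key)
    if entry is not None and entry.resume_pending:
        # Reuse the existing session_id and transcript.
        entry.resume_pending = False  # assumption: flag is cleared on reuse
        return entry
    if entry is not None and not entry.suspended:
        return entry
    fresh = SessionEntry(session_id=uuid.uuid4().hex)
    store[key] = fresh
    return fresh
```

In this sketch only the sessions that were actually force-interrupted bypass suspension, so any other recently-active session still goes through the normal crash-recovery path.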

The .clean_shutdown intentional-skip-on-timeout path that this PR proposes to change is now effectively inert with respect to session context preservation, since resume_pending handles the active-session case independently.

Related open PRs #9128 and #11806 that @alt-glitch flagged address the same root cause and are also superseded by the same merged fix.


Labels

  • comp/gateway (Gateway runner, session dispatch, delivery)
  • P1 (High: major feature broken, no workaround)
  • type/bug (Something isn't working)


3 participants