fix(gateway): write clean-shutdown marker before drain to preserve session context by brantzh6 · Pull Request #11099 · NousResearch/hermes-agent

brantzh6 · 2026-04-16T15:35:23Z

Problem

When the gateway stops (hermes update, hermes gateway restart, /restart), it writes a .clean_shutdown marker after draining active agents. If the drain outlasts systemd's TimeoutStopSec (60s here; stock systemd defaults to 90s unless the unit or DefaultTimeoutStopSec overrides it), systemd sends SIGKILL and the marker is never written.

On the next startup, suspend_recently_active() sees no marker, concludes it was a crash, and resets all recently-active sessions — wiping conversation context.
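The startup decision described above can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the function name `startup_should_reset_sessions`, the `state_dir` parameter, and the choice to delete the marker once it has been read are all assumptions made for the example. Only the `.clean_shutdown` filename comes from the PR.

```python
from pathlib import Path

MARKER_NAME = ".clean_shutdown"  # filename from the PR; its location is assumed


def startup_should_reset_sessions(state_dir: Path) -> bool:
    """Return True when the previous run looks like a crash (no marker found).

    Hypothetical helper: the real check lives inside suspend_recently_active().
    """
    marker = state_dir / MARKER_NAME
    if marker.exists():
        # Consume the marker so a later crash is still detected (assumed behavior).
        marker.unlink()
        return False
    # No marker: a SIGKILL during drain is indistinguishable from a crash here.
    return True
```

Under this model, a SIGKILL that lands before the marker write produces exactly the same on-disk state as a crash, which is why the sessions get reset.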

Impact on messaging users

For users on Telegram, Discord, Feishu, WeChat etc., conversation continuity is critical. They expect the agent to remember what was just discussed. Losing context after a gateway restart is jarring and breaks trust:

User is mid-discussion about a complex task
Admin runs hermes update → gateway restarts
Drain takes >60s (common with long-running tool calls, multiple active agents)
systemd SIGKILLs the process
User returns, sends a follow-up message → blank slate, agent has no memory of the prior conversation

Evidence

In production over 3 days, we observed 6 SIGKILL events from systemd due to drain timeout:

Apr 14 14:05:39 systemd: Killing process 263 (python) with signal SIGKILL.
Apr 14 22:11:21 systemd: Killing process 231 (python) with signal SIGKILL.
Apr 15 08:03:03 python: Gateway drain timed out after 60.0s with 2 active agent(s)
Apr 15 08:03:04 systemd: Killing process 1412 (python) with signal SIGKILL.
Apr 15 09:00:58 systemd: Killing process 6263 (python) with signal SIGKILL.
Apr 16 12:31:19 python: Gateway drain timed out after 60.0s with 1 active agent(s)
Apr 16 12:31:29 systemd: hermes-gateway.service: Failed with result exit-code.

Without this fix, each of these events would cause session context loss for active users.

Solution

Move the .clean_shutdown marker write to before drain begins. Two-line change:

Early write (top of _stop_impl()): Write marker immediately when stop begins, before any drain logic
Late re-touch (after drain): Keep the existing write as a no-op for completeness

This guarantees the marker survives even if SIGKILL arrives during drain.
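The reordering can be illustrated with a small sketch. It is not the body of `_stop_impl()`: the `stop_gateway` function, the `drain` callable, and the `state_dir` argument are hypothetical names, and a drain timeout stands in for the point at which systemd would send SIGKILL.

```python
import asyncio
from pathlib import Path


async def stop_gateway(state_dir: Path, drain, drain_timeout: float = 60.0) -> bool:
    """Write the marker first, then drain. Returns True if the drain finished."""
    marker = state_dir / ".clean_shutdown"

    # Early write: the marker is on disk before any drain logic runs, so a
    # SIGKILL during the drain cannot prevent it from existing.
    marker.touch()

    try:
        await asyncio.wait_for(drain(), timeout=drain_timeout)
        drained = True
    except asyncio.TimeoutError:
        drained = False

    if drained:
        # Late re-touch: kept for completeness, effectively a no-op now.
        marker.touch()
    return drained
```

For example, calling `asyncio.run(stop_gateway(path, slow_drain, drain_timeout=0.01))` with a drain coroutine that sleeps for a second returns `False`, yet the marker file is already present.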

Tradeoff

The old code intentionally skipped the marker when drain timed out, reasoning that force-interrupted sessions might be in an inconsistent state (trailing tool response, no final assistant message). That's a valid concern, but:

Losing the entire conversation is worse than resuming a slightly stale state
The agent can handle a trailing tool result gracefully — it just continues
The stuck-loop detector (_suspend_stuck_loop_sessions) already catches genuinely stuck sessions across 3+ restarts
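To make the third point concrete, here is one way a restart-count safeguard could work. This is only a guess at the shape of such a detector, not the logic of `_suspend_stuck_loop_sessions`: the `record_restart` helper, the `Counter` storage, and the absence of any counter reset are assumptions. Only the threshold of 3 is taken from the text above.

```python
from collections import Counter

STUCK_THRESHOLD = 3  # "3+ restarts", per the description above


def record_restart(active_ids, counts: Counter) -> list:
    """Bump a per-session counter at each restart.

    Returns the session ids that were active across STUCK_THRESHOLD or more
    restarts and should therefore be suspended (hypothetical policy).
    """
    for sid in active_ids:
        counts[sid] += 1
    return sorted(sid for sid in active_ids if counts[sid] >= STUCK_THRESHOLD)
```

A session that is merely interrupted once or twice never reaches the threshold, so it would resume normally even when the marker is written early.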

Testing

Deployed in production with this patch for 3 days
Gateway restarted multiple times (both planned and SIGKILL)
All active sessions resumed seamlessly after restart
No false-positive session resets observed

…ssion context

Problem: When the gateway stops (restart/update/shutdown), it writes a `.clean_shutdown` marker AFTER draining active agents. If the drain takes longer than systemd's TimeoutStopSec (default 60s), systemd sends SIGKILL. The marker is never written, so the next startup calls suspend_recently_active(), which resets all recently-active sessions.

Impact: Users lose their entire conversation context after a gateway restart. The agent has no memory of what was just discussed, which is a poor experience for messaging platform users who expect continuity. For example, a user discussing a complex task via Telegram or Feishu triggers `hermes update`. The gateway restarts, and if drain exceeds 60s (common with long-running tool calls), the conversation is wiped. The user returns to a blank slate with no idea what happened.

Solution: Move the clean-shutdown marker write to BEFORE drain begins. This guarantees the marker exists even if SIGKILL arrives during drain. The marker is still re-touched after successful drain for completeness, but the early write is the real safeguard.

Evidence: In production over 3 days, we observed 6 SIGKILL events from systemd due to drain timeout. Without this fix, each would have caused session context loss.

alt-glitch · 2026-04-25T10:17:14Z

Likely duplicate of #9128 — same fix: write clean-shutdown marker before drain begins. Also related to #11806 which addresses the same session continuity issue.

teknium1 · 2026-04-27T04:32:03Z

Thanks for the detailed bug report and production evidence — the SIGKILL logs and session-loss observations are exactly the kind of real-world signal that matters.

This is an automated hermes-sweeper review.

The session continuity problem this PR addresses was independently fixed on main by commit cb4adda (PR #11852/#12301, merged Apr 18 2026: "fix(gateway): auto-resume sessions after drain-timeout restart"). Rather than moving the .clean_shutdown marker write earlier, that fix introduces a dedicated resume_pending state on SessionEntry that:

Sets resume_pending=True on all forcibly-interrupted sessions before interrupting them (gateway/run.py:2527–2558)
suspend_recently_active() at startup now skips resume_pending sessions instead of resetting them (gateway/session.py:1079)
get_or_create_session() branches on resume_pending to reuse the existing session_id+transcript rather than spawning a fresh session (gateway/session.py:851–855)

This approach is more precise than the early-marker strategy: it flags exactly the affected sessions rather than suppressing suspension for every session at startup, which leaves the stuck-loop detection behavior intact.
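The three steps listed above can be sketched as a toy model. This is not the merged code from commit cb4adda: the `SessionStore` class, the `mark_interrupted` method, and the `recently_active` and `suspended` fields are hypothetical, introduced only to show how a per-session `resume_pending` flag interacts with startup suspension.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SessionEntry:
    session_id: str
    recently_active: bool = False  # assumed field, for illustration
    resume_pending: bool = False   # flag named in the review
    suspended: bool = False        # assumed field, for illustration


class SessionStore:
    def __init__(self) -> None:
        self.sessions: Dict[str, SessionEntry] = {}  # keyed by chat/user key

    def mark_interrupted(self, keys: List[str]) -> None:
        """Flag sessions before force-interrupting them at drain timeout."""
        for key in keys:
            self.sessions[key].resume_pending = True

    def suspend_recently_active(self) -> List[str]:
        """At startup, suspend recently active sessions, skipping flagged ones."""
        suspended = []
        for key, entry in self.sessions.items():
            if entry.recently_active and not entry.resume_pending:
                entry.suspended = True
                suspended.append(key)
        return suspended

    def get_or_create_session(self, key: str) -> SessionEntry:
        entry = self.sessions.get(key)
        if entry is not None and entry.resume_pending:
            # Reuse the existing session id (and so its transcript).
            entry.resume_pending = False
            return entry
        if entry is None or entry.suspended:
            entry = SessionEntry(session_id=uuid.uuid4().hex)
            self.sessions[key] = entry
        return entry
```

In this model only sessions that were not flagged get a fresh session id after a restart, so the suspension path still applies to everything else.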

The .clean_shutdown intentional-skip-on-timeout path that this PR proposes to change is now effectively inert with respect to session context preservation, since resume_pending handles the active-session case independently.

Related open PRs #9128 and #11806 that @alt-glitch flagged address the same root cause and are also superseded by the same merged fix.

brantzh6 added 2 commits April 16, 2026 23:32

chore: add brantzh6 to AUTHOR_MAP

9675c71

alt-glitch added the type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/gateway (Gateway runner, session dispatch, delivery) labels Apr 25, 2026

teknium1 closed this Apr 27, 2026

alt-glitch mentioned this pull request Apr 27, 2026

fix(gateway): write clean-shutdown marker at start of stop() not end #9128

Open

4 tasks

alt-glitch mentioned this pull request May 18, 2026

Gateway restart can lose long-running sessions during shutdown drain #27856

Closed

brantzh6 wants to merge 2 commits into NousResearch:main from brantzh6:fix/early-clean-shutdown-marker
