fix(gateway): write clean-shutdown marker before drain to preserve session context #11099
brantzh6 wants to merge 2 commits into
…ssion context

Problem: When the gateway stops (restart/update/shutdown), it writes a `.clean_shutdown` marker AFTER draining active agents. If the drain takes longer than systemd's TimeoutStopSec (default 60s), systemd sends SIGKILL. The marker is never written, so the next startup calls suspend_recently_active(), which resets all recently-active sessions.

Impact: Users lose their entire conversation context after a gateway restart. The agent has no memory of what was just discussed, which is a poor experience for messaging platform users who expect continuity. For example, a user discussing a complex task via Telegram or Feishu triggers `hermes update`. The gateway restarts, and if drain exceeds 60s (common with long-running tool calls), the conversation is wiped. The user returns to a blank slate with no idea what happened.

Solution: Move the clean-shutdown marker write to BEFORE drain begins. This guarantees the marker exists even if SIGKILL arrives during drain. The marker is still re-touched after successful drain for completeness, but the early write is the real safeguard.

Evidence: In production over 3 days, we observed 6 SIGKILL events from systemd due to drain timeout. Without this fix, each would have caused session context loss.
Thanks for the detailed bug report and production evidence — the SIGKILL logs and session-loss observations are exactly the kind of real-world signal that matters. This is an automated hermes-sweeper review. The session continuity problem this PR addresses was independently fixed on
This approach is more precise than the early-marker strategy: it flags exactly the affected sessions rather than suppressing startup-wide suspension globally, which leaves the stuck-loop detection behavior intact. The related open PRs #9128 and #11806 that @alt-glitch flagged address the same root cause and are also superseded by the same merged fix.
Problem
When the gateway stops (`hermes update`, `hermes gateway restart`, `/restart`), it writes a `.clean_shutdown` marker after draining active agents. If the drain exceeds systemd's `TimeoutStopSec` (default 60s), systemd sends SIGKILL and the marker is never written.
On the next startup, `suspend_recently_active()` sees no marker, concludes it was a crash, and resets all recently-active sessions, wiping conversation context.
Impact on messaging users
For users on Telegram, Discord, Feishu, WeChat etc., conversation continuity is critical. They expect the agent to remember what was just discussed. Losing context after a gateway restart is jarring and breaks trust:
`hermes update` → gateway restarts
Evidence
In production over 3 days, we observed 6 SIGKILL events from systemd due to drain timeout:
Without this fix, each of these events would cause session context loss for active users.
Solution
Move the `.clean_shutdown` marker write to before drain begins. Two-line change:
`_stop_impl()`: Write the marker immediately when stop begins, before any drain logic
This guarantees the marker survives even if SIGKILL arrives during drain.
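The ordering change can be illustrated with a small simulation. This is a minimal sketch, not the gateway's actual code: `stop_gateway`, `startup_should_suspend`, `simulate`, and the `DrainKilled` exception are hypothetical stand-ins for `_stop_impl()`, the marker check behind `suspend_recently_active()`, and a SIGKILL arriving mid-drain. Only the `.clean_shutdown` file name and the before/after-drain ordering come from the description above.

```python
import tempfile
from pathlib import Path

MARKER_NAME = ".clean_shutdown"  # marker file name from the PR description


class DrainKilled(Exception):
    """Simulation only: stands in for SIGKILL arriving during drain."""


def stop_gateway(state_dir: Path, drain, marker_first: bool) -> None:
    """Hypothetical stand-in for _stop_impl(); not the real gateway code."""
    marker = state_dir / MARKER_NAME
    if marker_first:
        marker.touch()  # proposed: write the marker before any drain logic
    drain()             # never returns normally if the process is killed
    marker.touch()      # old write location; a re-touch in the new ordering


def startup_should_suspend(state_dir: Path) -> bool:
    """Hypothetical startup check: a missing marker is treated as a crash."""
    return not (state_dir / MARKER_NAME).exists()


def simulate(marker_first: bool) -> bool:
    """Return True if startup would suspend sessions after a kill mid-drain."""
    def killed_drain() -> None:
        raise DrainKilled()

    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp)
        try:
            stop_gateway(state_dir, killed_drain, marker_first)
        except DrainKilled:
            pass  # the "process" died here; nothing after drain() ran
        return startup_should_suspend(state_dir)
```

With the old ordering, `simulate(marker_first=False)` returns `True` (no marker, so sessions would be reset); with the proposed ordering, `simulate(marker_first=True)` returns `False`.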
Tradeoff
The old code intentionally skipped the marker when drain timed out, reasoning that force-interrupted sessions might be in an inconsistent state (trailing tool response, no final assistant message). That's a valid concern, but:
`_suspend_stuck_loop_sessions` already catches genuinely stuck sessions across 3+ restarts
Testing