Gateway restart silently drops in-flight Feishu sessions — no user notification #38836

@vulturekun

Summary

When the gateway is restarted (via systemctl restart or openclaw gateway restart), any in-flight LLM requests are silently dropped. The user who sent a message via Feishu receives no reply and no notification that a restart occurred. From the user's perspective, the bot simply stops responding.

Environment

  • OpenClaw version: 2026.3.2
  • Channel: Feishu (飞书) via WebSocket long connection
  • Deployment: systemd user service (openclaw-gateway.service)
  • Agents affected: All Feishu-bound agents (8 agents, 8 Feishu bot accounts)

Steps to Reproduce

  1. Send a message to any Feishu-bound agent (e.g., xiaotang)
  2. While the agent is processing (LLM request in-flight), restart the gateway:
    systemctl --user restart openclaw-gateway
  3. Observe: the user never receives a reply or any indication that the request was lost

Expected Behavior

After a gateway restart, users with interrupted sessions should receive a notification (e.g., "Service restarted, please resend your last message") through the same Feishu channel.

Analysis

Looking at the gateway source code, I found:

  1. SIGUSR1 triggers graceful restart with drain (DRAIN_TIMEOUT_MS = 30s), but SIGTERM (used by systemctl restart) does NOT drain — it proceeds to shutdown immediately.

  2. abortedLastRun flag exists in sessions.json entries but is only set to false on new runs — there's no post-restart logic to scan for abortedLastRun=true sessions and notify users.

  3. Restart Sentinel mechanism exists (writeRestartSentinel / consumeRestartSentinel in gateway-cli-vk3t7zJU.js) but only notifies a single session specified at shutdown time — not all affected sessions.

  4. server.close() sends restartExpectedMs: 1500 to WebSocket clients, but Feishu users reach the gateway through the bot → WebSocket bridge, so they never see this hint.
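
Finding 1 suggests that SIGTERM could share the drain path SIGUSR1 already uses. The following is a minimal sketch of that idea, not the gateway's actual code: InflightTracker, track, drain, and installShutdownHandlers are hypothetical names, and only the 30 s timeout is taken from the analysis above.

```javascript
// Illustrative only: every name here is hypothetical, not an OpenClaw API.
const DRAIN_TIMEOUT_MS = 30_000; // the 30 s value cited in the analysis

class InflightTracker {
  constructor() {
    this.pending = new Map(); // sessionKey -> promise of the running request
  }

  // Register a running request and forget it once it settles.
  track(sessionKey, promise) {
    const settled = promise
      .catch(() => {}) // a failed request still counts as finished
      .finally(() => {
        // Only remove the entry if a newer request has not replaced it.
        if (this.pending.get(sessionKey) === settled) {
          this.pending.delete(sessionKey);
        }
      });
    this.pending.set(sessionKey, settled);
    return promise;
  }

  // Session keys that still have a request in flight.
  activeSessions() {
    return [...this.pending.keys()];
  }

  // Wait for all requests or the timeout, whichever comes first.
  // Resolves with the sessions still running at the deadline.
  async drain(timeoutMs = DRAIN_TIMEOUT_MS) {
    let timer;
    const deadline = new Promise((resolve) => {
      timer = setTimeout(resolve, timeoutMs);
    });
    await Promise.race([
      Promise.allSettled([...this.pending.values()]),
      deadline,
    ]);
    clearTimeout(timer);
    return this.activeSessions();
  }
}

// Route both signals through the same drain path before shutting down.
// `shutdown` is a caller-supplied callback (assumption).
function installShutdownHandlers(tracker, shutdown) {
  for (const signal of ["SIGUSR1", "SIGTERM"]) {
    process.on(signal, async () => {
      const interrupted = await tracker.drain();
      shutdown(signal, interrupted);
    });
  }
}
```

The list returned by drain() is exactly the set of sessions that would need a notification after restart. Note that systemd sends SIGKILL once TimeoutStopSec elapses, so any drain window has to fit inside that limit.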

Relevant code locations

Component                          File                     Lines
Drain logic (SIGUSR1 only)         gateway-cli-vk3t7zJU.js  22991-22999
SIGTERM handler (no drain)         gateway-cli-vk3t7zJU.js  23015-23017
Restart Sentinel (single session)  gateway-cli-vk3t7zJU.js  20568-20635
abortedLastRun field               sessions-XdimqNx2.js     9089-9091
Session write lock cleanup         sessions-XdimqNx2.js     23-164

Proposed Enhancement

Option A: Built-in post-restart notification (preferred)

On gateway startup, scan every agent's sessions.json for sessions where:

  • abortedLastRun == true, OR
  • updatedAt is within the last N minutes AND the session has a valid deliveryContext

Then automatically send a notification via the original channel (Feishu, Telegram, etc.) informing the user that a restart occurred.
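
The selection rule in Option A can be sketched as a pure function. This is an illustration under stated assumptions, not the gateway's code: findInterruptedSessions is a hypothetical name, sessions.json is assumed to parse to an object keyed by session key, and updatedAt is assumed to be epoch milliseconds. A session without a deliveryContext has nowhere to send the notice, so the sketch skips it even when abortedLastRun is set.

```javascript
// Assumed entry shape (not confirmed by the source):
//   { abortedLastRun: boolean, updatedAt: <epoch ms>, deliveryContext: {...} }
// findInterruptedSessions is a hypothetical helper, not an OpenClaw function.
function findInterruptedSessions(sessions, nowMs, recentWindowMs) {
  const hits = [];
  for (const [key, entry] of Object.entries(sessions)) {
    // Without a delivery target there is no channel to notify through.
    if (entry.deliveryContext == null) continue;

    const aborted = entry.abortedLastRun === true;
    const recent =
      typeof entry.updatedAt === "number" &&
      nowMs - entry.updatedAt <= recentWindowMs;

    if (aborted || recent) {
      hits.push({
        key,
        reason: aborted ? "abortedLastRun" : "recentlyActive",
        deliveryContext: entry.deliveryContext,
      });
    }
  }
  return hits;
}

// Usage: N = 5 minutes, evaluated once at startup.
const exampleSessions = {
  "feishu:alice": { abortedLastRun: true, updatedAt: 0, deliveryContext: { channel: "feishu" } },
  "feishu:bob": { abortedLastRun: false, updatedAt: 9_940_000, deliveryContext: { channel: "feishu" } },
  "feishu:carol": { abortedLastRun: false, updatedAt: 1_000, deliveryContext: { channel: "feishu" } },
  "cli:local": { abortedLastRun: true, updatedAt: 9_999_000 },
};
const toNotify = findInterruptedSessions(exampleSessions, 10_000_000, 5 * 60 * 1000);
console.log(toNotify.map((s) => `${s.key} (${s.reason})`));
```

Keeping the reason alongside each hit lets the notifier word the message differently for a confirmed abort than for a session that was merely active near the restart.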

Option B: Lifecycle hooks

Provide pre-stop / post-start hooks in the gateway configuration:

gateway:
  hooks:
    pre-stop: "script-to-collect-active-sessions.sh"
    post-start: "script-to-notify-users.sh"
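
One way the gateway might dispatch such hooks is sketched below. Everything here is an assumption for illustration: runLifecycleHook is a hypothetical name, the config object mirrors the YAML above after parsing, and the run callback stands in for whatever actually executes the script (for example a wrapper around child_process.execFile).

```javascript
// Hypothetical dispatcher for the proposed hooks; not an existing OpenClaw API.
// `run` is injected by the caller and must return a promise.
async function runLifecycleHook(config, phase, run) {
  const command = config?.gateway?.hooks?.[phase];
  if (!command) {
    return { phase, ran: false }; // no hook configured for this phase
  }
  try {
    await run(command);
    return { phase, ran: true, ok: true };
  } catch (err) {
    // Report the failure but let the restart continue.
    return { phase, ran: true, ok: false, error: String(err) };
  }
}
```

Catching the error rather than rethrowing is a deliberate choice in this sketch: a broken notification script should not be able to block a shutdown or a startup.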

Current Workaround

We created a wrapper script (openclaw-restart.sh) that:

  1. Collects recently active Feishu sessions from sessions.json before restart
  2. Uses openclaw gateway restart (SIGUSR1-based, with drain) instead of systemctl restart
  3. After the new process starts, sends notifications via openclaw agent --deliver to each affected user

This works but is fragile and shouldn't be necessary — the gateway should handle this natively.
