Config-reload deferral logged but not honored — systemd SIGTERM kills gateway, in-flight user message lost with no retry #73918

@shandutta

Config-reload deferral logged but not honored: systemd SIGTERM kills gateway 9s later, inbound user message dropped, no retry

Summary

When a config change triggers a gateway restart, the gateway's "defer until N operations complete" logic runs and logs the deferral, but systemd issues SIGTERM about 9 seconds later regardless. Any user message still in flight at that point is already persisted to the session JSONL but never receives an assistant reply, and the gateway has no retry-on-restart for in-flight dispatches, so the message stays orphaned until the user sends another one.

In my case, an inbound Telegram message ("Is that ok?", msg 3675) landed at 23:00:30 UTC. The next assistant reply for that turn came 2 hours 14 minutes later, only after the user re-pinged. Both messages were then processed in the same turn at 01:14 UTC.

This is related to but distinct from:

This issue is the concrete config-reload + systemd path, with two narrow bugs that could be fixed in isolation.

Environment

  • OpenClaw: 2026.4.25
  • Linux 6.8.0-110-generic, node 25.x
  • Gateway under systemd user unit (openclaw-gateway.service)
  • Channel: Telegram, forum group with thread routing
  • Agent: claude-max-proxy backend, model claude-opus-4-7
  • Date: 2026-04-28 → 2026-04-29

Timeline (real incident, journalctl --user)

| Time (UTC) | Event |
|---|---|
| 23:00:30 | User msg 3675 ("Is that ok?") arrives, persisted to session JSONL with runtime context |
| 23:01:03 | [reload] config change detected; evaluating reload (browser.ssrfPolicy.allowedHostnames) |
| 23:01:09 | [reload] config change requires gateway restart — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) complete |
| 23:01:18 | systemd: Stopping openclaw-gateway.service (deferral did not hold) |
| 23:01:18 | [gateway] signal SIGTERM received; shutting down |
| 23:01:19 | Stopped openclaw-gateway.service. Consumed 4h 22min CPU time |
| 23:01:41 | New gateway process loading configuration |
| 23:01:59 | [gateway] ready (6 plugins; 17.7s) |
| 23:01:59 | No retry of msg 3675. No assistant reply ever generated for that turn. |
| 01:14 (next day) | User re-pings ("Why didn't you respond earlier..."), Claude finally sees both messages and responds to both at once |

Total dispatch loss: 2h 14min, only resolved by user-initiated re-prompt.

Root cause: two narrow bugs

Bug 1 — Deferral isn't honored by systemd

The reload path logs "deferring until N operations complete" (the 23:01:09 entry), but systemd sends SIGTERM to the unit 9 seconds later anyway. Possible causes:

  • The deferral mechanism is purely in-process: it logs intent but does not actually delay the systemctl restart call, or
  • systemctl restart is called right after the deferral is logged, without checking the in-process gate, or
  • TimeoutStopSec in the unit file is too short for the 4 operations + 2 replies + 2 embedded runs to drain. This looks less likely here, since the process exited about 1 second after SIGTERM (23:01:18 → 23:01:19) rather than being killed at a timeout.

Whichever it is, the user-facing effect is that the deferral log message is misleading — it suggests the gateway will hold off, but it doesn't.

Bug 2 — No retry-on-restart for already-persisted messages

The user message landed at 23:00:30 and was persisted to the session JSONL before the restart. After the restart at 23:01:59, the gateway came up clean, but it never scanned the session JSONLs for user messages that have no assistant reply after them.

The data is already on disk. The only missing piece is a startup pass that:

  1. Walks recent session JSONLs (last hour, say)
  2. Identifies turns where the last entry is a user message with no assistant reply
  3. Re-dispatches those to the appropriate agent

This would close the gap for any restart cause — config reload, openclaw update, OOM, manual restart — without needing a per-cause fix.
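Step 2 of that pass can be sketched as a small pure function. This is only an illustration under assumptions: the issue implies each JSONL line carries a `role` field, but the `id` field, the `JsonlEntry` type, and the function name are hypothetical and not taken from OpenClaw's code.

```typescript
// Hypothetical entry shape: only `role` is implied by the issue; `id` is assumed.
interface JsonlEntry {
  role: string;
  id?: string | number;
}

/**
 * Return the trailing user entry of a session JSONL if nothing follows it,
 * otherwise null. Unparseable lines are skipped so that a line truncated by
 * an abrupt shutdown does not hide the message before it.
 */
function findOrphanedUserMessage(jsonl: string): JsonlEntry | null {
  const lines = jsonl.split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = (lines[i] ?? "").trim();
    if (line === "") continue;

    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // partial write or corrupt line: look at the previous one
    }
    if (typeof parsed !== "object" || parsed === null) continue;

    const role = (parsed as { role?: unknown }).role;
    if (typeof role !== "string") continue;

    // The newest well-formed entry decides the outcome.
    // Assumption: any non-user role means the turn is not orphaned.
    return role === "user" ? (parsed as JsonlEntry) : null;
  }
  return null;
}
```

Treating every non-user trailing entry as "answered" is a simplification. A turn interrupted after a tool call but before the final reply would need a stricter check.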

Proposed fixes

Fix 1 (smaller, easier): make deferral actually defer

If the deferral mechanism is meant to gate the restart, wire it through to the systemctl call. Either:

  • Call systemctl restart --no-block immediately, but have the gateway's SIGTERM handler wait until in-flight ops drain before exiting (I suspect this is what the current logic is meant to do). TimeoutStopSec then has to be long enough to cover the drain.
  • Or: the config-watcher should not call systemctl restart directly. It should set a "pending restart" flag, let the in-flight ops complete, and only then call restart.

If the deferral is purely advisory and the restart is non-negotiable, remove the misleading log line so operators don't think they have a grace period.
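The "pending restart" flag idea can be sketched as a small gate that counts in-flight operations and runs the restart callback only once the count reaches zero. This is a minimal sketch under assumptions: `RestartGate`, `begin`, and `requestRestart` are hypothetical names, and the real reload path may be structured quite differently.

```typescript
// Hypothetical names throughout; not OpenClaw's actual API.
class RestartGate {
  private inFlight = 0;
  private pendingRestart: (() => void) | null = null;

  /** Register an in-flight operation; call the returned function when it completes. */
  begin(): () => void {
    this.inFlight += 1;
    let finished = false;
    return () => {
      if (finished) return; // tolerate double completion
      finished = true;
      this.inFlight -= 1;
      this.maybeRestart();
    };
  }

  /** Record that a restart is wanted; run it only once nothing is in flight. */
  requestRestart(restart: () => void): void {
    this.pendingRestart = restart;
    this.maybeRestart();
  }

  /** Number of operations currently in flight. */
  get active(): number {
    return this.inFlight;
  }

  private maybeRestart(): void {
    if (this.inFlight === 0 && this.pendingRestart !== null) {
      const restart = this.pendingRestart;
      this.pendingRestart = null; // fire at most once per request
      restart();
    }
  }
}
```

In this shape the config watcher would call `requestRestart` with whatever actually triggers the restart, and each reply or embedded run would wrap itself in `begin()` and the returned completion function. A production version would probably also want a deadline, so that a stuck operation cannot block the restart forever.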

Fix 2 (bigger, more durable): startup retry from session JSONL

On gateway startup, after channels and plugins load:

for each session JSONL modified in the last 60 minutes:
    last_entry = last line of the file, parsed as JSON
    # if the last entry is a user message, no assistant reply follows it
    if last_entry.role == "user":
        dispatch_to_agent(session, last_entry)
        log "[startup] retried orphaned user message <id> from <session>"

This piggybacks on the existing JSONL persistence and would fix the entire class of "restart killed in-flight dispatch" bugs covered partially by #57425, #71178, #71429, and this issue.
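A slightly more concrete version of the startup pass might look like the following. It is a sketch under assumptions, not the gateway's real code: the `SessionSnapshot` and `SessionEntry` shapes, the injected `dispatch` and `log` callbacks, and the function name are all hypothetical. Only the `role === "user"` test, the 60-minute window, and the log format come from the proposal above. File reading is left out so the selection logic stays testable on its own.

```typescript
// All names are hypothetical; only the role check, window, and log format
// come from the proposal.
interface SessionEntry {
  role: string;
  id: string;
}

interface SessionSnapshot {
  sessionId: string;
  modifiedAtMs: number;    // file mtime in epoch milliseconds
  entries: SessionEntry[]; // parsed JSONL lines, oldest first
}

/** Re-dispatch trailing user messages from recently modified sessions. */
function retryOrphanedMessages(
  sessions: SessionSnapshot[],
  nowMs: number,
  dispatch: (sessionId: string, entry: SessionEntry) => void,
  log: (line: string) => void,
  windowMs: number = 60 * 60 * 1000,
): number {
  let retried = 0;
  for (const session of sessions) {
    // Skip sessions last touched outside the lookback window.
    if (nowMs - session.modifiedAtMs > windowMs) continue;

    const last = session.entries[session.entries.length - 1];
    // Empty session, or the turn already ended with a non-user entry.
    if (last === undefined || last.role !== "user") continue;

    dispatch(session.sessionId, last);
    log(`[startup] retried orphaned user message ${last.id} from ${session.sessionId}`);
    retried += 1;
  }
  return retried;
}
```

One design question this leaves open is idempotency. If the gateway restarts again before the retried reply is persisted, the same message would be dispatched twice, so a real implementation may want to record which message ids have already been retried.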

Severity

High for any user using OpenClaw as a primary chat surface. Silent message loss with multi-hour delay is the worst possible failure mode — the user thinks the assistant is ignoring them, the assistant has no record of being asked, and recovery requires the user to figure out something is wrong and re-prompt. In my case the assistant only realized what happened after the user explicitly asked "why didn't you respond earlier."

The exact config change in this incident was an SSRF allowlist update — a routine operation. This will fire any time someone touches browser.ssrfPolicy (or any other reloadable config) while a conversation is active.

Workaround

None. Until either fix lands, users have to notice the silence and re-prompt manually.

Cross-references
