Config-reload deferral honored, but systemd SIGTERM kills gateway 9s later — inbound user message dropped, no retry
Summary
When a config change triggers a gateway restart, the gateway's "defer until N operations complete" logic runs and logs the deferral, but systemd issues SIGTERM ~9 seconds later regardless. Any user message that landed during the deferral window is persisted to the session JSONL but never receives an assistant reply, and the gateway has no retry-on-restart for in-flight dispatches — so it stays orphaned indefinitely.
In my case: an inbound Telegram message (Is that ok?, msg 3675) landed at 23:00:30 UTC. The next assistant reply for that turn happened 2 hours 14 minutes later, only after the user re-pinged. Both messages were processed in the same turn at 01:14 UTC.
This is related to but distinct from:
This issue is the concrete config-reload + systemd path, with two narrow bugs that could be fixed in isolation.
Environment
- OpenClaw: 2026.4.25
- Linux 6.8.0-110-generic, node 25.x
- Gateway under systemd user unit (
openclaw-gateway.service)
- Channel: Telegram, forum group with thread routing
- Agent: claude-max-proxy backend, model
claude-opus-4-7
- Date: 2026-04-28 → 2026-04-29
Timeline (real incident, journalctl --user)
| Time (UTC) |
Event |
| 23:00:30 |
User msg 3675 ("Is that ok?") arrives, persisted to session JSONL with runtime context |
| 23:01:03 |
[reload] config change detected; evaluating reload (browser.ssrfPolicy.allowedHostnames) |
| 23:01:09 |
[reload] config change requires gateway restart — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) complete |
| 23:01:18 |
systemd: Stopping openclaw-gateway.service (deferral did not hold) |
| 23:01:18 |
[gateway] signal SIGTERM received; shutting down |
| 23:01:19 |
Stopped openclaw-gateway.service. Consumed 4h 22min CPU time |
| 23:01:41 |
New gateway process loading configuration |
| 23:01:59 |
[gateway] ready (6 plugins; 17.7s) |
| 23:01:59 |
No retry of msg 3675. No assistant reply ever generated for that turn. |
| 01:14 (next day) |
User re-pings ("Why didn't you respond earlier..."), Claude finally sees both messages and responds to both at once |
Total dispatch loss: 2h 14min, only resolved by user-initiated re-prompt.
Root cause: two narrow bugs
Bug 1 — Deferral isn't honored by systemd
The reload path logs deferring until N operations complete (line 23:01:09), but systemd SIGTERM's the unit 9 seconds later anyway. Either:
- The deferral mechanism is purely in-process (logs intent but doesn't actually delay the
systemctl restart call), or
systemctl restart is being called immediately after the deferral logging without honoring the in-process gate, or
TimeoutStopSec in the unit file is too short for the 4 operations + 2 replies + 2 embedded runs to drain
Whichever it is, the user-facing effect is that the deferral log message is misleading — it suggests the gateway will hold off, but it doesn't.
Bug 2 — No retry-on-restart for already-persisted messages
The user message landed at 23:00:30 and was persisted to the session JSONL before the restart. After the restart at 23:01:59, the gateway came up clean — but it never scanned the session JSONL for messages whose newest sibling isn't an assistant reply.
The data is already on disk. The only missing piece is a startup pass that:
- Walks recent session JSONLs (last hour, say)
- Identifies turns where the last entry is a user message with no assistant reply
- Re-dispatches those to the appropriate agent
This would close the gap for any restart cause — config reload, openclaw update, OOM, manual restart — without needing a per-cause fix.
Proposed fixes
Fix 1 (smaller, easier): make deferral actually defer
If the deferral mechanism is meant to gate restart, wire it through to systemctl. Either:
systemctl restart --no-block immediately, but have the gateway internally hold the SIGTERM handler until ops drain (suspect this is what the current logic thinks it does)
- Or: the config-watcher should not call
systemctl restart directly — it should set a "pending restart" flag, complete the in-flight ops, then call restart
If the deferral is purely advisory and the restart is non-negotiable, remove the misleading log line so operators don't think they have a grace period.
Fix 2 (bigger, more durable): startup retry from session JSONL
On gateway startup, after channels and plugins load:
for each session jsonl modified in last 60 minutes:
last_entry = tail -1
if last_entry.role == "user" and no_assistant_reply_after(last_entry):
dispatch_to_agent(session, last_entry)
log "[startup] retried orphaned user message <id> from <session>"
This piggybacks on the existing JSONL persistence and would fix the entire class of "restart killed in-flight dispatch" bugs covered partially by #57425, #71178, #71429, and this issue.
Severity
High for any user using OpenClaw as a primary chat surface. Silent message loss with multi-hour delay is the worst possible failure mode — the user thinks the assistant is ignoring them, the assistant has no record of being asked, and recovery requires the user to figure out something is wrong and re-prompt. In my case the assistant only realized what happened after the user explicitly asked "why didn't you respond earlier."
The exact config change in this incident was an SSRF allowlist update — a routine operation. This will fire any time someone touches browser.ssrfPolicy (or any other reloadable config) while a conversation is active.
Workaround
None. Until either fix lands, users have to notice the silence and re-prompt manually.
Cross-references
Config-reload deferral honored, but systemd SIGTERM kills gateway 9s later — inbound user message dropped, no retry
Summary
When a config change triggers a gateway restart, the gateway's "defer until N operations complete" logic runs and logs the deferral, but systemd issues SIGTERM ~9 seconds later regardless. Any user message that landed during the deferral window is persisted to the session JSONL but never receives an assistant reply, and the gateway has no retry-on-restart for in-flight dispatches — so it stays orphaned indefinitely.
In my case: an inbound Telegram message (
Is that ok?, msg3675) landed at 23:00:30 UTC. The next assistant reply for that turn happened 2 hours 14 minutes later, only after the user re-pinged. Both messages were processed in the same turn at 01:14 UTC.This is related to but distinct from:
openclaw updaterun mid-turn causes total message loss on Telegram (and likely Discord) #71178 (openclaw updatemid-turn message loss)This issue is the concrete config-reload + systemd path, with two narrow bugs that could be fixed in isolation.
Environment
openclaw-gateway.service)claude-opus-4-7Timeline (real incident, journalctl --user)
3675("Is that ok?") arrives, persisted to session JSONL with runtime context[reload] config change detected; evaluating reload (browser.ssrfPolicy.allowedHostnames)[reload] config change requires gateway restart — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) completesystemd: Stopping openclaw-gateway.service(deferral did not hold)[gateway] signal SIGTERM received; shutting downStopped openclaw-gateway.service. Consumed 4h 22min CPU time[gateway] ready (6 plugins; 17.7s)3675. No assistant reply ever generated for that turn.Total dispatch loss: 2h 14min, only resolved by user-initiated re-prompt.
Root cause: two narrow bugs
Bug 1 — Deferral isn't honored by systemd
The reload path logs
deferring until N operations complete(line 23:01:09), but systemd SIGTERM's the unit 9 seconds later anyway. Either:systemctl restartcall), orsystemctl restartis being called immediately after the deferral logging without honoring the in-process gate, orTimeoutStopSecin the unit file is too short for the 4 operations + 2 replies + 2 embedded runs to drainWhichever it is, the user-facing effect is that the deferral log message is misleading — it suggests the gateway will hold off, but it doesn't.
Bug 2 — No retry-on-restart for already-persisted messages
The user message landed at 23:00:30 and was persisted to the session JSONL before the restart. After the restart at 23:01:59, the gateway came up clean — but it never scanned the session JSONL for messages whose newest sibling isn't an assistant reply.
The data is already on disk. The only missing piece is a startup pass that:
This would close the gap for any restart cause — config reload,
openclaw update, OOM, manual restart — without needing a per-cause fix.Proposed fixes
Fix 1 (smaller, easier): make deferral actually defer
If the deferral mechanism is meant to gate restart, wire it through to
systemctl. Either:systemctl restart --no-blockimmediately, but have the gateway internally hold the SIGTERM handler until ops drain (suspect this is what the current logic thinks it does)systemctl restartdirectly — it should set a "pending restart" flag, complete the in-flight ops, then call restartIf the deferral is purely advisory and the restart is non-negotiable, remove the misleading log line so operators don't think they have a grace period.
Fix 2 (bigger, more durable): startup retry from session JSONL
On gateway startup, after channels and plugins load:
This piggybacks on the existing JSONL persistence and would fix the entire class of "restart killed in-flight dispatch" bugs covered partially by #57425, #71178, #71429, and this issue.
Severity
High for any user using OpenClaw as a primary chat surface. Silent message loss with multi-hour delay is the worst possible failure mode — the user thinks the assistant is ignoring them, the assistant has no record of being asked, and recovery requires the user to figure out something is wrong and re-prompt. In my case the assistant only realized what happened after the user explicitly asked "why didn't you respond earlier."
The exact config change in this incident was an SSRF allowlist update — a routine operation. This will fire any time someone touches
browser.ssrfPolicy(or any other reloadable config) while a conversation is active.Workaround
None. Until either fix lands, users have to notice the silence and re-prompt manually.
Cross-references
openclaw updaterun mid-turn causes total message loss on Telegram (and likely Discord) #71178 (openclaw updatemid-turn message loss) — same failure class, different trigger