Config-reload deferral logged but not honored — systemd SIGTERM kills gateway, in-flight user message lost with no retry #73918

# Config-reload deferral logged but not honored: systemd SIGTERM kills gateway 9s later, inbound user message dropped with no retry

## Summary

When a config change triggers a gateway restart, the gateway's "defer until N operations complete" logic runs and logs the deferral, but **systemd issues SIGTERM ~9 seconds later regardless**. Any user message still being processed at that point is already persisted to the session JSONL but never receives an assistant reply. Because the gateway has no retry-on-restart for in-flight dispatches, the message stays orphaned until the user writes again.

In my case: an inbound Telegram message (`Is that ok?`, msg `3675`) landed at 23:00:30 UTC. The next assistant reply for that turn happened **2 hours 14 minutes later**, only after the user re-pinged. Both messages were processed in the same turn at 01:14 UTC.

This is related to but distinct from:
- #57425 (broad "graceful restart with session recovery" feature)
- #71178 (`openclaw update` mid-turn message loss)

This issue is the concrete *config-reload + systemd* path, with two narrow bugs that could be fixed in isolation.

## Environment

- OpenClaw: 2026.4.25
- Linux 6.8.0-110-generic, node 25.x
- Gateway under systemd user unit (`openclaw-gateway.service`)
- Channel: Telegram, forum group with thread routing
- Agent: claude-max-proxy backend, model `claude-opus-4-7`
- Date: 2026-04-28 → 2026-04-29

## Timeline (real incident, journalctl --user)

| Time (UTC) | Event |
|---|---|
| 23:00:30 | User msg `3675` ("Is that ok?") arrives, persisted to session JSONL with runtime context |
| 23:01:03 | `[reload] config change detected; evaluating reload (browser.ssrfPolicy.allowedHostnames)` |
| 23:01:09 | `[reload] config change requires gateway restart — deferring until 4 operation(s), 2 reply(ies), 2 embedded run(s) complete` |
| 23:01:18 | `systemd: Stopping openclaw-gateway.service` (deferral did not hold) |
| 23:01:18 | `[gateway] signal SIGTERM received; shutting down` |
| 23:01:19 | `Stopped openclaw-gateway.service. Consumed 4h 22min CPU time` |
| 23:01:41 | New gateway process loading configuration |
| 23:01:59 | `[gateway] ready (6 plugins; 17.7s)` |
| 23:01:59 | **No retry of msg `3675`. No assistant reply ever generated for that turn.** |
| 01:14 (next day) | User re-pings ("Why didn't you respond earlier..."), Claude finally sees both messages and responds to both at once |

Total dispatch loss: 2h 14min, only resolved by user-initiated re-prompt.
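
The intervals quoted in this report can be checked directly from the journal timestamps. A small Python sketch of that arithmetic follows; the 01:14 entry has no seconds in the excerpt, so `01:14:00` is an assumption, and the variable names are only illustrative.

```python
from datetime import datetime, timezone

def ts(day: int, hms: str) -> datetime:
    """Build a UTC timestamp on 2026-04-<day> from an HH:MM:SS string."""
    parsed = datetime.strptime(f"2026-04-{day} {hms}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)

msg_arrived = ts(28, "23:00:30")
deferral_logged = ts(28, "23:01:09")
sigterm_received = ts(28, "23:01:18")
gateway_ready = ts(28, "23:01:59")
reply_sent = ts(29, "01:14:00")  # seconds assumed; the journal excerpt only gives 01:14

deferral_held = sigterm_received - deferral_logged  # how long the "deferral" lasted
downtime = gateway_ready - sigterm_received         # gateway unavailable
reply_delay = reply_sent - msg_arrived              # user-visible gap

print(deferral_held, downtime, reply_delay)  # 0:00:09 0:00:41 2:13:30
```

The gateway itself was only down for about 41 seconds; the remaining two-plus hours come entirely from the missing retry.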

## Root cause: two narrow bugs

### Bug 1 — Deferral isn't honored by systemd

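The reload path logs `deferring until N operations complete` at 23:01:09, but systemd sends SIGTERM to the unit 9 seconds later anyway. One of the following is likely:

- The deferral mechanism is purely in-process (it logs intent but doesn't actually delay the `systemctl restart` call), or
- `systemctl restart` is being called immediately after the deferral logging without honoring the in-process gate, or
- The gateway does not wait for work to drain once SIGTERM arrives, or `TimeoutStopSec` in the unit file is too short for the 4 operations + 2 replies + 2 embedded runs to drain. `TimeoutStopSec` only bounds the time between SIGTERM and the follow-up SIGKILL; here the unit stopped about one second after SIGTERM, which suggests the gateway exited on its own rather than being killed by the timeout.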

Whichever it is, the user-facing effect is that the deferral log message is misleading — it suggests the gateway will hold off, but it doesn't.

### Bug 2 — No retry-on-restart for already-persisted messages

The user message landed at 23:00:30 and was persisted to the session JSONL **before** the restart. The new gateway process reported ready at 23:01:59 and came up clean, but it never scanned the session JSONL for turns whose newest entry is a user message with no assistant reply.

The data is already on disk. The only missing piece is a startup pass that:

1. Walks recent session JSONLs (last hour, say)
2. Identifies turns where the last entry is a user message with no assistant reply
3. Re-dispatches those to the appropriate agent

This would close the gap for *any* restart cause — config reload, `openclaw update`, OOM, manual restart — without needing a per-cause fix.

## Proposed fixes

### Fix 1 (smaller, easier): make deferral actually defer

If the deferral mechanism is meant to gate restart, wire it through to `systemctl`. Either:

- `systemctl restart --no-block` immediately, but have the gateway internally hold the SIGTERM handler until ops drain (suspect this is what the current logic *thinks* it does)
- Or: the config-watcher should not call `systemctl restart` directly — it should set a "pending restart" flag, complete the in-flight ops, then call restart

If the deferral is purely advisory and the restart is non-negotiable, **remove the misleading log line** so operators don't think they have a grace period.
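
The second option above (a pending-restart flag that is drained before the restart call) might look roughly like this. It is a minimal Python sketch for illustration only: `RestartGate`, `begin_op`, `end_op`, `request_restart`, and the `restart` callback are hypothetical names, not OpenClaw's actual API, and the real gateway runs on Node.

```python
import threading
from typing import Callable

class RestartGate:
    """Defers a restart until every in-flight operation has finished.

    Hypothetical sketch: names and structure are assumptions, not the
    gateway's real implementation.
    """

    def __init__(self, restart: Callable[[], None]) -> None:
        self._restart = restart  # e.g. a wrapper that shells out to systemctl
        self._lock = threading.Lock()
        self._in_flight = 0
        self._pending = False
        self._fired = False

    def begin_op(self) -> None:
        with self._lock:
            self._in_flight += 1

    def end_op(self) -> None:
        with self._lock:
            self._in_flight -= 1
            fire = self._should_fire()
        if fire:
            self._restart()

    def request_restart(self) -> None:
        """Called by the config watcher instead of restarting directly."""
        with self._lock:
            self._pending = True
            fire = self._should_fire()
        if fire:
            self._restart()

    def _should_fire(self) -> bool:
        # Caller holds the lock. Fire at most once, and only when idle.
        if self._pending and self._in_flight == 0 and not self._fired:
            self._fired = True
            return True
        return False

# Demo: a restart requested mid-operation is held until the operation ends.
restarts = []
gate = RestartGate(restart=lambda: restarts.append("restart"))
gate.begin_op()
gate.request_restart()
print(restarts)  # [] (still deferred)
gate.end_op()
print(restarts)  # ['restart']
```

In this sketch, operations that start after the restart was requested still hold the gate open, so a real implementation would probably also want to stop accepting new work or cap how long it waits.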

### Fix 2 (bigger, more durable): startup retry from session JSONL

On gateway startup, after channels and plugins load:

```
for each session jsonl modified in last 60 minutes:
    last_entry = tail -1
    if last_entry.role == "user" and no_assistant_reply_after(last_entry):
        dispatch_to_agent(session, last_entry)
        log "[startup] retried orphaned user message <id> from <session>"
```

This piggybacks on the existing JSONL persistence and would fix the entire class of "restart killed in-flight dispatch" bugs covered partially by #57425, #71178, #71429, and this issue.
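
The pseudocode above could be fleshed out roughly as follows. This is a Python sketch under stated assumptions, not the gateway's real code: the `role` and `id` field names, the `*.jsonl` directory layout, and the `dispatch_to_agent` callback are all hypothetical. It walks each file backwards so that trailing non-message entries (such as the runtime context persisted alongside a message) or a line truncated by the kill do not hide an unanswered user message.

```python
import json
import time
from pathlib import Path
from typing import Callable, Iterator, Optional

MESSAGE_ROLES = {"user", "assistant"}  # assumed values of the `role` field

def last_message(path: Path) -> Optional[dict]:
    """Return the newest user/assistant entry in a session JSONL, if any."""
    lines = path.read_text(encoding="utf-8").splitlines()
    for line in reversed(lines):
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # blank or truncated line, skip it
        if isinstance(entry, dict) and entry.get("role") in MESSAGE_ROLES:
            return entry
    return None

def find_orphans(session_dir: Path, max_age_s: float = 3600.0,
                 now: Optional[float] = None) -> Iterator[tuple[str, dict]]:
    """Yield (session_name, entry) for recent sessions ending on a user message."""
    now = time.time() if now is None else now
    for path in sorted(session_dir.glob("*.jsonl")):
        if now - path.stat().st_mtime > max_age_s:
            continue  # not modified within the window
        entry = last_message(path)
        if entry is not None and entry["role"] == "user":
            yield path.stem, entry

def retry_orphans(session_dir: Path,
                  dispatch_to_agent: Callable[[str, dict], None]) -> int:
    """Re-dispatch every orphaned user message; returns how many were retried."""
    count = 0
    for session, entry in find_orphans(session_dir):
        dispatch_to_agent(session, entry)
        print(f"[startup] retried orphaned user message {entry.get('id')} from {session}")
        count += 1
    return count
```

Reading each file in full keeps the sketch short; for large sessions, seeking from the end of the file would be the obvious refinement.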

## Severity

**High for any user using OpenClaw as a primary chat surface.** Silent message loss with multi-hour delay is the worst possible failure mode — the user thinks the assistant is ignoring them, the assistant has no record of being asked, and recovery requires the user to figure out something is wrong and re-prompt. In my case the assistant only realized what happened after the user explicitly asked "why didn't you respond earlier."

The exact config change in this incident was an SSRF allowlist update, a routine operation. This will fire any time someone touches `browser.ssrfPolicy` (or any other setting whose change requires a gateway restart) while a conversation is active.

## Workaround

None. Until either fix lands, users have to notice the silence and re-prompt manually.

## Cross-references

- #57425 (Feature: Graceful Gateway Restart with Session Recovery) — this issue is a concrete instance / smaller scope
- #71178 (`openclaw update` mid-turn message loss) — same failure class, different trigger
- #71429 (Telegram drops in-flight messages on sendChatAction failure) — same data-loss surface
- #55412 (GatewayDrainingError should auto-retry) — adjacent

