Gateway silently dies after auto-update: launchd removes service due to rapid restart cycle

## Summary

After auto-updating to v2026.3.24, the gateway was silently killed by macOS launchd ~3 hours later and never restarted, despite `KeepAlive: true` in the LaunchAgent plist. The service was dead for 4+ hours with zero alerts until manual recovery.

## Environment

- macOS 26.3.1 (arm64), Node 25.8.1
- OpenClaw v2026.3.23 → v2026.3.24 (auto-update via `openclaw doctor --fix`)
- LaunchAgent plist: `KeepAlive: true`, `ThrottleInterval: 1`

## Reproduction Timeline (March 25, 2026 — all times CT)

### Phase 1: Auto-update triggers rapid restart chain

| Time | Event |
|------|-------|
| 15:04:24 | Auto-update from v2026.3.23 → v2026.3.24. Gateway receives SIGTERM via `launchctl kickstart -k`. |
| 15:04:33 | Gateway restarts (9s later). |
| 15:06:36 | Telegram polling stall detected (no getUpdates for 119.87s), forces restart. |
| 15:07:16 | Gateway receives SIGTERM again. |
| 15:07:21 | Second SIGTERM during shutdown, ignored. |
| 15:08:17 | SIGUSR1 from restart sentinel (new v2026.3.24 feature). Full process restart. |
| 15:08:21 | Gateway restarts. Runs stably for ~3 hours. |

**3 restarts in 4 minutes.**

### Phase 2: launchd kills the service

| Time | Event |
|------|-------|
| 18:04:47 | Gateway still alive, serving WS requests (`channels.status`, `doctor.memory.status`). |
| 18:07:17 | Gateway receives SIGTERM. Clean shutdown logged. |

macOS system log at 18:07:17:
```
service inactive: ai.openclaw.gateway
removing service: ai.openclaw.gateway
```

**launchd did not just stop the process — it unregistered the entire service.** No `Setting service to enabled` follows (compare with the 15:04 cycle where re-registration happens immediately).

### Phase 3: 4-hour silent outage

| Time | Event |
|------|-------|
| 18:25 – 21:56 | Cron jobs attempt to connect every ~30 min, all fail with `gateway closed (1006 abnormal closure)`. |
| 22:16:13 | Manual recovery via Terminal (`openclaw gateway install` rewrites plist + `launchctl bootstrap`). |

`launchctl print gui/502/ai.openclaw.gateway` after recovery shows `immediate reason = inefficient` and `runs = 3`.
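
To make this check repeatable, the relevant fields can be pulled out of the `launchctl print` output with a small filter. This is a minimal sketch: the helper name `summarize_launchd_state` is hypothetical, `runs` and `immediate reason` are the fields quoted above, and `state` is assumed to be present in the output as well.

```bash
# Hypothetical helper: print only the launchd fields relevant to this report.
# Reads `launchctl print` output on stdin, so it also works on saved output.
summarize_launchd_state() {
  awk -F' = ' '
    { sub(/^[ \t]+/, "", $1) }   # strip leading indentation from the key
    $1 == "state" || $1 == "runs" || $1 == "immediate reason" {
      print $1 ": " $2
    }
  '
}

# Usage (label taken from this report):
#   launchctl print "gui/$(id -u)/ai.openclaw.gateway" | summarize_launchd_state
```

Nested sections of the output may contain keys with the same names, so the result is a quick triage aid rather than a precise parser.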

## Root Cause

The v2026.3.24 auto-update, the Telegram polling stall recovery, and the new restart sentinel together triggered **3 rapid restarts in 4 minutes**. With `ThrottleInterval: 1` in the plist, macOS launchd classified the service as "inefficient" (its crash-loop heuristic).

~3 hours later, launchd proactively killed the gateway process and **removed the service from its job table entirely**. Once removed, `KeepAlive: true` is irrelevant — the service definition no longer exists. The gateway stayed dead until manual re-registration.
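
One way to keep independent restart triggers from stacking up is a shared cooldown guard that every restart path consults first. The sketch below is illustrative only: `restart_allowed`, `STATE_FILE`, and `COOLDOWN_SECONDS` are hypothetical names, not OpenClaw settings, and a real implementation would need to decide which triggers are safe to suppress.

```bash
# Hypothetical restart guard: allow at most one restart per cooldown window.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/openclaw-last-restart}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-300}"

# Returns 0 if a restart may proceed (and records its time), or 1 if another
# restart already happened within the cooldown. An optional first argument
# overrides the current time in epoch seconds, which is useful for testing.
restart_allowed() {
  now="${1:-$(date +%s)}"
  last=0
  if [ -f "$STATE_FILE" ]; then
    last=$(cat "$STATE_FILE")
  fi
  last="${last:-0}"
  if [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
    return 1
  fi
  printf '%s\n' "$now" > "$STATE_FILE"
  return 0
}

# Usage: restart_allowed && launchctl kickstart -k "gui/$(id -u)/ai.openclaw.gateway"
```

With a 300-second window, the triggers at 15:07:16 and 15:08:17 (172s and 233s after the 15:04:24 restart) would have been skipped.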

## Evidence

- **launchd system log** confirms `removing service` at 18:07:17 with no subsequent re-registration until 22:16:13.
- **`launchctl print`** shows `immediate reason = inefficient` — macOS's daemon throttle classification.
- **Gateway logs** show clean SIGTERM shutdown (not a crash), confirming the kill came from launchd.
- **No crash-looping** between 18:07 and 22:16 — zero gateway startup attempts. The service was simply gone from launchd's job table.
- **Mac was awake** throughout (caffeinate running 62+ hours).

## Suggested Fix

1. **Raise `ThrottleInterval` to 10–15 seconds** in the plist template. `ThrottleInterval: 1` lets launchd relaunch the job after only 1 second (launchd's default is 10 seconds), but macOS appears to interpret several restarts in quick succession as a crash loop and penalizes the service.

2. **Add `ProcessType: Interactive`** to the plist. This changes launchd's resource classification heuristic and raises the threshold for "inefficient" termination.

3. **Consolidate the update restart path.** The current v2026.3.24 update triggers up to 3 separate restarts (kickstart + Telegram polling stall + restart sentinel SIGUSR1). These should be coalesced into a single restart to avoid tripping launchd's throttle.

4. **Add a self-check.** If the gateway detects it was removed from launchd (e.g., periodic `launchctl print` self-check), it could re-register itself before dying.
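
The self-check in item 4 could look roughly like the sketch below. It assumes the check runs outside the gateway process (for example from a separate scheduled job), since a gateway that has already been killed cannot re-register itself. The function name `ensure_gateway_registered` is hypothetical, and `PLIST` is an assumed install location that may differ on a given machine.

```bash
# Sketch of an external watchdog. LABEL comes from this report; PLIST is an
# assumed path, not confirmed by the report.
LABEL="ai.openclaw.gateway"
PLIST="${PLIST:-$HOME/Library/LaunchAgents/$LABEL.plist}"

ensure_gateway_registered() {
  domain="gui/$(id -u)"
  # `launchctl print` exits non-zero when the service is not in the job table.
  if launchctl print "$domain/$LABEL" >/dev/null 2>&1; then
    echo "registered"
  else
    echo "missing; re-bootstrapping"
    launchctl bootstrap "$domain" "$PLIST"
  fi
}

# Usage: run periodically, e.g. from a separate scheduled job:
#   ensure_gateway_registered
```

Calling `openclaw gateway install` in the `else` branch instead of `launchctl bootstrap` would also rewrite the plist, matching the manual recovery used in this incident.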

## Impact

- 4+ hours of complete silence — no cron jobs, no Telegram messages, no heartbeats.
- 9 cron jobs left in zombie "running" state after recovery, requiring manual disable/enable to clear.
- No alert was sent before the outage (the gateway was killed, so it couldn't alert).

## Workaround

After the gateway dies, run from Terminal:
```bash
openclaw gateway install   # rewrites plist + bootstraps service
```
