Gateway silently dies after auto-update: launchd removes service due to rapid restart cycle #54861

@QuarkAssistant

Summary

After auto-updating to v2026.3.24, the gateway was silently killed by macOS launchd ~3 hours later and never restarted, despite KeepAlive: true in the LaunchAgent plist. The service was dead for 4+ hours with zero alerts until manual recovery.

Environment

  • macOS 26.3.1 (arm64), Node 25.8.1
  • OpenClaw v2026.3.23 → v2026.3.24 (auto-update via openclaw doctor --fix)
  • LaunchAgent plist: KeepAlive: true, ThrottleInterval: 1
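A quick way to spot this configuration on other machines is to audit the plist for a ThrottleInterval below launchd's documented default of 10 seconds. The sketch below uses Python's standard plistlib; the function name audit_plist and the warning text are illustrative and not part of OpenClaw.

```python
import plistlib

# launchd's documented default: a job is not respawned more than once every 10 seconds.
LAUNCHD_DEFAULT_THROTTLE = 10


def audit_plist(data: bytes) -> list[str]:
    """Return warnings for LaunchAgent settings that permit very rapid respawns."""
    plist = plistlib.loads(data)
    warnings = []
    throttle = plist.get("ThrottleInterval", LAUNCHD_DEFAULT_THROTTLE)
    if throttle < LAUNCHD_DEFAULT_THROTTLE:
        warnings.append(
            f"ThrottleInterval is {throttle}s, below the launchd default of "
            f"{LAUNCHD_DEFAULT_THROTTLE}s"
        )
    return warnings


# Hypothetical plist mirroring the settings listed under Environment.
sample = plistlib.dumps(
    {"Label": "ai.openclaw.gateway", "KeepAlive": True, "ThrottleInterval": 1}
)
for warning in audit_plist(sample):
    print(warning)
```

To check a real install, read the plist file's bytes and pass them to audit_plist.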

Reproduction Timeline (March 25, 2026 — all times CT)

Phase 1: Auto-update triggers rapid restart chain

Time     | Event
15:04:24 | Auto-update from v2026.3.23 → v2026.3.24. Gateway receives SIGTERM via launchctl kickstart -k.
15:04:33 | Gateway restarts (9s later).
15:06:36 | Telegram polling stall detected (no getUpdates for 119.87s), forces restart.
15:07:16 | Gateway receives SIGTERM again.
15:07:21 | Second SIGTERM during shutdown, ignored.
15:08:17 | SIGUSR1 from restart sentinel (new v2026.3.24 feature). Full process restart.
15:08:21 | Gateway restarts. Runs stably for ~3 hours.

3 restarts in 4 minutes.
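The restart density can be checked directly from the timeline. The sketch below takes the three restart triggers (the 15:04:24 SIGTERM, the 15:07:16 SIGTERM, and the 15:08:17 SIGUSR1) and counts how many fall inside a sliding window; the helper name max_restarts_in_window is illustrative.

```python
from datetime import datetime, timedelta

# Restart triggers taken from the Phase 1 timeline (local time, CT).
RESTART_TRIGGERS = ["15:04:24", "15:07:16", "15:08:17"]


def max_restarts_in_window(times, window):
    """Return the largest number of events that fall inside any sliding window."""
    stamps = sorted(datetime.strptime(t, "%H:%M:%S") for t in times)
    best = 0
    start = 0
    for end, stamp in enumerate(stamps):
        # Advance the left edge until the window covers at most `window` of time.
        while stamp - stamps[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best


span = (
    datetime.strptime(RESTART_TRIGGERS[-1], "%H:%M:%S")
    - datetime.strptime(RESTART_TRIGGERS[0], "%H:%M:%S")
)
print(span.total_seconds())  # 233.0 seconds, just under 4 minutes
print(max_restarts_in_window(RESTART_TRIGGERS, timedelta(minutes=4)))  # 3
```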

Phase 2: launchd kills the service

Time     | Event
18:04:47 | Gateway still alive, serving WS requests (channels.status, doctor.memory.status).
18:07:17 | Gateway receives SIGTERM. Clean shutdown logged.

macOS system log at 18:07:17:

service inactive: ai.openclaw.gateway
removing service: ai.openclaw.gateway

launchd did not just stop the process; it unregistered the entire service. No "Setting service to enabled" entry follows (compare with the 15:04 cycle, where re-registration happens immediately).

Phase 3: 4-hour silent outage

Time          | Event
18:25 – 21:56 | Cron jobs attempt to connect every ~30 min, all fail with gateway closed (1006 abnormal closure).
22:16:13      | Manual recovery via Terminal (openclaw gateway install rewrites plist + launchctl bootstrap).

Running launchctl print gui/502/ai.openclaw.gateway after recovery shows "immediate reason = inefficient" and "runs = 3".
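For anyone collecting the same evidence, the two fields can be pulled out of the launchctl print output with a small parser. This is a minimal sketch: the SAMPLE text contains only the two lines quoted in this report (real output has many more fields), and parse_launchctl_fields is an illustrative name.

```python
import re


def parse_launchctl_fields(text, wanted=("immediate reason", "runs")):
    """Extract `key = value` pairs for the wanted keys from launchctl print output."""
    found = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([^=]+?)\s*=\s*(.+?)\s*$", line)
        if match and match.group(1) in wanted:
            found[match.group(1)] = match.group(2)
    return found


# Abbreviated excerpt: only the two fields reported in this issue.
SAMPLE = """
    runs = 3
    immediate reason = inefficient
"""

print(parse_launchctl_fields(SAMPLE))
```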

Root Cause

The v2026.3.24 auto-update + restart sentinel + Telegram polling stall recovery triggered 3 rapid restarts in 4 minutes. With ThrottleInterval: 1 in the plist, macOS launchd classified the service as "inefficient" (crash-loop heuristic).

~3 hours later, launchd proactively killed the gateway process and removed the service from its job table entirely. Once removed, KeepAlive: true is irrelevant — the service definition no longer exists. The gateway stayed dead until manual re-registration.

Evidence

  • launchd system log confirms removing service at 18:07:17 with no subsequent re-registration until 22:16:13.
  • launchctl print shows immediate reason = inefficient — macOS's daemon throttle classification.
  • Gateway logs show clean SIGTERM shutdown (not a crash), confirming the kill came from launchd.
  • No crash-looping between 18:07 and 22:16 — zero gateway startup attempts. The service was simply gone from launchd's job table.
  • Mac was awake throughout (caffeinate running 62+ hours).

Suggested Fix

  1. Raise ThrottleInterval to 10–15 seconds in the plist template. ThrottleInterval sets the minimum number of seconds between launchd respawns of a job (the default is 10). Setting it to 1 permits near-immediate respawns, and the burst of restarts after the update appears to be what launchd penalized as a crash loop.

  2. Add ProcessType: Interactive to the plist. Interactive jobs run with the same resource limits as apps, which should change launchd's resource classification and make an "inefficient" termination less likely.

  3. Consolidate the update restart path. The current v2026.3.24 update triggers up to 3 separate restarts (kickstart + Telegram polling stall + restart sentinel SIGUSR1). These should be coalesced into a single restart to avoid tripping launchd's throttle.

  4. Add a self-check. If the gateway detects it was removed from launchd (e.g., periodic launchctl print self-check), it could re-register itself before dying.
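Fix 3 could take the shape of a small coalescer that performs the first restart and suppresses further requests inside a cooldown. The sketch below is an illustration, not OpenClaw's actual restart path: the class RestartCoalescer, the 300-second cooldown, and the reason strings are all assumptions.

```python
import time


class RestartCoalescer:
    """Perform the first restart, then suppress requests that arrive within `cooldown` seconds."""

    def __init__(self, restart_fn, cooldown=300.0, clock=time.monotonic):
        self._restart_fn = restart_fn
        self._cooldown = cooldown
        self._clock = clock
        self._last_restart = None
        self.suppressed = []  # reasons that were dropped, kept for logging

    def request(self, reason):
        """Return True if a restart was performed, False if it was coalesced."""
        now = self._clock()
        if self._last_restart is not None and now - self._last_restart < self._cooldown:
            self.suppressed.append(reason)
            return False
        self._last_restart = now
        self._restart_fn(reason)
        return True


# Replay the Phase 1 triggers as seconds elapsed since 15:04:24.
elapsed = [0.0]
performed = []
coalescer = RestartCoalescer(performed.append, clock=lambda: elapsed[0])
for offset, reason in [(0.0, "update kickstart"),
                       (172.0, "telegram polling stall"),
                       (233.0, "restart sentinel SIGUSR1")]:
    elapsed[0] = offset
    coalescer.request(reason)
print(performed)             # ['update kickstart']
print(coalescer.suppressed)  # ['telegram polling stall', 'restart sentinel SIGUSR1']
```

One trade-off: a suppressed request may reflect a genuine fault (for example, a real polling stall), so a production version would probably re-check the condition once the cooldown expires instead of discarding it.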

Impact

  • 4+ hours of complete silence — no cron jobs, no Telegram messages, no heartbeats.
  • 9 cron jobs left in zombie "running" state after recovery, requiring manual disable/enable to clear.
  • No alert was sent before the outage (the gateway was killed, so it couldn't alert).

Workaround

After the gateway dies, run from Terminal:

openclaw gateway install   # rewrites plist + bootstraps service
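Until a fix ships, the workaround can be automated with an external watchdog. The sketch below assumes that launchctl print exits non-zero when the service target is not registered, and that the script is run periodically by something other than the gateway's own LaunchAgent (for example, cron). The names ensure_registered and service_target are illustrative.

```python
import os
import subprocess

SERVICE_LABEL = "ai.openclaw.gateway"


def service_target(uid=None):
    """Build the launchd service target for the given user, e.g. gui/502/ai.openclaw.gateway."""
    uid = os.getuid() if uid is None else uid
    return f"gui/{uid}/{SERVICE_LABEL}"


def ensure_registered(run=subprocess.run, uid=None):
    """Re-install the gateway service if launchd no longer knows about it.

    Returns True if recovery was attempted, False if the service was still registered.
    """
    # Assumption: a non-zero exit status means launchd could not find the service.
    probe = run(["launchctl", "print", service_target(uid)],
                capture_output=True, text=True)
    if probe.returncode == 0:
        return False
    # Same command as the manual workaround: rewrites the plist and bootstraps the service.
    run(["openclaw", "gateway", "install"], check=True)
    return True
```

Calling ensure_registered() every few minutes would have bounded this outage to one polling interval. The run parameter exists so the function can be exercised with a fake command runner on machines without launchd.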
