Bug: gateway restart race condition causes launchd restart loop (macOS) #26904

@Lunar-actuary

Summary

On macOS, running openclaw gateway restart can leave the old gateway process alive long enough for launchd to start a new one before the old process has fully exited and released its port/lock. The new process fails to acquire the lock (lock timeout after 5000ms) and exits — then launchd retries every 10 seconds indefinitely, creating a restart loop that can persist for weeks unnoticed.

Environment

  • macOS (Apple Silicon, M4)
  • OpenClaw v2026.2.19-2
  • Node.js v25.6.1
  • Managed via launchd plist (ai.openclaw.gateway.plist)
  • KeepAlive: true (no ThrottleInterval set)

Steps to Reproduce

  1. Gateway is running normally via launchd.
  2. Run openclaw gateway restart.
  3. launchd sends SIGTERM to the tracked process and immediately starts a new one.
  4. New process attempts to acquire the startup lock while old process (still shutting down) holds the port (18789) and PID lock.
  5. New process logs: Gateway failed to start: gateway already running (pid XXXXX); lock timeout after 5000ms and exits.
  6. launchd sees the exit → retries in 10 seconds (default ThrottleInterval) → loop begins.

Observed Symptoms

Gateway failed to start: gateway already running (pid 57305); lock timeout after 5000ms
Port 18789 is already in use.
- Gateway already running locally. Stop it (clawdbot gateway stop) or use a different port.

This message repeated every ~10 seconds for 31 days, producing a log file of 49MB / 500,000+ lines, completely drowning out real errors.
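Because the loop is silent, a periodic scan of the gateway log for this signature is one way to catch it early. The following is a minimal sketch, not part of OpenClaw: the function name `count_lock_timeouts` is hypothetical, and the regular expression is derived only from the message quoted above, so it may need adjusting if the wording differs between versions.

```python
import re
from collections import Counter

# Pattern taken from the log line quoted in this issue; the exact wording
# in other OpenClaw versions is an assumption.
LOCK_TIMEOUT = re.compile(
    r"gateway already running \(pid (\d+)\); lock timeout after \d+ms"
)


def count_lock_timeouts(lines):
    """Count lock-timeout failures per blocking pid in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOCK_TIMEOUT.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A large count against a single pid suggests that one orphaned process is blocking every restart attempt. The function takes any iterable of lines, so an open log file can be passed directly.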

Impact

  • Real errors become invisible in the log noise.
  • Background CPU/IO overhead from thousands of failed spawn attempts per day.
  • No user-visible alert; the original gateway process keeps serving normally, so the issue is entirely silent.
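To put a number on the overhead, here is a small worked calculation. It assumes one failed spawn per throttle interval, which matches the ~10 second cadence observed in the log; the helper name `failed_spawns` is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def failed_spawns(throttle_seconds, days=1):
    """Upper bound on spawn attempts, assuming one attempt per throttle interval."""
    return (SECONDS_PER_DAY * days) // throttle_seconds


print(failed_spawns(10))       # 8640 attempts per day at the default 10 s interval
print(failed_spawns(10, 31))   # 267840 attempts over the 31 days observed
print(failed_spawns(60))       # 1440 attempts per day with a 60 s interval
```

With roughly three log lines per failed attempt, about 268,000 attempts is consistent with the 500,000+ lines reported above.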

Root Cause Analysis

The restart command does not appear to wait for the old process to fully exit (port released, lock file cleared) before signalling launchd to start the new instance. On a fast M-series Mac, launchd starts the new process within milliseconds of SIGTERM — before Node.js has completed its shutdown sequence.

Suggested Fix

One or more of the following:

  1. Add a shutdown fence: Before launchd reloads, poll until the port/lock file is no longer held (with a reasonable timeout, e.g., 5s).
  2. Use launchctl stop + sleep + launchctl start instead of unload/load to give the old process time to exit cleanly.
  3. Document a recommended ThrottleInterval in the generated plist (e.g., 30–60 seconds) so that if a race does occur, the retry storm is bounded.
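The shutdown fence in option 1 could look roughly like the sketch below. It is an illustration in Python rather than the gateway's own code, and it only checks the TCP port: the location and format of the lock file are not known from this report, so that part is left out. The names `port_is_in_use` and `wait_for_port_release` are hypothetical; the port number and 5 second timeout come from the issue.

```python
import socket
import time


def port_is_in_use(host, port):
    """Return True if something still accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=0.5):
            return True
    except ConnectionRefusedError:
        # Nothing is listening any more: the old process has released the port.
        return False
    except OSError:
        # Timeouts and other errors are ambiguous; be conservative and keep waiting.
        return True


def wait_for_port_release(host="127.0.0.1", port=18789, timeout=5.0, interval=0.1):
    """Poll until the port is free. Return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        if not port_is_in_use(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A restart command could call `wait_for_port_release()` after stopping the old process and start the new one only when it returns True. If it returns False, failing loudly seems preferable to handing control back to launchd.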

Workaround

Adding ThrottleInterval 60 to the plist limits retry damage. The loop itself can be broken by running openclaw gateway restart again once the orphaned process has been manually killed.
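For anyone who prefers to script the plist change, this is a minimal sketch using Python's standard `plistlib`. The function name `set_throttle_interval` is hypothetical, and the plist path is passed in by the caller because the install location is not stated in this report (a per-user agent would typically live under ~/Library/LaunchAgents, but that is an assumption).

```python
import plistlib
from pathlib import Path


def set_throttle_interval(plist_path, seconds=60):
    """Set ThrottleInterval in a launchd plist, keeping all other keys unchanged."""
    path = Path(plist_path)
    with path.open("rb") as fp:
        job = plistlib.load(fp)

    job["ThrottleInterval"] = int(seconds)

    # Written back as an XML plist, which is plistlib's default format.
    with path.open("wb") as fp:
        plistlib.dump(job, fp)
    return job
```

launchd reads the plist when the job is loaded, so the job most likely needs to be unloaded and loaded again before the new interval takes effect.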