Summary
On macOS, running openclaw gateway restart can leave the old gateway process alive long enough for launchd to start a new one before the old process has fully exited and released its port/lock. The new process fails to acquire the lock (lock timeout after 5000ms) and exits — then launchd retries every 10 seconds indefinitely, creating a restart loop that can persist for weeks unnoticed.
Environment
- macOS (Apple Silicon, M4)
- OpenClaw v2026.2.19-2
- Node.js v25.6.1
- Managed via launchd plist (ai.openclaw.gateway.plist), KeepAlive: true (no ThrottleInterval set)
Steps to Reproduce
- Gateway is running normally via launchd.
- Run openclaw gateway restart.
- launchd sends SIGTERM to the tracked process and immediately starts a new one.
- New process attempts to acquire the startup lock while the old process (still shutting down) holds the port (18789) and the PID lock.
- New process logs Gateway failed to start: gateway already running (pid XXXXX); lock timeout after 5000ms and exits.
- launchd sees the exit → retries in 10 seconds (default ThrottleInterval) → loop begins.
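The race in the steps above can be sketched with a toy file-based lock. The helper and paths below are illustrative assumptions, not OpenClaw internals; the point is that the second contender gives up after a timeout, mirroring the "lock timeout after 5000ms" error:

```shell
#!/usr/bin/env bash
# Toy model of the startup race: a second "gateway" tries to take a
# lock the first still holds and times out, as the new launchd-spawned
# process does. The mkdir-based lock is a stand-in for OpenClaw's
# real lock mechanism, which the report does not show.

acquire_lock() {
  local lock_dir=$1 timeout_ms=$2 waited=0
  until mkdir "$lock_dir" 2>/dev/null; do
    if [ "$waited" -ge "$timeout_ms" ]; then
      echo "gateway already running; lock timeout after ${timeout_ms}ms" >&2
      return 1
    fi
    sleep 0.1
    waited=$((waited + 100))
  done
}

lock="$(mktemp -d)/gw.lock"
# "Old" gateway still holds the lock while shutting down:
mkdir "$lock"
# "New" gateway started by launchd fails to acquire it and exits:
acquire_lock "$lock" 500 || echo "new gateway exits; launchd will retry"
rmdir "$lock"
```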
Observed Symptoms
Gateway failed to start: gateway already running (pid 57305); lock timeout after 5000ms
Port 18789 is already in use.
- Gateway already running locally. Stop it (clawdbot gateway stop) or use a different port.
This message repeated every ~10 seconds for 31 days, producing a log file of 49MB / 500,000+ lines, completely drowning out real errors.
Impact
- Real errors become invisible in the log noise.
- Background CPU/IO overhead from thousands of failed spawn attempts per day.
- No user-visible alert; the gateway itself runs fine, so the issue is entirely silent.
Root Cause Analysis
The restart command does not appear to wait for the old process to fully exit (port released, lock file cleared) before signalling launchd to start the new instance. On a fast M-series Mac, launchd starts the new process within milliseconds of SIGTERM — before Node.js has completed its shutdown sequence.
Suggested Fix
One or more of the following:
- Add a shutdown fence: Before launchd reloads, poll until the port/lock file is no longer held (with a reasonable timeout, e.g., 5s).
- Use launchctl stop + sleep + launchctl start instead of unload/load to give the old process time to exit cleanly.
- Document a recommended ThrottleInterval in the generated plist (e.g., 30–60 seconds) so that if a race does occur, the retry storm is bounded.
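The "shutdown fence" idea can be sketched as a small poll loop run between stopping the old instance and starting the new one. The lock-file path, port, and 5 s budget here are assumptions, not OpenClaw's actual values:

```shell
#!/usr/bin/env bash
# Sketch of a shutdown fence: after signalling the old gateway, wait
# until its lock file disappears before letting launchd start the
# replacement. On macOS one could additionally confirm the port is
# free (e.g. lsof -iTCP:18789 -sTCP:LISTEN returns nothing).

wait_for_release() {
  local lock_file=$1 timeout_ms=${2:-5000} waited=0
  while [ -e "$lock_file" ]; do
    if [ "$waited" -ge "$timeout_ms" ]; then
      echo "fence: lock still held after ${timeout_ms}ms" >&2
      return 1
    fi
    sleep 0.1
    waited=$((waited + 100))
  done
  return 0
}
```

A restart wrapper could then run `launchctl stop ai.openclaw.gateway` (assuming the job label matches the plist name), call `wait_for_release` on the gateway's lock file, and only then run `launchctl start ai.openclaw.gateway`, instead of relying on launchd's immediate respawn.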
Workaround
Adding ThrottleInterval 60 to the plist limits retry damage. The loop itself can be broken by running openclaw gateway restart again once the orphaned process has been manually killed.
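For reference, the workaround corresponds to adding a key pair like the following inside the top-level dict of ai.openclaw.gateway.plist (a sketch of the standard launchd key, not OpenClaw-generated output):

```xml
<key>ThrottleInterval</key>
<integer>60</integer>
```

launchd only rereads the plist when the job is reloaded, so the change takes effect after unloading and loading the job (or at next login).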