Bug: gateway restart race condition causes launchd restart loop (macOS) #26904

## Summary

On macOS, running `openclaw gateway restart` can let launchd start a new gateway process before the old one has fully exited and released its port and lock. The new process fails to acquire the lock (`lock timeout after 5000ms`) and exits. launchd then retries every 10 seconds indefinitely, creating a restart loop that can persist for weeks unnoticed.

## Environment

- macOS (Apple Silicon, M4)
- OpenClaw v2026.2.19-2
- Node.js v25.6.1
- Managed via launchd plist (`ai.openclaw.gateway.plist`)
- `KeepAlive: true` (no `ThrottleInterval` set)

## Steps to Reproduce

1. Gateway is running normally via launchd.
2. Run `openclaw gateway restart`.
3. launchd sends SIGTERM to the tracked process and immediately starts a new one.
4. New process attempts to acquire the startup lock while old process (still shutting down) holds the port (18789) and PID lock.
5. New process logs: `Gateway failed to start: gateway already running (pid XXXXX); lock timeout after 5000ms` and exits.
6. launchd sees the exit → retries in 10 seconds (default `ThrottleInterval`) → loop begins.

## Observed Symptoms

```
Gateway failed to start: gateway already running (pid 57305); lock timeout after 5000ms
Port 18789 is already in use.
- Gateway already running locally. Stop it (clawdbot gateway stop) or use a different port.
```

This message repeated **every ~10 seconds** for **31 days**, producing a **49 MB** log file of **500,000+ lines** that completely drowned out real errors.
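
For scale, here is a back-of-the-envelope estimate of the retry volume. It assumes an exact 10-second cadence and that every attempt writes all three lines shown in the log excerpt. Both are simplifications, so the result is an upper bound rather than a measurement.

```python
# Rough estimate of the retry storm, under the assumptions stated above.
RETRY_INTERVAL_S = 10    # launchd's default ThrottleInterval
DAYS = 31                # duration of the loop in this report
LINES_PER_ATTEMPT = 3    # assumption: each failed start logs the 3 lines above

attempts_per_day = 24 * 60 * 60 // RETRY_INTERVAL_S   # 8640
total_attempts = attempts_per_day * DAYS              # 267840
total_lines = total_attempts * LINES_PER_ATTEMPT      # 803520

print(attempts_per_day, total_attempts, total_lines)
```

The observed 500,000+ lines fall below this bound, which is consistent with a slightly slower real-world cadence.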

## Impact

- Real errors become invisible in the log noise.
- Background CPU/IO overhead from thousands of failed spawn attempts per day.
- No user-visible alert; the gateway itself runs fine, so the issue is entirely silent.

## Root Cause Analysis

The `restart` command does not appear to wait for the old process to fully exit (port released, lock file cleared) before signalling launchd to start the new instance. On a fast M-series Mac, launchd starts the new process within milliseconds of SIGTERM — before Node.js has completed its shutdown sequence.

## Suggested Fix

One or more of the following:

1. **Add a shutdown fence**: Before launchd reloads, poll until the port/lock file is no longer held (with a reasonable timeout, e.g., 5s).
2. **Use `launchctl stop` + sleep + `launchctl start`** instead of unload/load to give the old process time to exit cleanly.
3. **Document a recommended `ThrottleInterval`** in the generated plist (e.g., 30–60 seconds) so that if a race does occur, the retry storm is bounded.
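
As a rough illustration of option 1, the Python sketch below polls until nothing accepts connections on the gateway port, giving up after a timeout. This is not OpenClaw's code: `port_is_free` and `wait_for_port_release` are hypothetical helper names, and probing the TCP port is only one possible signal (the lock file would need its own check).

```python
import errno
import socket
import time


def port_is_free(host: str, port: int) -> bool:
    """Return True only if a TCP connect to host:port is actively refused."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.2)
        # connect_ex returns 0 on success and an errno value on failure.
        # Anything other than "connection refused" (e.g. a timeout) is
        # conservatively treated as "still in use".
        return probe.connect_ex((host, port)) == errno.ECONNREFUSED


def wait_for_port_release(host: str, port: int,
                          timeout: float = 5.0,
                          interval: float = 0.1) -> bool:
    """Poll until the port is free; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if port_is_free(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A restart wrapper could call `wait_for_port_release("127.0.0.1", 18789)` after signalling the old process and only start the new instance once it returns `True`. The 5-second default mirrors the timeout suggested above and is not a tested value.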

## Workaround

Setting `ThrottleInterval` to 60 in the plist limits retry damage. The loop itself can be broken by manually killing the orphaned process and then running `openclaw gateway restart` again.
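
The plist change can be made by hand, or scripted. Below is a minimal sketch using Python's standard `plistlib`. The helper name `set_throttle_interval` is hypothetical, and the plist's location on disk is not stated in this report, so the path must be supplied by the caller.

```python
import plistlib
from pathlib import Path


def set_throttle_interval(plist_path: Path, seconds: int) -> dict:
    """Set ThrottleInterval in a launchd job plist and write it back as XML."""
    with plist_path.open("rb") as fh:
        job = plistlib.load(fh)        # accepts XML and binary plists
    job["ThrottleInterval"] = seconds  # minimum seconds between job spawns
    with plist_path.open("wb") as fh:
        plistlib.dump(job, fh)         # writes XML format by default
    return job
```

For example, `set_throttle_interval(Path("/path/to/ai.openclaw.gateway.plist"), 60)`, where the directory is a placeholder. launchd reads the plist when the job is loaded, so the job has to be reloaded for the new value to apply. If OpenClaw regenerates the plist (an assumption, based on "generated plist" above), the edit may be overwritten.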
