
fix: clean stale gateway PIDs in triggerOpenClawRestart before launchctl/systemctl #27448

Merged
steipete merged 1 commit into openclaw:main from Sid-Qin:fix/restart-stale-pid-cleanup-26736
Feb 26, 2026

Sid-Qin commented Feb 26, 2026

Summary

Addresses the second trigger path described in #26736 (comment by @doomsday616).

  • Problem: When the /restart command runs inside an embedded agent process (openclaw agent --message), there is no SIGUSR1 listener, so it falls through to triggerOpenClawRestart(). That function calls launchctl kickstart -k directly, bypassing the pre-restart port cleanup added in #27013 (fix(gateway): kill stale processes before restart to prevent port conflicts). If the gateway was originally started via TUI/CLI, the orphaned process still holds port 18789, so the new launchd instance cannot bind it and enters a crash loop.
  • Fix: Added synchronous stale-PID detection and termination directly inside triggerOpenClawRestart():
    1. findGatewayPidsOnPortSync(port) — uses lsof (sync via spawnSync) to find openclaw gateway processes holding the port
    2. terminateStaleProcessesSync(pids) — SIGTERM → 300ms wait → SIGKILL for survivors → 200ms wait
    3. cleanStaleGatewayProcessesSync() — resolves gateway port from env/defaults and runs the above
    4. Called at the top of triggerOpenClawRestart(), before any launchctl/systemctl command

This ensures every caller of triggerOpenClawRestart() gets port cleanup — not just the CLI openclaw gateway restart path.
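The termination sequence and port resolution described above could look roughly like the sketch below. It is not the code merged in this PR: the environment variable name OPENCLAW_GATEWAY_PORT, the helper names sleepSync and isAlive, and the choice to skip the final 200ms wait when nothing survived SIGTERM are all assumptions made for illustration. Only the signal order and the 300ms/200ms delays come from the description.

```typescript
// Hypothetical default; 18789 is the gateway port named in the PR description.
const DEFAULT_GATEWAY_PORT = 18789;

/** Resolve the port from env, falling back to the default (env var name is assumed). */
function resolveGatewayPort(
  env: Record<string, string | undefined> = process.env,
): number {
  const parsed = Number.parseInt(env["OPENCLAW_GATEWAY_PORT"] ?? "", 10);
  const valid = Number.isInteger(parsed) && parsed > 0 && parsed <= 65535;
  return valid ? parsed : DEFAULT_GATEWAY_PORT;
}

/** Block the current thread for `ms` milliseconds without busy-waiting. */
function sleepSync(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

/** Probe a PID with signal 0, which checks existence without delivering a signal. */
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

/** SIGTERM, wait 300ms, SIGKILL survivors, wait 200ms. Returns the PIDs targeted. */
function terminateStaleProcessesSync(pids: number[]): number[] {
  // Never signal the current process, even if a caller passes its PID.
  const targets = pids.filter((pid) => pid !== process.pid);
  if (targets.length === 0) return [];

  for (const pid of targets) {
    try {
      process.kill(pid, "SIGTERM");
    } catch {
      // The process already exited; nothing to do.
    }
  }
  sleepSync(300);

  const survivors = targets.filter(isAlive);
  for (const pid of survivors) {
    try {
      process.kill(pid, "SIGKILL");
    } catch {
      // Exited between the liveness probe and the kill.
    }
  }
  // Assumption: only wait again when something actually needed SIGKILL.
  if (survivors.length > 0) sleepSync(200);
  return targets;
}
```

Because everything here is synchronous, a caller can run it immediately before spawning launchctl or systemctl without restructuring the restart path around promises.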

Test plan

  • findGatewayPidsOnPortSync returns empty array for unused ports
  • Current process PID is never included in stale PID list
  • TypeScript compiles cleanly (tsc --noEmit)
  • Existing restart-sentinel tests still pass
  • Manual: start gateway via TUI, send /restart from embedded agent, verify clean restart without crash loop

Closes #26736

…nchctl/systemctl

When the /restart command runs inside an embedded agent process (no
SIGUSR1 listener), it falls through to triggerOpenClawRestart() which
calls launchctl kickstart -k directly — bypassing the pre-restart port
cleanup added in openclaw#27013. If the gateway was started via TUI/CLI, the
orphaned process still holds the port and the new launchd instance
crash-loops.

Add synchronous stale-PID detection (lsof) and termination
(SIGTERM→SIGKILL) inside triggerOpenClawRestart() itself, so every
caller — including the embedded agent /restart path — gets port cleanup
before the service manager restart command fires.

Closes openclaw#26736

Made-with: Cursor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR fixes a crash loop issue when /restart is called from an embedded agent process that lacks a SIGUSR1 listener. The fix adds synchronous stale PID detection and termination directly in triggerOpenClawRestart() to ensure port cleanup happens for all restart paths, not just the CLI path.
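To make the two restart paths concrete, here is a minimal sketch of the dispatch, with the cleanup placed ahead of the service-manager call. The RestartDeps interface, the requestRestart function, and the use of process.listenerCount to detect a SIGUSR1 handler are hypothetical; the PR does not show how the real code decides between the paths.

```typescript
type RestartMethod = "sigusr1" | "service-manager";

// Hypothetical seam so the ordering can be exercised without touching launchd/systemd.
interface RestartDeps {
  hasSigusr1Listener: () => boolean;
  signalSelf: () => void;
  cleanStalePids: () => void;
  restartService: () => void;
}

/** One plausible way to detect an in-process SIGUSR1 handler (an assumption). */
function hasSigusr1Listener(): boolean {
  return process.listenerCount("SIGUSR1") > 0;
}

function requestRestart(deps: RestartDeps): RestartMethod {
  if (deps.hasSigusr1Listener()) {
    // Gateway process: the in-process handler performs the restart.
    deps.signalSelf();
    return "sigusr1";
  }
  // Embedded agent: no listener, so fall through to the service manager.
  // Cleanup runs first so an orphaned gateway cannot keep holding the port.
  deps.cleanStalePids();
  deps.restartService();
  return "service-manager";
}
```

Putting the cleanup inside the fall-through branch, rather than in one particular caller, is what gives every caller the same guarantee.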

Key implementation details:

  • findGatewayPidsOnPortSync() uses lsof with field output parsing (-Fpc) to identify openclaw gateway processes on the configured port, excluding the current process
  • terminateStaleProcessesSync() implements graceful termination (SIGTERM → 300ms wait → SIGKILL for survivors → 200ms wait)
  • cleanStaleGatewayProcessesSync() orchestrates cleanup by resolving the gateway port from env/defaults and terminating any stale processes
  • Called at the top of triggerOpenClawRestart() before any launchctl/systemctl commands execute
  • Platform-aware implementation: returns empty array on Windows where lsof is unavailable

The implementation is defensive with proper error handling, timeout protection on all spawn calls, and tests covering the core functionality.
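As an illustration of the -Fpc field format mentioned above, the sketch below parses lsof field output, where each line starts with a one-character tag (p for PID, c for command name), and filters out the current process. It is a sketch under assumptions, not the merged code: the extra lsof flags (-nP, -iTCP, -sTCP:LISTEN), the 2000ms timeout, the helper names, and the "openclaw" substring match on the command name are all hypothetical.

```typescript
import { spawnSync } from "node:child_process";

interface LsofProcess {
  pid: number;
  command: string;
}

/** Parse `lsof -Fpc` output: "p<pid>" opens a process set, "c<name>" names it. */
function parseLsofFieldOutput(stdout: string): LsofProcess[] {
  const processes: LsofProcess[] = [];
  let current: LsofProcess | undefined;
  for (const line of stdout.split(/\r?\n/)) {
    if (line.length === 0) continue;
    const tag = line[0];
    const value = line.slice(1);
    if (tag === "p") {
      const pid = Number.parseInt(value, 10);
      current = Number.isInteger(pid) && pid > 0 ? { pid, command: "" } : undefined;
      if (current) processes.push(current);
    } else if (tag === "c" && current) {
      current.command = value;
    }
    // Other tags (for example "f" file-descriptor lines) are ignored.
  }
  return processes;
}

/** Keep gateway-looking processes, never the current one (the match rule is assumed). */
function selectStaleGatewayPids(
  processes: LsofProcess[],
  selfPid: number = process.pid,
  commandHint = "openclaw",
): number[] {
  return processes
    .filter((p) => p.pid !== selfPid && p.command.toLowerCase().includes(commandHint))
    .map((p) => p.pid);
}

function findGatewayPidsOnPortSync(port: number): number[] {
  // lsof is not available on Windows, so report nothing there.
  if (process.platform === "win32") return [];
  const result = spawnSync(
    "lsof",
    ["-nP", `-iTCP:${port}`, "-sTCP:LISTEN", "-Fpc"],
    { encoding: "utf8", timeout: 2000 },
  );
  // A missing binary or a timeout sets result.error; treat both as "nothing found".
  if (result.error || typeof result.stdout !== "string") return [];
  return selectStaleGatewayPids(parseLsofFieldOutput(result.stdout));
}
```

Splitting parsing and filtering from the spawn call keeps the parts that can go wrong quietly (field parsing, self-exclusion) testable with plain strings.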

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The implementation correctly solves the stated problem with defensive programming practices. The lsof output parsing is correct for the -Fpc format, process termination follows standard Unix patterns (SIGTERM → SIGKILL), and all edge cases are handled (Windows platform, process already exited, current PID exclusion). The code includes proper timeout protection on all spawn calls and error handling via try-catch blocks. Tests verify core functionality including empty array returns, current PID exclusion, and array type guarantees.
  • No files require special attention

Last reviewed commit: bc37b36

steipete merged commit 63c6080 into openclaw:main on Feb 26, 2026
27 of 29 checks passed
steipete commented

Landed via /landpr flow.

  • Gate: pnpm check && pnpm build && pnpm test -- src/infra/restart.test.ts src/auto-reply/reply/commands-session.test.ts
  • Land commit: 98c85ceba60382c9e3e53fa3e95053d59e7c7aec
  • Merge commit: 63c6080

Thanks @Sid-Qin!

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway restart via launchctl has no effect when gateway was started via TUI or CLI — orphaned process holds port
