fix: implement Windows stale gateway process cleanup before restart#60480
Conversation
Greptile SummaryThis PR fixes a real Windows-only bug where
Confidence Score: 3/5Not safe to merge as-is; two structural issues — a circular module dependency and a poll-budget mismatch — should be resolved first. Both findings are P1: the circular import between restart-stale-pids.ts and gateway-processes.ts is a real module-graph defect that bundlers can mishandle, and the 5 000 ms PowerShell timeout exceeding the 2 000 ms waitForPortFreeSync budget means the port-free polling loop can stall for much longer than intended on Windows. The core fix for the infinite restart loop is valid, but these two issues need to be addressed before landing. src/infra/restart-stale-pids.ts (circular import with gateway-processes.ts; poll timeout exceeds budget) Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 5
Comment:
**Circular import between `restart-stale-pids.ts` and `gateway-processes.ts`**
This new import creates a mutual circular dependency: `restart-stale-pids.ts` → `gateway-processes.ts` → `restart-stale-pids.ts` (line 5 of `gateway-processes.ts` already imports `findGatewayPidsOnPortSync` from here). While ESM live bindings mean this won't throw a "cannot access before initialization" error at runtime (both exports are `function` declarations), it is a structural issue that bundlers (esbuild, rollup) can mishandle and that the codebase's dynamic-import guardrail policy is designed to avoid.
The cleaner fix is to lift the Windows-specific `readWindowsListeningPidsOnPortSync` (or the verified variant) out of `gateway-processes.ts` into a shared `windows-port-pids.ts` helper that neither file needs to cross-import.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 194-201
Comment:
**Windows poll can block for 5 s, far exceeding the 2 s budget in `waitForPortFreeSync`**
`findVerifiedGatewayListenerPidsOnPortSync` runs PowerShell with `WINDOWS_GATEWAY_DISCOVERY_TIMEOUT_MS = 5 000 ms`. The `waitForPortFreeSync` wall-clock budget is `PORT_FREE_TIMEOUT_MS = 2 000 ms`. A single hung PowerShell spawn will block the entire polling loop for 5 s — 2.5× the intended ceiling — before `waitForPortFreeSync` even gets to re-check the deadline. On Unix, `POLL_SPAWN_TIMEOUT_MS = 400 ms` keeps each poll well under the budget.
The Windows helper should respect the same short per-call timeout, or `pollPortOnceWindows` should accept a timeout and pass it through to the underlying PowerShell / netstat calls.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix: implement Windows stale gateway pro..." | Re-trigger Greptile |
There was a problem hiding this comment.
Pull request overview
Fixes Windows self-restart getting stuck in an infinite “already running under schtasks” retry loop by ensuring stale gateway listeners are discovered and terminated before attempting to relaunch.
Changes:
- On Windows,
findGatewayPidsOnPortSync()now uses the existing verified Windows listener discovery instead of returning[]. - Adds a Windows-specific port polling path for
waitForPortFreeSync()using the same listener discovery. - Adds a Windows-specific stale-process terminator using
taskkill(tree kill first, then/Fescalation).
9cca773 to
d492f2c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d492f2c6d2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
d492f2c to
a8e2bea
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a8e2bead83
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
CI Failure AnalysisAll current CI failures are pre-existing on
Happy to rebase onto |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4666f364c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
d4666f3 to
f97dd5c
Compare
Update: Greptile P1 findings resolvedBoth structural issues flagged by Greptile have been addressed in the latest push ( 1. Circular import — fixedExtracted all Windows-specific port scanning and process-args helpers from 2. Poll budget overshoot — fixed
CI statusAll PR-related checks are now passing (including @steipete @vincentkoc — this is ready for review when you have a chance. Happy to address any further feedback. |
f97dd5c to
30467bb
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30467bbe0c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5cf46b7588
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
findGatewayPidsOnPortSync() returned [] immediately on Windows, causing cleanStaleGatewayProcessesSync() to skip killing old gateway processes during self-restart (triggerOpenClawRestart -> schtasks path). This led to an infinite retry loop: 'gateway already running under schtasks; waiting 5000ms before retrying startup'. Changes: - Extract Windows port/process helpers into shared windows-port-pids.ts to break the circular import between restart-stale-pids.ts and gateway-processes.ts, with configurable timeoutMs for poll compliance - findGatewayPidsOnPortSync: discover + verify Windows gateway PIDs via readWindowsListeningPidsOnPortSync + readWindowsProcessArgsSync - pollPortOnceWindows: use short POLL_SPAWN_TIMEOUT_MS (400ms) so a single slow PowerShell call cannot exceed the 2s polling budget - terminateStaleProcessesSync: add terminateStaleProcessesWindows using taskkill.exe (graceful /T first, then /F force-kill) Fixes the Windows gateway restart infinite loop caused by the schtasks supervisor detecting a port conflict it cannot resolve.
535c323 to
9751f62
Compare
obviyus
left a comment
There was a problem hiding this comment.
Reviewed latest changes; landing now.
|
Landed on main. Thanks @arifahmedjoy. |
|
You're welcome, @obviyus. |
…nks @arifahmedjoy) * fix: implement Windows stale gateway process cleanup before restart findGatewayPidsOnPortSync() returned [] immediately on Windows, causing cleanStaleGatewayProcessesSync() to skip killing old gateway processes during self-restart (triggerOpenClawRestart -> schtasks path). This led to an infinite retry loop: 'gateway already running under schtasks; waiting 5000ms before retrying startup'. Changes: - Extract Windows port/process helpers into shared windows-port-pids.ts to break the circular import between restart-stale-pids.ts and gateway-processes.ts, with configurable timeoutMs for poll compliance - findGatewayPidsOnPortSync: discover + verify Windows gateway PIDs via readWindowsListeningPidsOnPortSync + readWindowsProcessArgsSync - pollPortOnceWindows: use short POLL_SPAWN_TIMEOUT_MS (400ms) so a single slow PowerShell call cannot exceed the 2s polling budget - terminateStaleProcessesSync: add terminateStaleProcessesWindows using taskkill.exe (graceful /T first, then /F force-kill) Fixes the Windows gateway restart infinite loop caused by the schtasks supervisor detecting a port conflict it cannot resolve. * fix: tighten windows stale gateway cleanup * fix: preserve windows restart probe failures * refactor: unify windows gateway pid verification * fix: preserve windows argv probe failures * fix: windows self-restart stale gateway cleanup (openclaw#60480) (thanks @arifahmedjoy) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…nks @arifahmedjoy) * fix: implement Windows stale gateway process cleanup before restart findGatewayPidsOnPortSync() returned [] immediately on Windows, causing cleanStaleGatewayProcessesSync() to skip killing old gateway processes during self-restart (triggerOpenClawRestart -> schtasks path). This led to an infinite retry loop: 'gateway already running under schtasks; waiting 5000ms before retrying startup'. Changes: - Extract Windows port/process helpers into shared windows-port-pids.ts to break the circular import between restart-stale-pids.ts and gateway-processes.ts, with configurable timeoutMs for poll compliance - findGatewayPidsOnPortSync: discover + verify Windows gateway PIDs via readWindowsListeningPidsOnPortSync + readWindowsProcessArgsSync - pollPortOnceWindows: use short POLL_SPAWN_TIMEOUT_MS (400ms) so a single slow PowerShell call cannot exceed the 2s polling budget - terminateStaleProcessesSync: add terminateStaleProcessesWindows using taskkill.exe (graceful /T first, then /F force-kill) Fixes the Windows gateway restart infinite loop caused by the schtasks supervisor detecting a port conflict it cannot resolve. * fix: tighten windows stale gateway cleanup * fix: preserve windows restart probe failures * refactor: unify windows gateway pid verification * fix: preserve windows argv probe failures * fix: windows self-restart stale gateway cleanup (openclaw#60480) (thanks @arifahmedjoy) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
…nks @arifahmedjoy) * fix: implement Windows stale gateway process cleanup before restart findGatewayPidsOnPortSync() returned [] immediately on Windows, causing cleanStaleGatewayProcessesSync() to skip killing old gateway processes during self-restart (triggerOpenClawRestart -> schtasks path). This led to an infinite retry loop: 'gateway already running under schtasks; waiting 5000ms before retrying startup'. Changes: - Extract Windows port/process helpers into shared windows-port-pids.ts to break the circular import between restart-stale-pids.ts and gateway-processes.ts, with configurable timeoutMs for poll compliance - findGatewayPidsOnPortSync: discover + verify Windows gateway PIDs via readWindowsListeningPidsOnPortSync + readWindowsProcessArgsSync - pollPortOnceWindows: use short POLL_SPAWN_TIMEOUT_MS (400ms) so a single slow PowerShell call cannot exceed the 2s polling budget - terminateStaleProcessesSync: add terminateStaleProcessesWindows using taskkill.exe (graceful /T first, then /F force-kill) Fixes the Windows gateway restart infinite loop caused by the schtasks supervisor detecting a port conflict it cannot resolve. * fix: tighten windows stale gateway cleanup * fix: preserve windows restart probe failures * refactor: unify windows gateway pid verification * fix: preserve windows argv probe failures * fix: windows self-restart stale gateway cleanup (openclaw#60480) (thanks @arifahmedjoy) --------- Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Summary
Fixes #60878
findGatewayPidsOnPortSync()insrc/infra/restart-stale-pids.tsreturned[]immediately on Windows (process.platform === 'win32'), causingcleanStaleGatewayProcessesSync()to silently skip killing old gateway processes during self-restart via the schtasks supervisor path (triggerOpenClawRestart).This caused an infinite retry loop on Windows:
The new gateway instance could never bind port 18789 because the old process was never terminated.
Root Cause
The self-restart path on Windows (
triggerOpenClawRestart->relaunchGatewayScheduledTask) callscleanStaleGatewayProcessesSync()before spawning the schtasks restart script. But sincefindGatewayPidsOnPortSyncreturns[]on Windows, no stale processes are ever found or killed. The new schtasks-launched gateway then races the old one for the port and enters an unbounded retry loop.Note:
openclaw daemon restartworks correctly because it uses therestartScheduledTask()path inservice.ts, which properly callsterminateScheduledTaskGatewayListeners()with the Windows-awarefindVerifiedGatewayListenerPidsOnPortSync(). Only the in-process self-restart path was broken.Changes
New:
src/infra/windows-port-pids.tsExtracted all Windows-specific port scanning and process-args helpers from
gateway-processes.tsinto a shared module with configurabletimeoutMsparameter. This:restart-stale-pids.tsandgateway-processes.ts(both now import fromwindows-port-pids.tsinstead of from each other)POLL_SPAWN_TIMEOUT_MS(400ms) instead of the 5000ms default, keeping each poll within thewaitForPortFreeSync2s budgetsrc/infra/restart-stale-pids.tsfindGatewayPidsOnPortSync: On win32, discovers listening PIDs viareadWindowsListeningPidsOnPortSync+ verifies each withreadWindowsProcessArgsSync/isGatewayArgvpollPortOnceWindows: UsesreadWindowsListeningPidsOnPortSync(port, 400)— just checks if port has any listener, no verification needed (stale gateway already killed)terminateStaleProcessesSync: AddterminateStaleProcessesWindows()usingtaskkill.exe(graceful/Tfirst, then/Fforce-kill) instead ofSIGTERM/SIGKILLsrc/infra/gateway-processes.tswindows-port-pids.tsmodulefindVerifiedGatewayListenerPidsOnPortSyncpublic API unchangedsrc/infra/restart-stale-pids.test.tswindows-port-pids.js(port scanning + process args) for the win32 platform-mock testsreadWindowsListeningPidsOnPortSyncandreadWindowsProcessArgsSyncisGatewayArgvfor full integration coverageTesting
openclaw daemon restartandopenclaw gateway call healthwork after fixisGatewayArgvvalidation