fix: implement Windows stale gateway process cleanup before restart #60480

Merged
obviyus merged 6 commits into openclaw:main from arifahmedjoy:fix/windows-stale-gateway-cleanup
Apr 5, 2026

Conversation

@arifahmedjoy
Contributor

@arifahmedjoy arifahmedjoy commented Apr 3, 2026

Summary

Fixes #60878

findGatewayPidsOnPortSync() in src/infra/restart-stale-pids.ts returned [] immediately on Windows (process.platform === 'win32'), causing cleanStaleGatewayProcessesSync() to silently skip killing old gateway processes during self-restart via the schtasks supervisor path (triggerOpenClawRestart).

This caused an infinite retry loop on Windows:

[gateway] already running under schtasks; waiting 5000ms before retrying startup

The new gateway instance could never bind port 18789 because the old process was never terminated.

Root Cause

The self-restart path on Windows (triggerOpenClawRestart -> relaunchGatewayScheduledTask) calls cleanStaleGatewayProcessesSync() before spawning the schtasks restart script. But since findGatewayPidsOnPortSync returns [] on Windows, no stale processes are ever found or killed. The new schtasks-launched gateway then races the old one for the port and enters an unbounded retry loop.

Note: openclaw daemon restart works correctly because it uses the restartScheduledTask() path in service.ts, which properly calls terminateScheduledTaskGatewayListeners() with the Windows-aware findVerifiedGatewayListenerPidsOnPortSync(). Only the in-process self-restart path was broken.
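
To make the failure mode concrete, here is a minimal sketch of the pre-fix behaviour. `findPidsBeforeFix` and `cleanStale` are hypothetical stand-ins written for illustration, not the real functions in restart-stale-pids.ts; the only detail taken from the description above is the early `[]` return on `win32`.

```typescript
type Platform = "win32" | "linux" | "darwin";

// Hypothetical stand-in for the pre-fix lookup: on win32 it bails out
// before looking at the port, so callers never see a stale PID.
function findPidsBeforeFix(platform: Platform, listeners: number[]): number[] {
  if (platform === "win32") {
    return [];
  }
  return listeners;
}

// Hypothetical stand-in for the cleanup step: it can only kill what the
// lookup reports, so an empty result turns it into a silent no-op.
function cleanStale(
  platform: Platform,
  listeners: number[],
  kill: (pid: number) => void,
): number[] {
  const pids = findPidsBeforeFix(platform, listeners);
  for (const pid of pids) {
    kill(pid);
  }
  return pids;
}
```

With a listener present on the port, the `win32` call kills nothing while the same call on another platform kills the listener, which is why the old gateway kept holding the port.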

Changes

New: src/infra/windows-port-pids.ts

Extracted all Windows-specific port scanning and process-args helpers from gateway-processes.ts into a shared module with configurable timeoutMs parameter. This:

  • Breaks the circular import between restart-stale-pids.ts and gateway-processes.ts (both now import from windows-port-pids.ts instead of from each other)
  • Fixes poll budget overshoot: Windows poll calls use POLL_SPAWN_TIMEOUT_MS (400ms) instead of the 5000ms default, keeping each poll within the waitForPortFreeSync 2s budget
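
As a rough illustration of what a shared helper with a configurable `timeoutMs` could look like, the sketch below parses `netstat -ano`-style output for listening PIDs on a port. Everything here is an assumption made for the example: the names `parseListeningPids`, `readListeningPids`, and `CommandRunner` are hypothetical, the command runner is injected so the timeout plumbing is visible without spawning anything, and the real windows-port-pids.ts may rely on PowerShell rather than netstat.

```typescript
// Hypothetical parser for `netstat -ano -p tcp` style output.
// Expected columns: Proto, Local Address, Foreign Address, State, PID.
function parseListeningPids(output: string, port: number): number[] {
  const pids = new Set<number>();
  for (const line of output.split(/\r?\n/)) {
    const cols = line.trim().split(/\s+/);
    if (cols.length < 5) continue;
    const proto = cols[0] ?? "";
    const local = cols[1] ?? "";
    const state = cols[3] ?? "";
    const pidText = cols[4] ?? "";
    if (proto !== "TCP" || state !== "LISTENING") continue;
    // Matches both "0.0.0.0:18789" and "[::]:18789".
    if (!local.endsWith(`:${port}`)) continue;
    const pid = Number.parseInt(pidText, 10);
    if (Number.isInteger(pid) && pid > 0) pids.add(pid);
  }
  return Array.from(pids);
}

// Injected runner: returns the command output, or null when the probe
// failed or hit its timeout. Callers choose the timeout per call site.
type CommandRunner = (timeoutMs: number) => string | null;

function readListeningPids(
  port: number,
  timeoutMs: number,
  run: CommandRunner,
): number[] | null {
  const output = run(timeoutMs);
  return output === null ? null : parseListeningPids(output, port);
}
```

Because both callers would depend only on a module like this one, neither needs to import the other, and a polling caller can pass a short timeout while discovery keeps a longer one.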

src/infra/restart-stale-pids.ts

  • findGatewayPidsOnPortSync: On win32, discovers listening PIDs via readWindowsListeningPidsOnPortSync + verifies each with readWindowsProcessArgsSync / isGatewayArgv
  • pollPortOnceWindows: Uses readWindowsListeningPidsOnPortSync(port, 400) — just checks if port has any listener, no verification needed (stale gateway already killed)
  • terminateStaleProcessesSync: Add terminateStaleProcessesWindows() using taskkill.exe (graceful /T first, then /F force-kill) instead of SIGTERM/SIGKILL
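
The two-phase `taskkill` escalation described in the last bullet could be sketched roughly as follows. The helper names, the injected `run` and `isAlive` callbacks, and the choice to keep `/T` on the forced pass are assumptions for illustration; only the `/PID`, `/T`, and `/F` switches themselves are standard `taskkill.exe` options. The sketch also omits any grace-period wait between the two passes.

```typescript
type RunTaskkill = (args: string[]) => void;
type IsAlive = (pid: number) => boolean;

// Build the taskkill argument list for one PID.
// /T terminates the process tree; /F forces termination.
function taskkillArgs(pid: number, force: boolean): string[] {
  const args = ["/PID", String(pid), "/T"];
  if (force) {
    args.push("/F");
  }
  return args;
}

// Graceful pass over every PID, then a forced pass over the survivors.
// Returns the PIDs that needed the forced pass.
function terminateStale(pids: number[], run: RunTaskkill, isAlive: IsAlive): number[] {
  for (const pid of pids) {
    run(taskkillArgs(pid, false));
  }
  const survivors = pids.filter((pid) => isAlive(pid));
  for (const pid of survivors) {
    run(taskkillArgs(pid, true));
  }
  return survivors;
}
```

Injecting `run` keeps the escalation logic testable on any platform; in real use it would wrap a synchronous spawn of `taskkill.exe`.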

src/infra/gateway-processes.ts

  • Delegates Windows helpers to the new windows-port-pids.ts module
  • Removes ~100 lines of inlined Windows functions
  • Keeps findVerifiedGatewayListenerPidsOnPortSync public API unchanged
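
For readers unfamiliar with the "discover, then verify" pattern these helpers implement, a hedged sketch follows. `looksLikeGatewayArgv` is a hypothetical predicate standing in for the real `isGatewayArgv` (whose actual matching rules are not shown in this PR description), and the reader callbacks are injected placeholders. This sketch treats a failed probe as "nothing verified".

```typescript
type ReadListeners = (port: number, timeoutMs: number) => number[] | null;
type ReadArgs = (pid: number, timeoutMs: number) => string[] | null;

// Hypothetical predicate: some argument mentions "openclaw" and a
// "gateway" subcommand is present. The real isGatewayArgv may differ.
function looksLikeGatewayArgv(argv: string[]): boolean {
  const lowered = argv.map((arg) => arg.toLowerCase());
  return lowered.some((arg) => arg.includes("openclaw")) && lowered.includes("gateway");
}

// Discover listeners on the port, then keep only PIDs whose command
// line passes the predicate, so unrelated listeners are never killed.
function findVerifiedPids(
  port: number,
  timeoutMs: number,
  readListeners: ReadListeners,
  readArgs: ReadArgs,
): number[] {
  const listeners = readListeners(port, timeoutMs);
  if (listeners === null) {
    return [];
  }
  return listeners.filter((pid) => {
    const argv = readArgs(pid, timeoutMs);
    return argv !== null && looksLikeGatewayArgv(argv);
  });
}
```

The verification step is what makes it safe to terminate the result: a foreign process that happens to hold the port is filtered out before any kill is attempted.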

src/infra/restart-stale-pids.test.ts

  • Mocks windows-port-pids.js (port scanning + process args) for the win32 platform-mock tests
  • Updated win32 tests verify delegation to readWindowsListeningPidsOnPortSync and readWindowsProcessArgsSync
  • Tests use real isGatewayArgv for full integration coverage

Testing

  • Lightly tested: verified fix resolves the infinite restart loop on Windows 11
  • Confirmed openclaw daemon restart and openclaw gateway call health work after fix
  • Existing Unix tests unaffected — all 38 restart-stale-pids tests pass
  • All 6 gateway-processes tests pass after refactor
  • Updated test mocks verify win32 delegation with real isGatewayArgv validation

Copilot AI review requested due to automatic review settings April 3, 2026 18:45
@greptile-apps
Contributor

greptile-apps Bot commented Apr 3, 2026

Greptile Summary

This PR fixes a real Windows-only bug where findGatewayPidsOnPortSync returned [] on win32, causing cleanStaleGatewayProcessesSync to skip killing the old gateway before schtasks relaunched a new one, producing an unbounded port-conflict loop. The approach — delegating to findVerifiedGatewayListenerPidsOnPortSync and adding taskkill-based termination — is directionally correct, but the implementation has two structural issues that should be addressed before merging:

  • Circular import: restart-stale-pids.ts now imports from gateway-processes.ts, which already imports from restart-stale-pids.ts, creating a mutual cycle that can trip bundlers and violates the codebase's module-boundary conventions.
  • Poll budget overshoot: pollPortOnceWindows delegates to the same PowerShell helper, whose internal timeout is 5 000 ms — 2.5× the waitForPortFreeSync 2 000 ms budget — so a single slow PowerShell invocation blocks the entire polling loop well past the intended ceiling.

Confidence Score: 3/5

Not safe to merge as-is; two structural issues — a circular module dependency and a poll-budget mismatch — should be resolved first.

Both findings are P1: the circular import between restart-stale-pids.ts and gateway-processes.ts is a real module-graph defect that bundlers can mishandle, and the 5 000 ms PowerShell timeout exceeding the 2 000 ms waitForPortFreeSync budget means the port-free polling loop can stall for much longer than intended on Windows. The core fix for the infinite restart loop is valid, but these two issues need to be addressed before landing.

src/infra/restart-stale-pids.ts (circular import with gateway-processes.ts; poll timeout exceeds budget)

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 5

Comment:
**Circular import between `restart-stale-pids.ts` and `gateway-processes.ts`**

This new import creates a mutual circular dependency: `restart-stale-pids.ts` → `gateway-processes.ts` → `restart-stale-pids.ts` (line 5 of `gateway-processes.ts` already imports `findGatewayPidsOnPortSync` from here). While ESM live bindings mean this won't throw a "cannot access before initialization" error at runtime (both exports are `function` declarations), it is a structural issue that bundlers (esbuild, rollup) can mishandle and that the codebase's dynamic-import guardrail policy is designed to avoid.

The cleaner fix is to lift the Windows-specific `readWindowsListeningPidsOnPortSync` (or the verified variant) out of `gateway-processes.ts` into a shared `windows-port-pids.ts` helper that neither file needs to cross-import.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/infra/restart-stale-pids.ts
Line: 194-201

Comment:
**Windows poll can block for 5 s, far exceeding the 2 s budget in `waitForPortFreeSync`**

`findVerifiedGatewayListenerPidsOnPortSync` runs PowerShell with `WINDOWS_GATEWAY_DISCOVERY_TIMEOUT_MS = 5 000 ms`. The `waitForPortFreeSync` wall-clock budget is `PORT_FREE_TIMEOUT_MS = 2 000 ms`. A single hung PowerShell spawn will block the entire polling loop for 5 s — 2.5× the intended ceiling — before `waitForPortFreeSync` even gets to re-check the deadline. On Unix, `POLL_SPAWN_TIMEOUT_MS = 400 ms` keeps each poll well under the budget.

The Windows helper should respect the same short per-call timeout, or `pollPortOnceWindows` should accept a timeout and pass it through to the underlying PowerShell / netstat calls.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix: implement Windows stale gateway pro..."

Comment thread src/infra/restart-stale-pids.ts Outdated
Comment thread src/infra/restart-stale-pids.ts
Contributor

Copilot AI left a comment

Pull request overview

Fixes Windows self-restart getting stuck in an infinite “already running under schtasks” retry loop by ensuring stale gateway listeners are discovered and terminated before attempting to relaunch.

Changes:

  • On Windows, findGatewayPidsOnPortSync() now uses the existing verified Windows listener discovery instead of returning [].
  • Adds a Windows-specific port polling path for waitForPortFreeSync() using the same listener discovery.
  • Adds a Windows-specific stale-process terminator using taskkill (tree kill first, then /F escalation).

Comment thread src/infra/restart-stale-pids.ts
Comment thread src/infra/restart-stale-pids.ts Outdated
Comment thread src/infra/restart-stale-pids.ts
Comment thread src/infra/restart-stale-pids.ts
@arifahmedjoy arifahmedjoy force-pushed the fix/windows-stale-gateway-cleanup branch 2 times, most recently from 9cca773 to d492f2c Compare April 3, 2026 19:25

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d492f2c6d2

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/restart-stale-pids.ts Outdated
Comment thread src/infra/restart-stale-pids.ts
@arifahmedjoy arifahmedjoy force-pushed the fix/windows-stale-gateway-cleanup branch from d492f2c to a8e2bea Compare April 3, 2026 19:32

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: a8e2bead83

Comment thread src/infra/restart-stale-pids.ts Outdated
Comment thread src/infra/restart-stale-pids.ts
@arifahmedjoy
Contributor Author

CI Failure Analysis

All current CI failures are pre-existing on main and unrelated to this PR (which only modifies src/infra/restart-stale-pids.ts and src/infra/restart-stale-pids.test.ts):

| Job | Error | Related? |
| --- | --- | --- |
| check | extensions/slack/src/outbound-adapter.ts:66: Type 'OpenClawConfig \| undefined' is not assignable to type 'OpenClawConfig' | No |
| checks-windows-node-test | src/cli/update-cli.test.ts:817: expected undefined to be defined | No |
| checks-node-test | src/process/exec.test.ts:92,107: test timeouts (15s) | No |
| checks-fast-contracts-protocol | src/plugins/contracts/plugin-sdk-subpaths.test.ts:130: missing command-auth exports | No |
| checks-fast-extensions-shard-3 | extensions/feishu/src/tool-account-routing.test.ts:97: assertion mismatch | No |
| checks-fast-extensions-shard-6 / check-additional | lint:plugins:no-extension-test-core-imports guard failure | No |

Happy to rebase onto main once these are resolved upstream.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: d4666f364c

Comment thread src/infra/restart-stale-pids.ts Outdated
@arifahmedjoy arifahmedjoy force-pushed the fix/windows-stale-gateway-cleanup branch from d4666f3 to f97dd5c Compare April 3, 2026 19:53
@arifahmedjoy
Contributor Author

Update: Greptile P1 findings resolved

Both structural issues flagged by Greptile have been addressed in the latest push (f97dd5c):

1. Circular import — fixed

Extracted all Windows-specific port scanning and process-args helpers from gateway-processes.ts into a new shared src/infra/windows-port-pids.ts module. Both restart-stale-pids.ts and gateway-processes.ts now import from windows-port-pids.ts — no circular dependency.

2. Poll budget overshoot — fixed

pollPortOnceWindows now passes POLL_SPAWN_TIMEOUT_MS (400ms) to readWindowsListeningPidsOnPortSync instead of using the 5000ms default. Each poll stays within the waitForPortFreeSync 2s budget, matching the Unix lsof behavior. The poll also skips per-PID command-line verification (unnecessary since stale gateway is already killed in the prior step).
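
A sketch of how a bounded poll loop keeps each probe inside the overall budget is shown below. The constant values come from the discussion above (`PORT_FREE_TIMEOUT_MS` = 2000, `POLL_SPAWN_TIMEOUT_MS` = 400); the function name `waitForPortFree`, the injected `probe`, `now`, and `sleep` callbacks, the 50 ms poll interval, and the clamping of the per-call timeout to the remaining budget are all assumptions added for illustration and may not match the merged code.

```typescript
const PORT_FREE_TIMEOUT_MS = 2000;
const POLL_SPAWN_TIMEOUT_MS = 400;
const POLL_INTERVAL_MS = 50; // assumed value, not taken from the PR

// Returns the PIDs listening on the port, or null if the probe failed.
type Probe = (timeoutMs: number) => number[] | null;

// Poll until the port has no listener or the wall-clock budget runs out.
// Each probe gets at most POLL_SPAWN_TIMEOUT_MS, and never more than the
// time left, so one slow probe cannot push the loop past the deadline.
function waitForPortFree(
  probe: Probe,
  now: () => number,
  sleep: (ms: number) => void,
  budgetMs: number = PORT_FREE_TIMEOUT_MS,
): boolean {
  const deadline = now() + budgetMs;
  while (now() < deadline) {
    const remaining = deadline - now();
    const pids = probe(Math.min(POLL_SPAWN_TIMEOUT_MS, remaining));
    // Any listener counts as "busy"; no command-line verification here.
    if (pids !== null && pids.length === 0) {
      return true;
    }
    sleep(Math.min(POLL_INTERVAL_MS, Math.max(0, deadline - now())));
  }
  return false;
}
```

With the earlier 5000 ms default, a single hung probe would have consumed more than the whole 2000 ms budget before the loop could re-check its deadline; a 400 ms cap allows several attempts within the same window.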

CI status

All PR-related checks are now passing (including check, check-additional, and all extension shards). The 3 remaining failures (checks-fast-contracts-protocol, checks-node-test, checks-windows-node-test) are pre-existing on main and unrelated to this change.


@steipete @vincentkoc — this is ready for review when you have a chance. Happy to address any further feedback.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 30467bbe0c

Comment thread src/infra/windows-port-pids.ts
Comment thread src/infra/restart-stale-pids.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: 5cf46b7588

Comment thread src/infra/restart-stale-pids.ts
@obviyus obviyus self-assigned this Apr 5, 2026
arifahmedjoy and others added 6 commits April 5, 2026 20:59
findGatewayPidsOnPortSync() returned [] immediately on Windows, causing
cleanStaleGatewayProcessesSync() to skip killing old gateway processes
during self-restart (triggerOpenClawRestart -> schtasks path). This led
to an infinite retry loop: 'gateway already running under schtasks;
waiting 5000ms before retrying startup'.

Changes:
- Extract Windows port/process helpers into shared windows-port-pids.ts
  to break the circular import between restart-stale-pids.ts and
  gateway-processes.ts, with configurable timeoutMs for poll compliance
- findGatewayPidsOnPortSync: discover + verify Windows gateway PIDs via
  readWindowsListeningPidsOnPortSync + readWindowsProcessArgsSync
- pollPortOnceWindows: use short POLL_SPAWN_TIMEOUT_MS (400ms) so a
  single slow PowerShell call cannot exceed the 2s polling budget
- terminateStaleProcessesSync: add terminateStaleProcessesWindows using
  taskkill.exe (graceful /T first, then /F force-kill)

Fixes the Windows gateway restart infinite loop caused by the schtasks
supervisor detecting a port conflict it cannot resolve.
@obviyus obviyus force-pushed the fix/windows-stale-gateway-cleanup branch from 535c323 to 9751f62 Compare April 5, 2026 15:30
Contributor

@obviyus obviyus left a comment

Reviewed latest changes; landing now.

@obviyus obviyus merged commit 63fcc52 into openclaw:main Apr 5, 2026
17 of 31 checks passed
@obviyus
Contributor

obviyus commented Apr 5, 2026

Landed on main.

Thanks @arifahmedjoy.

@arifahmedjoy
Contributor Author

You're welcome, @obviyus.

lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
…nks @arifahmedjoy)

* fix: implement Windows stale gateway process cleanup before restart

* fix: tighten windows stale gateway cleanup

* fix: preserve windows restart probe failures

* refactor: unify windows gateway pid verification

* fix: preserve windows argv probe failures

* fix: windows self-restart stale gateway cleanup (openclaw#60480) (thanks @arifahmedjoy)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Windows gateway self-restart enters infinite retry loop — stale process never killed

3 participants