fix(windows): stabilize gateway restart and avoid false stale cleanup [AI-assisted]#50136
fix(windows): stabilize gateway restart and avoid false stale cleanup [AI-assisted]#50136CharlesHappy wants to merge 1 commit intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR stabilizes Key observations:
Confidence Score: 3/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 249-254
Comment:
**Identity warning not added to `warnings` in non-JSON mode**
The identity warning uses a mutually exclusive `if (!json) ... else ...` pattern — it's either logged *or* pushed to `warnings`, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to `warnings` unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the `warnings` array after a non-JSON restart will silently miss this caveat.
```suggestion
const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
warnings.push(identityWarning);
if (!json) {
defaultRuntime.log(theme.warn(identityWarning));
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 213-235
Comment:
**`process.platform` override leaks into subsequent tests**
Both new tests in this file (`"treats a busy unknown Windows listener…"` at line 213 and `"reports probe result, listener classification, and failure reason in diagnostics"` at line 268) call `Object.defineProperty(process, "platform", { value: "win32", … })` but never restore the original value. This lets the `"win32"` override bleed into later tests — most notably `"treats auth-closed probe as healthy gateway reachability"` (line 237), which runs on the leaked `"win32"` platform and may silently change its behavior.
The newly added test in `lifecycle.test.ts` already demonstrates the correct pattern using `try / finally`:
```ts
const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
// ... test body ...
} finally {
Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}
```
Apply the same `try / finally` guard to both new tests here (line 213 and line 268).
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: "fix(windows): stabil..." |
| const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`; | ||
| if (!json) { | ||
| defaultRuntime.log(theme.warn(identityWarning)); | ||
| } else { | ||
| warnings.push(identityWarning); | ||
| } |
There was a problem hiding this comment.
Identity warning not added to
warnings in non-JSON mode
The identity warning uses a mutually exclusive if (!json) ... else ... pattern — it's either logged or pushed to warnings, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to warnings unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the warnings array after a non-JSON restart will silently miss this caveat.
| const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`; | |
| if (!json) { | |
| defaultRuntime.log(theme.warn(identityWarning)); | |
| } else { | |
| warnings.push(identityWarning); | |
| } | |
| const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`; | |
| warnings.push(identityWarning); | |
| if (!json) { | |
| defaultRuntime.log(theme.warn(identityWarning)); | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/lifecycle.ts
Line: 249-254
Comment:
**Identity warning not added to `warnings` in non-JSON mode**
The identity warning uses a mutually exclusive `if (!json) ... else ...` pattern — it's either logged *or* pushed to `warnings`, never both. Every other warning in this function (e.g. the stale-pid warning just above) is always pushed to `warnings` unconditionally AND then also logged in non-JSON mode. As a result, any code that inspects the `warnings` array after a non-JSON restart will silently miss this caveat.
```suggestion
const identityWarning = `Gateway restart became healthy, but process identity did not clearly change from pid ${preRestartRuntimePid}; runtime may be lagging or the old process may still own the listener.`;
warnings.push(identityWarning);
if (!json) {
defaultRuntime.log(theme.warn(identityWarning));
}
```
How can I resolve this? If you propose a fix, please make it concise.| it("treats a busy unknown Windows listener as healthy when the probe succeeds", async () => { | ||
| Object.defineProperty(process, "platform", { value: "win32", configurable: true }); | ||
| classifyPortListener.mockReturnValue("unknown"); | ||
| probeGateway.mockResolvedValue({ | ||
| ok: true, | ||
| close: null, | ||
| }); | ||
|
|
||
| const snapshot = await inspectGatewayRestartWithSnapshot({ | ||
| runtime: { status: "stopped" }, | ||
| portUsage: { | ||
| port: 18789, | ||
| status: "busy", | ||
| listeners: [{ pid: 9100, commandLine: "" }], | ||
| hints: [], | ||
| }, | ||
| includeUnknownListenersAsStale: true, | ||
| }); | ||
|
|
||
| expect(snapshot.healthy).toBe(true); | ||
| expect(snapshot.staleGatewayPids).toEqual([]); | ||
| expect(snapshot.probeResult).toBe("ok"); | ||
| }); |
There was a problem hiding this comment.
process.platform override leaks into subsequent tests
Both new tests in this file ("treats a busy unknown Windows listener…" at line 213 and "reports probe result, listener classification, and failure reason in diagnostics" at line 268) call Object.defineProperty(process, "platform", { value: "win32", … }) but never restore the original value. This lets the "win32" override bleed into later tests — most notably "treats auth-closed probe as healthy gateway reachability" (line 237), which runs on the leaked "win32" platform and may silently change its behavior.
The newly added test in lifecycle.test.ts already demonstrates the correct pattern using try / finally:
const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
// ... test body ...
} finally {
Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}Apply the same try / finally guard to both new tests here (line 213 and line 268).
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.test.ts
Line: 213-235
Comment:
**`process.platform` override leaks into subsequent tests**
Both new tests in this file (`"treats a busy unknown Windows listener…"` at line 213 and `"reports probe result, listener classification, and failure reason in diagnostics"` at line 268) call `Object.defineProperty(process, "platform", { value: "win32", … })` but never restore the original value. This lets the `"win32"` override bleed into later tests — most notably `"treats auth-closed probe as healthy gateway reachability"` (line 237), which runs on the leaked `"win32"` platform and may silently change its behavior.
The newly added test in `lifecycle.test.ts` already demonstrates the correct pattern using `try / finally`:
```ts
const originalPlatform = process.platform;
Object.defineProperty(process, "platform", { value: "win32", configurable: true });
try {
// ... test body ...
} finally {
Object.defineProperty(process, "platform", { value: originalPlatform, configurable: true });
}
```
Apply the same `try / finally` guard to both new tests here (line 213 and line 268).
How can I resolve this? If you propose a fix, please make it concise.|
Codex review: found issues before merge. Summary Reproducibility: yes. by source inspection. On current main, win32 restart health can classify unknown listeners as stale PIDs and the restart path then terminates those returned PIDs; the loose classifier also treats any command text containing "openclaw" as a gateway listener. Next step before merge Security Review findings
Review detailsBest possible solution: Land a rebased or replacement patch that removes unknown-listener termination from Windows restart recovery, tightens listener classification with current argv helpers, preserves current restart-health/version/plugin/channel gates, keeps the exact auth-close allowlist, and adds focused regression coverage plus a changelog entry. Do we have a high-confidence way to reproduce the issue? Yes, by source inspection. On current main, win32 restart health can classify unknown listeners as stale PIDs and the restart path then terminates those returned PIDs; the loose classifier also treats any command text containing "openclaw" as a gateway listener. Is this the best way to solve the issue? No, not as-is. The PR direction is useful, but the branch must be rebased or replaced so its listener-safety changes are layered onto the current restart-health contract and exact restart-probe close allowlist. Full review comments:
Overall correctness: patch is incorrect Security concerns:
Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 103cdd9d96f8. |
Problem
On native Windows environments,
openclaw gateway restartcould be unreliable and sometimes incorrect:Root cause
This was caused by several Windows-specific issues:
Unknown listeners treated as stale
Loose gateway process classification
Lagging or unreliable runtime metadata
schtasksruntime state may be delayed or incomplete, and is not a reliable signal immediately after restart.Insufficient restart identity validation
What changed
1. Safer restart recovery (Windows)
includeUnknownListenersAsStalefrom the Windows service-managed restart path2. Stronger restart health criteria
port busy + probe successis treated as sufficient restart healthunknownor lagging3. Strict gateway process classification
Gateway identification now requires:
isGatewayArgv(...)), ORopenclaw(.exe)oropenclaw-gateway(.exe)Ambiguous processes remain
unknownand are never selected for termination4. Lightweight restart identity check
5. Improved diagnostics
Restart logs now include:
6. Tests updated and added
src/cli/daemon-cli/restart-health.test.tssrc/cli/daemon-cli/lifecycle.test.tssrc/infra/ports-format.test.tssrc/cli/daemon-cli/status.gather.test.tsUpdates include:
GatewayRestartSnapshotfields:probeResultlistenerKindsWhy this fixes Windows restart
Before:
After:
This removes false-positive cleanup paths and stabilizes restart behavior on Windows while preserving existing Linux/macOS behavior.
Validation
Automated
pnpm checkpassedpnpm test -- src/cli/daemon-cli/restart-health.test.ts \ src/cli/daemon-cli/lifecycle.test.ts \ src/infra/ports-format.test.ts \ src/cli/daemon-cli/status.gather.test.ts