Skip to content

fix(cli): stabilize Windows scheduled-task restart health after ready#73889

Open
openclaw-clownfish[bot] wants to merge 1 commit intomainfrom
clownfish/ghcrawl-156629-autonomous-smoke
Open

fix(cli): stabilize Windows scheduled-task restart health after ready#73889
openclaw-clownfish[bot] wants to merge 1 commit intomainfrom
clownfish/ghcrawl-156629-autonomous-smoke

Conversation

@openclaw-clownfish
Copy link
Copy Markdown
Contributor

Summary

Validation

  • pnpm check:changed

References

ProjectClownfish replacement details:

@openclaw-clownfish openclaw-clownfish Bot added the clawsweeper Tracked by ClawSweeper automation label Apr 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added cli CLI command changes size: S labels Apr 29, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

Adds a requireReachability flag to the Windows scheduled-task restart health path, ensuring the loopback HTTP/WebSocket probe must succeed before declaring the restart healthy — even after schtasks and port ownership show ready. The change is narrow, well-tested, and consistent with the existing includeUnknownListenersAsStale Windows-only guard.

  • There is a subtle interaction worth reviewing: when requireReachability=true and the first probe throws (e.g. 3s timeout) rather than returning {ok: false}, reachability stays null, and the pre-existing best-effort second probe block (lines 390-398) re-runs unconditionally because !expectedVersion is true. If that second probe succeeds, healthy ends up true, silently bypassing the requirement. In the primary bug scenario the probe returns {ok: false} and the memoised result prevents this — but the throw path is worth guarding explicitly.

Confidence Score: 4/5

Safe to merge; the fix correctly handles the main Windows post-ready HTTP loss scenario, with one narrow edge case where a thrown probe could be bypassed by the existing best-effort retry block.

Only P2 findings are present. The core logic change is correct for the targeted bug (probe returning {ok:false}), and tests confirm the expected behavior. The probe-throw bypass is unlikely in the specific failure mode being fixed but represents a logical gap worth addressing.

src/cli/daemon-cli/restart-health.ts lines 375-398 — the interaction between the new requireReachability first block and the pre-existing best-effort second block.

Comments Outside Diff (1)

  1. src/cli/daemon-cli/restart-health.ts, line 375-398 (link)

    P2 requireReachability enforcement bypassed on probe throw

    When requireReachability=true (Windows path) and the first probe throws (e.g. a 3s timeout), healthy is set to false and reachability remains null. The second best-effort block (lines 390–398) then fires — !healthy && running && portUsage.status === "busy" && !expectedVersion — and re-executes loadReachability() from scratch. If that second probe succeeds, healthy ends up true, silently circumventing the requireReachability constraint.

    In the specific bug scenario (post-ready HTTP loss), the probe likely returns {ok: false} rather than throwing, so the memoised result from the first call would be returned unchanged. But for timeout failures the path diverges. Consider guarding the second block to skip when requireReachability is set:

    if (!healthy && running && portUsage.status === "busy" && !expectedVersion && !params.requireReachability) {
    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/cli/daemon-cli/restart-health.ts
    Line: 375-398
    
    Comment:
    **`requireReachability` enforcement bypassed on probe throw**
    
    When `requireReachability=true` (Windows path) and the first probe throws (e.g. a 3s timeout), `healthy` is set to `false` and `reachability` remains `null`. The second best-effort block (lines 390–398) then fires — `!healthy && running && portUsage.status === "busy" && !expectedVersion` — and re-executes `loadReachability()` from scratch. If that second probe succeeds, `healthy` ends up `true`, silently circumventing the `requireReachability` constraint.
    
    In the specific bug scenario (post-ready HTTP loss), the probe likely returns `{ok: false}` rather than throwing, so the memoised result from the first call would be returned unchanged. But for timeout failures the path diverges. Consider guarding the second block to skip when `requireReachability` is set:
    
    ```typescript
    if (!healthy && running && portUsage.status === "busy" && !expectedVersion && !params.requireReachability) {
    ```
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/cli/daemon-cli/restart-health.ts
Line: 375-398

Comment:
**`requireReachability` enforcement bypassed on probe throw**

When `requireReachability=true` (Windows path) and the first probe throws (e.g. a 3s timeout), `healthy` is set to `false` and `reachability` remains `null`. The second best-effort block (lines 390–398) then fires — `!healthy && running && portUsage.status === "busy" && !expectedVersion` — and re-executes `loadReachability()` from scratch. If that second probe succeeds, `healthy` ends up `true`, silently circumventing the `requireReachability` constraint.

In the specific bug scenario (post-ready HTTP loss), the probe likely returns `{ok: false}` rather than throwing, so the memoised result from the first call would be returned unchanged. But for timeout failures the path diverges. Consider guarding the second block to skip when `requireReachability` is set:

```typescript
if (!healthy && running && portUsage.status === "busy" && !expectedVersion && !params.requireReachability) {
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(cli): stabilize Windows scheduled-ta..." | Re-trigger Greptile

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs changes before merge.

Summary
The PR adds a Windows-only requireReachability restart-health option, forwards it through gateway restart, adds regression coverage, and adds a changelog entry.

Reproducibility: yes. by source inspection rather than a live Windows run. Current main can treat running && ownsPort as healthy in the normal gateway restart path without requiring a loopback HTTP/WS probe.

Next step before merge
A narrow automated branch repair can change only the changelog marker from a closing keyword to a non-closing reference before normal PR validation.

Security
Cleared: The diff tightens local restart-health probing and updates tests/changelog without adding dependencies, CI execution, secret handling, downloads, or new external trust boundaries.

Review findings

  • [P3] Reference the canonical issue without closing it — CHANGELOG.md:34
Review details

Best possible solution:

Land the focused Windows-only reachability gate after changing the changelog to reference #63491 without closing it, leaving that issue open for the broader Windows Scheduled Task instability investigation.

Do we have a high-confidence way to reproduce the issue?

Yes, by source inspection rather than a live Windows run. Current main can treat running && ownsPort as healthy in the normal gateway restart path without requiring a loopback HTTP/WS probe.

Is this the best way to solve the issue?

Yes, with one wording correction. A Windows-only reachability flag reuses the existing local probe and preserves prior schtasks and policy-close behavior; the changelog should use a non-closing reference to #63491.

Full review comments:

Overall correctness: patch is correct
Overall confidence: 0.86

Acceptance criteria:

  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md
  • pnpm test src/cli/daemon-cli/restart-health.test.ts src/cli/daemon-cli/lifecycle.test.ts
  • pnpm check:changed

What I checked:

  • current_main_restart_call_lacks_required_reachability: The scheduled-task restart path calls waitForGatewayHealthyRestart with includeUnknownListenersAsStale only; there is no option requiring loopback reachability before reporting a healthy restart. (src/cli/daemon-cli/lifecycle.ts:285, e8f13c625e34)
  • current_main_can_return_healthy_from_ownership_only: inspectGatewayRestart initializes healthy from running && ownsPort, and only forces the reachability probe when expectedVersion is present; this leaves the post-ready HTTP/WS loss path source-reproducible. (src/cli/daemon-cli/restart-health.ts:378, e8f13c625e34)
  • prior_policy_close_behavior_preserved_on_main: Current main already has the local 1008 policy-close allowlist for connect/device/pairing/auth cases, so the PR is extending restart health rather than replacing that merged behavior. (src/cli/daemon-cli/restart-health.ts:81, e8f13c625e34)
  • schtasks_last_result_fix_already_on_main: Current main accepts Last Result as an alias for Last Run Result, matching the PR body's intent to preserve the existing scheduled-task parser fix. (src/daemon/schtasks.ts:190, e8f13c625e34)
  • pr_diff_adds_targeted_reachability_requirement: The provided PR diff adds requireReachability, checks reachability when that flag is set, forwards it through the waiter, and passes it on Windows gateway restart calls. (src/cli/daemon-cli/restart-health.ts:277, 6b1ccb6d6d23)
  • pr_test_covers_post_ready_probe_loss: The provided PR diff adds a regression test where Windows schtasks owns the port but the first loopback probe is down, and the waiter keeps polling until a later successful probe. (src/cli/daemon-cli/restart-health.test.ts:245, 6b1ccb6d6d23)

Likely related people:

  • vincentkoc: Provided related PR context shows merged PR fix(cli): tighten Windows restart policy-close health checks #72660 carried forward and tightened the Windows restart policy-close path that this PR preserves and extends; local current-main blame also routes the central restart-health files through Vincent Koc in this checkout. (role: recent maintainer for Windows restart-health policy-close behavior; confidence: high; commits: c9ece5df3786, c8fa0fd1c9d5; files: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/restart-health.test.ts, src/cli/daemon-cli/lifecycle.ts)
  • MarsDoge: Provided related PR context shows fix(cli): accept local policy closes during gateway restart health checks #48801 introduced the local policy-close restart-health approach later carried forward by fix(cli): tighten Windows restart policy-close health checks #72660 and referenced by this PR. (role: source contributor for prior policy-close restart-health work; confidence: medium; commits: e82bc552d094; files: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/restart-health.test.ts)
  • Peter Steinberger: Local history shows repeated adjacent restart-health work on timeout behavior, health verification, and shared restart-health test fixtures. (role: adjacent maintainer for restart-health timeout and test helper work; confidence: medium; commits: 905e355f651d, 80f430c2be2e, 5ad27fa25f73; files: src/cli/daemon-cli/restart-health.ts, src/cli/daemon-cli/restart-health.test.ts, src/cli/daemon-cli/lifecycle.ts)

Remaining risk / open question:

  • No live Windows Scheduled Task run was executed in this read-only review; merge should still wait for the targeted tests and changed gate after the changelog wording fix.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e8f13c625e34.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper Tracked by ClawSweeper automation cli CLI command changes size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants