fix(daemon): address clanker review findings for kickstart restart by hydro13 · Pull Request #43555 · openclaw/openclaw

hydro13 · 2026-03-12T01:06:16Z

Follows up on #43514 (closed by barnacle — branch was dirty).

What this fixes

Two findings from Peter's clanker review on the original kickstart PR:

Bug 1 (high) — Race condition in handoff helper

The kickstart and start-after-exit shell scripts used sleep 1 before issuing launchctl kickstart -k. This caused two failure modes:

Health checks in runServiceRestart ran against the old gateway process before the restart completed
The caller could be killed mid health-check when the delayed kickstart finally fired

Fix: Pass process.pid as $4 to the detached helper. The script now polls kill -0 $caller_pid in 0.1 s increments and only issues kickstart -k after the caller has exited.

Bug 2 (medium) — Overly broad bootstrap fallback

In restartLaunchAgent(), any non-zero kickstart -k result triggered enable + bootstrap, including errors unrelated to a missing plist registration. When the job was still registered, bootstrap would fail with "service already registered", turning the original error into an obscured secondary failure.

Fix: Gate the enable + bootstrap path on isLaunchctlNotLoaded(). Non-missing-job kickstart errors now re-throw immediately, preserving the original diagnostic.

Scope

Only daemon/launchd files changed — no logic outside the restart path.

Fixes #43311, #43406, #43035, #43049

Bug 1 (high): replace fixed sleep 1 with caller-PID polling in both kickstart and start-after-exit handoff modes. The helper now waits until kill -0 $caller_pid fails (process exited) before issuing launchctl kickstart -k, eliminating the race where health checks ran against the old gateway process. Bug 2 (medium): gate enable+bootstrap fallback on isLaunchctlNotLoaded(). Only attempt re-registration when kickstart -k fails because the job is absent; all other kickstart failures now re-throw the original error so debugging is not obscured by a secondary bootstrap attempt. Follows up on 3c0fd3d.

greptile-apps · 2026-03-12T01:09:45Z

Greptile Summary

This PR addresses two clanker review findings from the prior kickstart PR: it replaces the fixed sleep 1 race in the detached handoff helper with a proper PID-polling loop, and it gates the enable + bootstrap recovery path on isLaunchctlNotLoaded() so unrelated kickstart errors are surfaced directly rather than being masked by a secondary bootstrap failure. The new GatewayServiceRestartResult discriminated union cleanly propagates the "scheduled" vs "completed" distinction up through the service abstraction and into the CLI restart path.

Key changes:

launchd-restart-handoff.ts: shared waitForCallerPid snippet replaces sleep 1 (kickstart) and the old attempt-capped loop (start-after-exit); both modes now poll kill -0 $caller_pid at 0.1 s intervals
launchd.ts: isLaunchctlNotLoaded guard added before the enable + bootstrap fallback; restartLaunchAgent returns GatewayServiceRestartResult
lifecycle-core.ts: "scheduled" outcome short-circuits post-restart health checks and emits a distinct JSON result
schtasks.ts, systemd.ts: updated to return { outcome: "completed" } to satisfy the new type

Notable concern:
The original start-after-exit implementation had a deliberate 100-attempt (~10 s) cap on the PID-poll loop to prevent a runaway detached helper if the caller never exits. The refactored shared snippet removes that cap entirely. In normal operation both callers (restartGatewayProcessWithFreshPid and restartLaunchAgent) pass process.pid and exit promptly, so this is benign in practice — but the safeguard was intentional and its removal is unexplained.

Confidence Score: 4/5

Safe to merge with a low-risk regression: the removed 100-attempt cap leaves both detached helper modes with an unbounded poll loop, but both callers reliably pass a valid PID that exits promptly in practice.
The core logic fixes are correct and well-tested. The isLaunchctlNotLoaded guard cleanly addresses Bug 2, and the PID-polling approach for Bug 1 is sound. The one substantive concern is the silent removal of the timeout cap that was deliberately added to start-after-exit — it leaves an edge-case infinite loop potential in the detached shell helper. This is not a merge-blocker but is worth restoring for robustness.
src/daemon/launchd-restart-handoff.ts — the shared waitForCallerPid loop has no upper-bound on iterations; consider restoring the 100-attempt safety cap.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.ts
Line: 67-69

Comment:
**Unbounded spin loop — timeout guard was intentionally removed**

The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap:

```shell
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 100 ]; then
    break
  fi
  sleep 0.1
done
```

By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path.

In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:

```shell
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
  attempts=0
  while kill -0 "$wait_pid" >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 100 ]; then
      break
    fi
    sleep 0.1
  done
fi
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.test.ts
Line: 18-43

Comment:
**`start-after-exit` mode has no test coverage here**

The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm:
- `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel)
- `args[6]` carries the `waitForPid` value
- The generated script includes `launchctl start` (not `kickstart`)
- `sleep 1` is absent

This guards against accidental divergence if the two branches are ever split again.

How can I resolve this? If you propose a fix, please make it concise.

_{Last reviewed commit: 838c528}

greptile-apps · 2026-03-12T01:09:49Z

+  while kill -0 "$wait_pid" >/dev/null 2>&1; do
+    sleep 0.1
+  done


Unbounded spin loop — timeout guard was intentionally removed

The previous start-after-exit implementation had an explicit 100-attempt (~10 s) safety cap:

attempts=0 while kill -0 "$wait_pid" >/dev/null 2>&1; do attempts=$((attempts + 1)) if [ "$attempts" -ge 100 ]; then break fi sleep 0.1 done

By sharing a single waitForCallerPid snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in sleep 0.1 forever — accumulating zombie helper processes with no self-termination path.

In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:

wait_pid="$4" if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then attempts=0 while kill -0 "$wait_pid" >/dev/null 2>&1; do attempts=$((attempts + 1)) if [ "$attempts" -ge 100 ]; then break fi sleep 0.1 done fi

Prompt To Fix With AI

This is a comment left during a code review. Path: src/daemon/launchd-restart-handoff.ts Line: 67-69 Comment: **Unbounded spin loop — timeout guard was intentionally removed** The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap: ```shell attempts=0 while kill -0 "$wait_pid" >/dev/null 2>&1; do attempts=$((attempts + 1)) if [ "$attempts" -ge 100 ]; then break fi sleep 0.1 done ``` By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path. In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.: ```shell wait_pid="$4" if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then attempts=0 while kill -0 "$wait_pid" >/dev/null 2>&1; do attempts=$((attempts + 1)) if [ "$attempts" -ge 100 ]; then break fi sleep 0.1 done fi ``` How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-03-12T01:09:50Z

+describe("scheduleDetachedLaunchdRestartHandoff", () => {
+  it("waits for the caller pid before kickstarting launchd", () => {
+    const env = {
+      HOME: "/Users/test",
+      OPENCLAW_PROFILE: "default",
+    };
+    spawnMock.mockReturnValue({ pid: 4242, unref: unrefMock });
+
+    const result = scheduleDetachedLaunchdRestartHandoff({
+      env,
+      mode: "kickstart",
+      waitForPid: 9876,
+    });
+
+    expect(result).toEqual({ ok: true, pid: 4242 });
+    expect(spawnMock).toHaveBeenCalledTimes(1);
+    const [, args] = spawnMock.mock.calls[0] as [string, string[]];
+    expect(args[0]).toBe("-c");
+    expect(args[2]).toBe("openclaw-launchd-restart-handoff");
+    expect(args[6]).toBe("9876");
+    expect(args[1]).toContain('while kill -0 "$wait_pid" >/dev/null 2>&1; do');
+    expect(args[1]).toContain('launchctl kickstart -k "$service_target" >/dev/null 2>&1');
+    expect(args[1]).not.toContain("sleep 1");
+    expect(unrefMock).toHaveBeenCalledTimes(1);
+  });
+});


start-after-exit mode has no test coverage here

The new launchd-restart-handoff.test.ts only exercises the kickstart mode. Because the waitForCallerPid snippet is now shared between both modes, it would be worth adding a parallel test for mode: "start-after-exit" to confirm:

args[2] is "openclaw-launchd-restart-handoff" (same sentinel)

args[6] carries the waitForPid value

The generated script includes launchctl start (not kickstart)

sleep 1 is absent

This guards against accidental divergence if the two branches are ever split again.

Prompt To Fix With AI

This is a comment left during a code review. Path: src/daemon/launchd-restart-handoff.test.ts Line: 18-43 Comment: **`start-after-exit` mode has no test coverage here** The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm: - `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel) - `args[6]` carries the `waitForPid` value - The generated script includes `launchctl start` (not `kickstart`) - `sleep 1` is absent This guards against accidental divergence if the two branches are ever split again. How can I resolve this? If you propose a fix, please make it concise.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 838c528898

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-03-12T01:29:15Z

+  while kill -0 "$wait_pid" >/dev/null 2>&1; do
+    sleep 0.1
+  done


Bound caller-PID wait loop to avoid restart deadlock

The detached handoff script now waits forever in while kill -0 "$wait_pid" with no timeout. If the caller exits before this loop begins and that PID is quickly reused by an unrelated long-lived process, kill -0 remains true and the script never reaches launchctl kickstart/start, so the restart can be skipped indefinitely. The prior implementation had a bounded wait, so this is a regression in restart reliability under PID reuse or non-exiting caller scenarios.

Useful? React with 👍 / 👎.

openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: S maintainer Maintainer-authored PR labels Mar 12, 2026

greptile-apps Bot reviewed Mar 12, 2026

View reviewed changes

hydro13 merged commit 841ee24 into openclaw:main Mar 12, 2026
32 of 33 checks passed

chatgpt-codex-connector Bot reviewed Mar 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(daemon): address clanker review findings for kickstart restart#43555

fix(daemon): address clanker review findings for kickstart restart#43555
hydro13 merged 1 commit intoopenclaw:mainfrom
hydro13:fix/launchd-bug-fixes

hydro13 commented Mar 12, 2026

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Uh oh!

greptile-apps Bot Mar 12, 2026

Uh oh!

greptile-apps Bot Mar 12, 2026

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Mar 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

hydro13 commented Mar 12, 2026

What this fixes

Bug 1 (high) — Race condition in handoff helper

Bug 2 (medium) — Overly broad bootstrap fallback

Scope

Uh oh!

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

Confidence Score: 4/5

Uh oh!

greptile-apps Bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Mar 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant