Skip to content

fix(daemon): address clanker review findings for kickstart restart#43555

Merged
hydro13 merged 1 commit intoopenclaw:mainfrom
hydro13:fix/launchd-bug-fixes
Mar 12, 2026
Merged

fix(daemon): address clanker review findings for kickstart restart#43555
hydro13 merged 1 commit intoopenclaw:mainfrom
hydro13:fix/launchd-bug-fixes

Conversation

@hydro13
Copy link
Copy Markdown
Member

@hydro13 hydro13 commented Mar 12, 2026

Follows up on #43514 (closed by barnacle — branch was dirty).

What this fixes

Two findings from Peter's clanker review on the original kickstart PR:

Bug 1 (high) — Race condition in handoff helper

The kickstart and start-after-exit shell scripts used sleep 1 before issuing launchctl kickstart -k. This caused two failure modes:

  • Health checks in runServiceRestart ran against the old gateway process before the restart completed
  • The caller could be killed mid health-check when the delayed kickstart finally fired

Fix: Pass process.pid as $4 to the detached helper. The script now polls kill -0 $caller_pid in 0.1 s increments and only issues kickstart -k after the caller has exited.

Bug 2 (medium) — Overly broad bootstrap fallback

In restartLaunchAgent(), any non-zero kickstart -k result triggered enable + bootstrap, including errors unrelated to a missing plist registration. When the job was still registered, bootstrap would fail with "service already registered", turning the original error into an obscured secondary failure.

Fix: Gate the enable + bootstrap path on isLaunchctlNotLoaded(). Non-missing-job kickstart errors now re-throw immediately, preserving the original diagnostic.

Scope

Only daemon/launchd files changed — no logic outside the restart path.

Fixes #43311, #43406, #43035, #43049

Bug 1 (high): replace fixed sleep 1 with caller-PID polling in both
kickstart and start-after-exit handoff modes. The helper now waits until
kill -0 $caller_pid fails (process exited) before issuing launchctl
kickstart -k, eliminating the race where health checks ran against the
old gateway process.

Bug 2 (medium): gate enable+bootstrap fallback on isLaunchctlNotLoaded().
Only attempt re-registration when kickstart -k fails because the job is
absent; all other kickstart failures now re-throw the original error so
debugging is not obscured by a secondary bootstrap attempt.

Follows up on 3c0fd3d.
@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: S maintainer Maintainer-authored PR labels Mar 12, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 12, 2026

Greptile Summary

This PR addresses two clanker review findings from the prior kickstart PR: it replaces the fixed sleep 1 race in the detached handoff helper with a proper PID-polling loop, and it gates the enable + bootstrap recovery path on isLaunchctlNotLoaded() so unrelated kickstart errors are surfaced directly rather than being masked by a secondary bootstrap failure. The new GatewayServiceRestartResult discriminated union cleanly propagates the "scheduled" vs "completed" distinction up through the service abstraction and into the CLI restart path.

Key changes:

  • launchd-restart-handoff.ts: shared waitForCallerPid snippet replaces sleep 1 (kickstart) and the old attempt-capped loop (start-after-exit); both modes now poll kill -0 $caller_pid at 0.1 s intervals
  • launchd.ts: isLaunchctlNotLoaded guard added before the enable + bootstrap fallback; restartLaunchAgent returns GatewayServiceRestartResult
  • lifecycle-core.ts: "scheduled" outcome short-circuits post-restart health checks and emits a distinct JSON result
  • schtasks.ts, systemd.ts: updated to return { outcome: "completed" } to satisfy the new type

Notable concern:
The original start-after-exit implementation had a deliberate 100-attempt (~10 s) cap on the PID-poll loop to prevent a runaway detached helper if the caller never exits. The refactored shared snippet removes that cap entirely. In normal operation both callers (restartGatewayProcessWithFreshPid and restartLaunchAgent) pass process.pid and exit promptly, so this is benign in practice — but the safeguard was intentional and its removal is unexplained.

Confidence Score: 4/5

  • Safe to merge with a low-risk regression: the removed 100-attempt cap leaves both detached helper modes with an unbounded poll loop, but both callers reliably pass a valid PID that exits promptly in practice.
  • The core logic fixes are correct and well-tested. The isLaunchctlNotLoaded guard cleanly addresses Bug 2, and the PID-polling approach for Bug 1 is sound. The one substantive concern is the silent removal of the timeout cap that was deliberately added to start-after-exit — it leaves an edge-case infinite loop potential in the detached shell helper. This is not a merge-blocker but is worth restoring for robustness.
  • src/daemon/launchd-restart-handoff.ts — the shared waitForCallerPid loop has no upper-bound on iterations; consider restoring the 100-attempt safety cap.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.ts
Line: 67-69

Comment:
**Unbounded spin loop — timeout guard was intentionally removed**

The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap:

```shell
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 100 ]; then
    break
  fi
  sleep 0.1
done
```

By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path.

In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:

```shell
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
  attempts=0
  while kill -0 "$wait_pid" >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 100 ]; then
      break
    fi
    sleep 0.1
  done
fi
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.test.ts
Line: 18-43

Comment:
**`start-after-exit` mode has no test coverage here**

The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm:
- `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel)
- `args[6]` carries the `waitForPid` value
- The generated script includes `launchctl start` (not `kickstart`)
- `sleep 1` is absent

This guards against accidental divergence if the two branches are ever split again.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 838c528

Comment on lines +67 to +69
while kill -0 "$wait_pid" >/dev/null 2>&1; do
sleep 0.1
done
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded spin loop — timeout guard was intentionally removed

The previous start-after-exit implementation had an explicit 100-attempt (~10 s) safety cap:

attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 100 ]; then
    break
  fi
  sleep 0.1
done

By sharing a single waitForCallerPid snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in sleep 0.1 forever — accumulating zombie helper processes with no self-termination path.

In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:

wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
  attempts=0
  while kill -0 "$wait_pid" >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 100 ]; then
      break
    fi
    sleep 0.1
  done
fi
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.ts
Line: 67-69

Comment:
**Unbounded spin loop — timeout guard was intentionally removed**

The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap:

```shell
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 100 ]; then
    break
  fi
  sleep 0.1
done
```

By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path.

In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:

```shell
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
  attempts=0
  while kill -0 "$wait_pid" >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 100 ]; then
      break
    fi
    sleep 0.1
  done
fi
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +18 to +43
describe("scheduleDetachedLaunchdRestartHandoff", () => {
it("waits for the caller pid before kickstarting launchd", () => {
const env = {
HOME: "/Users/test",
OPENCLAW_PROFILE: "default",
};
spawnMock.mockReturnValue({ pid: 4242, unref: unrefMock });

const result = scheduleDetachedLaunchdRestartHandoff({
env,
mode: "kickstart",
waitForPid: 9876,
});

expect(result).toEqual({ ok: true, pid: 4242 });
expect(spawnMock).toHaveBeenCalledTimes(1);
const [, args] = spawnMock.mock.calls[0] as [string, string[]];
expect(args[0]).toBe("-c");
expect(args[2]).toBe("openclaw-launchd-restart-handoff");
expect(args[6]).toBe("9876");
expect(args[1]).toContain('while kill -0 "$wait_pid" >/dev/null 2>&1; do');
expect(args[1]).toContain('launchctl kickstart -k "$service_target" >/dev/null 2>&1');
expect(args[1]).not.toContain("sleep 1");
expect(unrefMock).toHaveBeenCalledTimes(1);
});
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start-after-exit mode has no test coverage here

The new launchd-restart-handoff.test.ts only exercises the kickstart mode. Because the waitForCallerPid snippet is now shared between both modes, it would be worth adding a parallel test for mode: "start-after-exit" to confirm:

  • args[2] is "openclaw-launchd-restart-handoff" (same sentinel)
  • args[6] carries the waitForPid value
  • The generated script includes launchctl start (not kickstart)
  • sleep 1 is absent

This guards against accidental divergence if the two branches are ever split again.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.test.ts
Line: 18-43

Comment:
**`start-after-exit` mode has no test coverage here**

The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm:
- `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel)
- `args[6]` carries the `waitForPid` value
- The generated script includes `launchctl start` (not `kickstart`)
- `sleep 1` is absent

This guards against accidental divergence if the two branches are ever split again.

How can I resolve this? If you propose a fix, please make it concise.

@hydro13 hydro13 merged commit 841ee24 into openclaw:main Mar 12, 2026
32 of 33 checks passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 838c528898

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +67 to +69
while kill -0 "$wait_pid" >/dev/null 2>&1; do
sleep 0.1
done
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Bound caller-PID wait loop to avoid restart deadlock

The detached handoff script now waits forever in while kill -0 "$wait_pid" with no timeout. If the caller exits before this loop begins and that PID is quickly reused by an unrelated long-lived process, kill -0 remains true and the script never reaches launchctl kickstart/start, so the restart can be skipped indefinitely. The prior implementation had a bounded wait, so this is a regression in restart reliability under PID reuse or non-exiting caller scenarios.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes gateway Gateway runtime maintainer Maintainer-authored PR size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gateway restart from agent session causes self-decapitation on macOS (launchd)

1 participant