fix(daemon): address clanker review findings for kickstart restart#43555
fix(daemon): address clanker review findings for kickstart restart#43555hydro13 merged 1 commit intoopenclaw:mainfrom
Conversation
Bug 1 (high): replace fixed sleep 1 with caller-PID polling in both kickstart and start-after-exit handoff modes. The helper now waits until kill -0 $caller_pid fails (process exited) before issuing launchctl kickstart -k, eliminating the race where health checks ran against the old gateway process. Bug 2 (medium): gate enable+bootstrap fallback on isLaunchctlNotLoaded(). Only attempt re-registration when kickstart -k fails because the job is absent; all other kickstart failures now re-throw the original error so debugging is not obscured by a secondary bootstrap attempt. Follows up on 3c0fd3d.
Greptile SummaryThis PR addresses two clanker review findings from the prior kickstart PR: it replaces the fixed Key changes:
Notable concern: Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.ts
Line: 67-69
Comment:
**Unbounded spin loop — timeout guard was intentionally removed**
The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap:
```shell
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
done
```
By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path.
In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:
```shell
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
done
fi
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.test.ts
Line: 18-43
Comment:
**`start-after-exit` mode has no test coverage here**
The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm:
- `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel)
- `args[6]` carries the `waitForPid` value
- The generated script includes `launchctl start` (not `kickstart`)
- `sleep 1` is absent
This guards against accidental divergence if the two branches are ever split again.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: 838c528 |
| while kill -0 "$wait_pid" >/dev/null 2>&1; do | ||
| sleep 0.1 | ||
| done |
There was a problem hiding this comment.
Unbounded spin loop — timeout guard was intentionally removed
The previous start-after-exit implementation had an explicit 100-attempt (~10 s) safety cap:
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
doneBy sharing a single waitForCallerPid snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in sleep 0.1 forever — accumulating zombie helper processes with no self-termination path.
In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
done
fiPrompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.ts
Line: 67-69
Comment:
**Unbounded spin loop — timeout guard was intentionally removed**
The previous `start-after-exit` implementation had an explicit 100-attempt (~10 s) safety cap:
```shell
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
done
```
By sharing a single `waitForCallerPid` snippet between both modes, that cap is gone. If the caller process never exits (hung signal handler, unexpected blocking, or even PID reuse in a long-lived container), the detached helper will spin in `sleep 0.1` forever — accumulating zombie helper processes with no self-termination path.
In normal operation the caller exits quickly, so this is benign; but the original cap was a deliberate defensive measure. Consider restoring a maximum iteration guard, e.g.:
```shell
wait_pid="$4"
if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
attempts=0
while kill -0 "$wait_pid" >/dev/null 2>&1; do
attempts=$((attempts + 1))
if [ "$attempts" -ge 100 ]; then
break
fi
sleep 0.1
done
fi
```
How can I resolve this? If you propose a fix, please make it concise.| describe("scheduleDetachedLaunchdRestartHandoff", () => { | ||
| it("waits for the caller pid before kickstarting launchd", () => { | ||
| const env = { | ||
| HOME: "/Users/test", | ||
| OPENCLAW_PROFILE: "default", | ||
| }; | ||
| spawnMock.mockReturnValue({ pid: 4242, unref: unrefMock }); | ||
|
|
||
| const result = scheduleDetachedLaunchdRestartHandoff({ | ||
| env, | ||
| mode: "kickstart", | ||
| waitForPid: 9876, | ||
| }); | ||
|
|
||
| expect(result).toEqual({ ok: true, pid: 4242 }); | ||
| expect(spawnMock).toHaveBeenCalledTimes(1); | ||
| const [, args] = spawnMock.mock.calls[0] as [string, string[]]; | ||
| expect(args[0]).toBe("-c"); | ||
| expect(args[2]).toBe("openclaw-launchd-restart-handoff"); | ||
| expect(args[6]).toBe("9876"); | ||
| expect(args[1]).toContain('while kill -0 "$wait_pid" >/dev/null 2>&1; do'); | ||
| expect(args[1]).toContain('launchctl kickstart -k "$service_target" >/dev/null 2>&1'); | ||
| expect(args[1]).not.toContain("sleep 1"); | ||
| expect(unrefMock).toHaveBeenCalledTimes(1); | ||
| }); | ||
| }); |
There was a problem hiding this comment.
start-after-exit mode has no test coverage here
The new launchd-restart-handoff.test.ts only exercises the kickstart mode. Because the waitForCallerPid snippet is now shared between both modes, it would be worth adding a parallel test for mode: "start-after-exit" to confirm:
args[2]is"openclaw-launchd-restart-handoff"(same sentinel)args[6]carries thewaitForPidvalue- The generated script includes
launchctl start(notkickstart) sleep 1is absent
This guards against accidental divergence if the two branches are ever split again.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/launchd-restart-handoff.test.ts
Line: 18-43
Comment:
**`start-after-exit` mode has no test coverage here**
The new `launchd-restart-handoff.test.ts` only exercises the `kickstart` mode. Because the `waitForCallerPid` snippet is now shared between both modes, it would be worth adding a parallel test for `mode: "start-after-exit"` to confirm:
- `args[2]` is `"openclaw-launchd-restart-handoff"` (same sentinel)
- `args[6]` carries the `waitForPid` value
- The generated script includes `launchctl start` (not `kickstart`)
- `sleep 1` is absent
This guards against accidental divergence if the two branches are ever split again.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 838c528898
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| while kill -0 "$wait_pid" >/dev/null 2>&1; do | ||
| sleep 0.1 | ||
| done |
There was a problem hiding this comment.
Bound caller-PID wait loop to avoid restart deadlock
The detached handoff script now waits forever in while kill -0 "$wait_pid" with no timeout. If the caller exits before this loop begins and that PID is quickly reused by an unrelated long-lived process, kill -0 remains true and the script never reaches launchctl kickstart/start, so the restart can be skipped indefinitely. The prior implementation had a bounded wait, so this is a regression in restart reliability under PID reuse or non-exiting caller scenarios.
Useful? React with 👍 / 👎.
Follows up on #43514 (closed by barnacle — branch was dirty).
What this fixes
Two findings from Peter's clanker review on the original kickstart PR:
Bug 1 (high) — Race condition in handoff helper
The
kickstartandstart-after-exitshell scripts usedsleep 1before issuinglaunchctl kickstart -k. This caused two failure modes:runServiceRestartran against the old gateway process before the restart completedFix: Pass
process.pidas$4to the detached helper. The script now pollskill -0 $caller_pidin 0.1 s increments and only issueskickstart -kafter the caller has exited.Bug 2 (medium) — Overly broad bootstrap fallback
In
restartLaunchAgent(), any non-zerokickstart -kresult triggeredenable + bootstrap, including errors unrelated to a missing plist registration. When the job was still registered,bootstrapwould fail with "service already registered", turning the original error into an obscured secondary failure.Fix: Gate the
enable + bootstrappath onisLaunchctlNotLoaded(). Non-missing-job kickstart errors now re-throw immediately, preserving the original diagnostic.Scope
Only daemon/launchd files changed — no logic outside the restart path.
Fixes #43311, #43406, #43035, #43049