fix(gateway): try SIGUSR1 restart before spawning external processes#68507
fix(gateway): try SIGUSR1 restart before spawning external processes#68507grimmjoww wants to merge 1 commit intoopenclaw:mainfrom
Conversation
The openclaw gateway restart command fails when called from inside the gateway because triggerOpenClawRestart() goes straight to platform-specific external restart (systemd/launchctl/schtasks), which spawns child processes that die when the parent gateway dies. The /restart text command already handles this correctly by trying SIGUSR1 first (in-process restart). This change makes the CLI path behave the same. Fixes: openclaw#40932 (gateway restart fails from exec tool) Fixes: openclaw#62442 (session state loss after restart)
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67c398f860
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (process.listenerCount("SIGUSR1") > 0) { | ||
| scheduleGatewaySigusr1Restart({ reason: "cli" }); | ||
| return { ok: true, method: "sigusr1", tried: [] }; |
There was a problem hiding this comment.
Keep schtasks restart path when SIGUSR1 listeners exist
This early return bypasses platform-specific restart logic whenever a SIGUSR1 listener is registered, which is true inside the gateway run loop. In the Windows supervised path (restartGatewayProcessWithFreshPid), that means triggerOpenClawRestart reports success via scheduled in-process SIGUSR1, the process exits as supervised, and the scheduled timer never fires; because relaunchGatewayScheduledTask is skipped, the Scheduled Task restart is not triggered and the gateway can stop instead of restarting.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR adds an early-return to Confidence Score: 5/5Safe to merge; logic is sound and the only finding is missing test coverage for the new branch. The SIGUSR1 listener-count guard correctly identifies in-gateway context (Node's --inspect handler is OS-level and invisible to listenerCount, so there's no false-positive risk). The RestartAttempt type union is properly extended. cleanStaleGatewayProcessesSync is correctly skipped on the in-process path. scheduleGatewaySigusr1Restart always returns ok:true so dropping its return value is harmless. The single P2 finding (no unit test for the new branch) does not affect runtime correctness. src/infra/restart.ts — new SIGUSR1 branch in triggerOpenClawRestart lacks a unit test. Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/infra/restart.ts
Line: 301-305
Comment:
**No test coverage for the new SIGUSR1 early-return branch**
`restart.test.ts` only exercises the stale-PID helpers, and `process-respawn.test.ts` mocks `triggerOpenClawRestart` entirely, so the new branch is never exercised by the test suite. A minimal unit test could register a temporary `SIGUSR1` listener on `process`, call `triggerOpenClawRestart()`, assert that the result has `method === "sigusr1"` and `ok === true`, and verify `scheduleGatewaySigusr1Restart` was called (mocked). This matches the pattern recommended by CLAUDE.md for narrowly scoped changes.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(gateway): try SIGUSR1 restart before..." | Re-trigger Greptile |
| // without the structural problem of spawning external processes that die with parent. | ||
| if (process.listenerCount("SIGUSR1") > 0) { | ||
| scheduleGatewaySigusr1Restart({ reason: "cli" }); | ||
| return { ok: true, method: "sigusr1", tried: [] }; | ||
| } |
There was a problem hiding this comment.
No test coverage for the new SIGUSR1 early-return branch
restart.test.ts only exercises the stale-PID helpers, and process-respawn.test.ts mocks triggerOpenClawRestart entirely, so the new branch is never exercised by the test suite. A minimal unit test could register a temporary SIGUSR1 listener on process, call triggerOpenClawRestart(), assert that the result has method === "sigusr1" and ok === true, and verify scheduleGatewaySigusr1Restart was called (mocked). This matches the pattern recommended by CLAUDE.md for narrowly scoped changes.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/restart.ts
Line: 301-305
Comment:
**No test coverage for the new SIGUSR1 early-return branch**
`restart.test.ts` only exercises the stale-PID helpers, and `process-respawn.test.ts` mocks `triggerOpenClawRestart` entirely, so the new branch is never exercised by the test suite. A minimal unit test could register a temporary `SIGUSR1` listener on `process`, call `triggerOpenClawRestart()`, assert that the result has `method === "sigusr1"` and `ok === true`, and verify `scheduleGatewaySigusr1Restart` was called (mocked). This matches the pattern recommended by CLAUDE.md for narrowly scoped changes.
How can I resolve this? If you propose a fix, please make it concise.|
Closing this PR — Codex's P1 is correct: on Windows, the gateway run-loop always registers a SIGUSR1 listener, so the early-return fires when After re-reading the call graph, the root cause is structural, not platform-specific: Thanks to @codex and @greptile-bot for the review. |
|
To use Codex here, create a Codex account and connect to github. |
|
Thanks for the thorough write-up — the call graph analysis makes the structural issue clear. The launchd parallel ( Tip: You can customize Greptile's behavior for this repo with |
…htasks Addresses openclaw#40932 (gateway restart fails from exec tool) without introducing the Windows regression that caused closed PR openclaw#68507 to be reverted. Two structural changes: * Add a SIGUSR1 fast path to `triggerOpenClawRestart()`. When a SIGUSR1 listener is registered (i.e. the gateway run-loop is up), route the restart through `scheduleGatewaySigusr1Restart` instead of spawning an external `systemctl` / `launchctl` / `schtasks` process. This restores the openclaw#40932 fix for direct callers such as the exec tool, which got no such protection previously. * Route the Windows schtasks supervisor-handoff in `restartGatewayProcessWithFreshPid` directly to `relaunchGatewayScheduledTask`, mirroring the launchd handling. Going through `triggerOpenClawRestart` from that codepath would re-enter the SIGUSR1 fast path from inside the SIGUSR1 drain/exit sequence, skipping `relaunchGatewayScheduledTask` — which is the only thing that tells the Windows Scheduled Task to start a new instance after exit. The gateway would stop instead of restart. This was Codex's P1 on closed PR openclaw#68507. Tests: * `process-respawn.test.ts`: update the Windows supervised test to assert `relaunchGatewayScheduledTask` is called directly with `process.env`; add a test covering schtasks failure propagation. * `restart-trigger.test.ts` (new): cover the SIGUSR1 fast path on linux and win32, plus the platform-dispatch fall-through when no listener is registered (addresses Greptile's P2 from openclaw#68507). AI-assisted: drafted and implemented with Claude Code. Testing level: lightly tested (targeted unit tests pass; no live Windows schtasks repro available from this machine — the fix is validated structurally by ensuring the schtasks branch of process-respawn no longer re-enters `triggerOpenClawRestart`). Refs: openclaw#40932, closed openclaw#68507 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
See the diff