fix(systemd,gateway): resolve orphan subprocesses and port 18789 hangs#38463
fix(systemd,gateway): resolve orphan subprocesses and port 18789 hangs#38463spirittechie wants to merge 4 commits intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9dce43af0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @@ -201,6 +202,12 @@ async function runGatewayCommand(opts: GatewayRunOpts) { | |||
| defaultRuntime.exit(1); | |||
| return; | |||
| } | |||
| if (process.env.OPENCLAW_SERVICE_MODE?.trim()) { | |||
There was a problem hiding this comment.
Gate stale-pid cleanup on an env var that is actually set
This new branch is effectively dead code because it checks OPENCLAW_SERVICE_MODE, but the service environment builder sets OPENCLAW_SERVICE_MARKER/OPENCLAW_SERVICE_KIND/OPENCLAW_SERVICE_VERSION and never defines OPENCLAW_SERVICE_MODE (repo-wide search only finds this single reference). In systemd/launchd runs, cleanStaleGatewayProcessesSync() therefore never executes, so the restart path still races stale listeners and the intended port-18789 fix does not apply.
Useful? React with 👍 / 👎.
| const useDetached = process.platform !== "win32"; | ||
| // Keep children attached to the parent process group so service managers | ||
| // (systemd/launchd) can reliably terminate the full tree on restart. | ||
| const useDetached = false; |
There was a problem hiding this comment.
Preserve detached process groups for tree-kill semantics
For POSIX child runs, forcing useDetached = false breaks the assumptions in killProcessTree: it signals -pid to terminate a process group and only reaches descendants when the child is group leader. With non-detached children, that group usually does not exist and the fallback kills only the direct child, so grandchildren spawned by supervised commands can survive cancel/restart and become orphaned.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR addresses orphan subprocess and port-binding issues caused by detached child processes escaping the systemd cgroup on restarts. The three-file change covers: (1) making all child spawns non-detached so systemd's Key issues found:
Confidence Score: 2/5
Last reviewed commit: d9dce43 |
| if (process.env.OPENCLAW_SERVICE_MODE?.trim()) { | ||
| const stale = cleanStaleGatewayProcessesSync(); | ||
| if (stale.length > 0) { | ||
| gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`); | ||
| } | ||
| } |
There was a problem hiding this comment.
Stale PID cleanup scans wrong port when --port is overridden
cleanStaleGatewayProcessesSync() resolves the gateway port internally via resolveGatewayPort(undefined, process.env) (from config/paths.js), completely independent of the port that will actually be bound. The port used by the gateway is resolved a few lines later as portOverride ?? resolveGatewayPort(cfg), where portOverride comes from the CLI --port option.
If OPENCLAW_SERVICE_MODE is set and the gateway is started on a non-default port (e.g. --port 9999), the stale process scan will run against the default port (e.g. 18789), miss any stale processes actually holding port 9999, and the gateway will still hit EADDRINUSE — exactly the problem this fix is trying to solve.
The port needs to be resolved before calling cleanStaleGatewayProcessesSync, and the resolved port should be passed in rather than re-derived inside the function. For example:
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
const stale = cleanStaleGatewayProcessesSync(port); // pass the already-resolved port
if (stale.length > 0) {
gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`);
}
}This requires cleanStaleGatewayProcessesSync to accept an optional port parameter, falling back to its own resolution only when not provided.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run.ts
Line: 205-210
Comment:
**Stale PID cleanup scans wrong port when `--port` is overridden**
`cleanStaleGatewayProcessesSync()` resolves the gateway port internally via `resolveGatewayPort(undefined, process.env)` (from `config/paths.js`), completely independent of the port that will actually be bound. The port used by the gateway is resolved a few lines later as `portOverride ?? resolveGatewayPort(cfg)`, where `portOverride` comes from the CLI `--port` option.
If `OPENCLAW_SERVICE_MODE` is set and the gateway is started on a non-default port (e.g. `--port 9999`), the stale process scan will run against the default port (e.g. `18789`), miss any stale processes actually holding port `9999`, and the gateway will still hit `EADDRINUSE` — exactly the problem this fix is trying to solve.
The port needs to be resolved before calling `cleanStaleGatewayProcessesSync`, and the resolved port should be passed in rather than re-derived inside the function. For example:
```ts
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
const stale = cleanStaleGatewayProcessesSync(port); // pass the already-resolved port
if (stale.length > 0) {
gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`);
}
}
```
This requires `cleanStaleGatewayProcessesSync` to accept an optional port parameter, falling back to its own resolution only when not provided.
How can I resolve this? If you propose a fix, please make it concise.| if (signal === "SIGTERM") { | ||
| if (pid) { | ||
| killProcessTree(pid, { graceMs: 10_000 }); | ||
| } else { | ||
| try { | ||
| child.kill("SIGTERM"); | ||
| } catch { | ||
| // ignore kill errors | ||
| } | ||
| } | ||
| return; | ||
| } |
There was a problem hiding this comment.
killProcessTree with detached: false cannot kill grandchildren outside systemd cgroups
With useDetached = false, spawned children are not process group leaders — they inherit the parent's PGID rather than forming a new one. When killProcessTree(pid, { graceMs: 10_000 }) runs on Unix (killProcessTreeUnix), it first tries process.kill(-pid, "SIGTERM") (signal to process group PGID=pid). Because the child is not a group leader, PGID=childPid doesn't exist, so this throws ESRCH and falls through to process.kill(pid, "SIGTERM") — which only signals the direct child.
Ten seconds later the escalation path has the same problem: isProcessAlive(-pid) returns false (ESRCH again), so it falls through to process.kill(pid, "SIGKILL"), which again only reaches the direct child.
Practical impact: Any grandchild processes (children of the supervised child) will be left alive and become true orphans when the supervisor kills a direct child in a non-systemd context (macOS/launchd, manual process management, CI environments). This is a regression compared to the previous detached: true behaviour on POSIX, where kill(-pid, SIGTERM) reliably reached the whole subtree because each child had its own process group.
For the pure systemd case this is acceptable because KillMode=control-group terminates the whole cgroup regardless. However, the adapter is general-purpose code used beyond the systemd path, and the comment only mentions "service managers" without acknowledging this tradeoff. At minimum, the limitation should be documented, and callers in non-cgroup environments should ensure they have an alternative termination strategy for deeper process trees.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/supervisor/adapters/child.ts
Line: 120-131
Comment:
**`killProcessTree` with `detached: false` cannot kill grandchildren outside systemd cgroups**
With `useDetached = false`, spawned children are not process group leaders — they inherit the parent's PGID rather than forming a new one. When `killProcessTree(pid, { graceMs: 10_000 })` runs on Unix (`killProcessTreeUnix`), it first tries `process.kill(-pid, "SIGTERM")` (signal to process group PGID=pid). Because the child is not a group leader, PGID=childPid doesn't exist, so this throws ESRCH and falls through to `process.kill(pid, "SIGTERM")` — which only signals the direct child.
Ten seconds later the escalation path has the same problem: `isProcessAlive(-pid)` returns `false` (ESRCH again), so it falls through to `process.kill(pid, "SIGKILL")`, which again only reaches the direct child.
**Practical impact**: Any grandchild processes (children of the supervised child) will be left alive and become true orphans when the supervisor kills a direct child in a non-systemd context (macOS/launchd, manual process management, CI environments). This is a regression compared to the previous `detached: true` behaviour on POSIX, where `kill(-pid, SIGTERM)` reliably reached the whole subtree because each child had its own process group.
For the pure systemd case this is acceptable because `KillMode=control-group` terminates the whole cgroup regardless. However, the adapter is general-purpose code used beyond the systemd path, and the comment only mentions "service managers" without acknowledging this tradeoff. At minimum, the limitation should be documented, and callers in non-cgroup environments should ensure they have an alternative termination strategy for deeper process trees.
How can I resolve this? If you propose a fix, please make it concise.| @@ -126,6 +117,18 @@ export async function createChildAdapter(params: { | |||
|
|
|||
| const kill = (signal?: NodeJS.Signals) => { | |||
| const pid = child.pid ?? undefined; | |||
| if (signal === "SIGTERM") { | |||
| if (pid) { | |||
| killProcessTree(pid, { graceMs: 10_000 }); | |||
There was a problem hiding this comment.
graceMs: 10_000 conflicts with TimeoutStopSec=30 in a multi-child scenario
killProcessTree is called with a 10-second grace period (vs the 3-second default). With TimeoutStopSec=30 in the systemd unit, systemd will force-kill the cgroup 30 seconds after sending SIGTERM. If the supervisor stops N children in sequence (each taking up to 10 seconds to escalate to SIGKILL), the total grace time could approach or exceed systemd's TimeoutStopSec, causing systemd to force-kill mid-cleanup and potentially leaving state inconsistent.
Consider either reducing the grace period to be proportionally safe (e.g. the existing 3-second default), or stopping children in parallel so the maximum blocking time is bounded by a single grace window rather than N × 10 seconds.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/supervisor/adapters/child.ts
Line: 122
Comment:
**`graceMs: 10_000` conflicts with `TimeoutStopSec=30` in a multi-child scenario**
`killProcessTree` is called with a 10-second grace period (vs the 3-second default). With `TimeoutStopSec=30` in the systemd unit, systemd will force-kill the cgroup 30 seconds after sending SIGTERM. If the supervisor stops N children in sequence (each taking up to 10 seconds to escalate to SIGKILL), the total grace time could approach or exceed systemd's `TimeoutStopSec`, causing systemd to force-kill mid-cleanup and potentially leaving state inconsistent.
Consider either reducing the grace period to be proportionally safe (e.g. the existing 3-second default), or stopping children in parallel so the maximum blocking time is bounded by a single grace window rather than N × 10 seconds.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 703d0433f1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @@ -201,6 +202,12 @@ async function runGatewayCommand(opts: GatewayRunOpts) { | |||
| defaultRuntime.exit(1); | |||
| return; | |||
| } | |||
| if (process.env.OPENCLAW_SERVICE_MODE?.trim()) { | |||
There was a problem hiding this comment.
Gate stale-PID cleanup on a service env var that exists
This service-mode cleanup branch is effectively unreachable in normal supervised runs because it checks OPENCLAW_SERVICE_MODE, while the service environment wiring populates OPENCLAW_SERVICE_MARKER/OPENCLAW_SERVICE_KIND/OPENCLAW_SERVICE_VERSION and never sets OPENCLAW_SERVICE_MODE (see src/daemon/service-env.ts). In systemd/launchd contexts this condition remains false, so cleanStaleGatewayProcessesSync() is skipped and the port-bind race this change is meant to address can still occur.
Useful? React with 👍 / 👎.
| const useDetached = process.platform !== "win32"; | ||
| // Keep children attached to the parent process group so service managers | ||
| // (systemd/launchd) can reliably terminate the full tree on restart. | ||
| const useDetached = false; |
There was a problem hiding this comment.
Keep POSIX child runs detached for tree-kill semantics
Setting detached to false on POSIX undermines subtree termination: killProcessTree targets -pid first (process group) and only falls back to direct pid when that group is missing (src/process/kill-tree.ts). Non-detached children are typically not process-group leaders, so cancellation/restart often kills only the direct child and leaves spawned descendants (for example shell/npm grandchildren) running.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4dd3fe840b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const useDetached = process.platform !== "win32"; | ||
| // Keep children attached to the parent process group so service managers | ||
| // (systemd/launchd) can reliably terminate the full tree on restart. | ||
| const useDetached = false; |
There was a problem hiding this comment.
Keep POSIX child spawns detached for tree-kill correctness
Setting useDetached = false on POSIX breaks subtree termination: killProcessTree targets -pid first and only falls back to direct pid, so when the child is not a process-group leader the fallback kills only the immediate child and leaves spawned descendants running after cancel/restart. Fresh evidence in this diff is that the old no-detach fallback path was also removed (fallbacks: []), so there is no remaining path that preserves process-group-based cleanup.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ae3971e660
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) { | ||
| const stale = cleanStaleGatewayProcessesSync(port); |
There was a problem hiding this comment.
Detect node/bun gateway listeners before stale-PID cleanup
This new service-mode branch calls cleanStaleGatewayProcessesSync(port), but that helper only retains lsof records whose command contains openclaw (src/infra/restart-stale-pids.ts:63-74). Our service install path typically starts gateway via node/bun (src/daemon/program-args.ts:169-202), so stale listeners appear as node or bun and are filtered out. In the common systemd/launchd runtime, this startup cleanup will therefore miss the lingering gateway process and the port-bind restart race this commit targets can still reproduce.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b2e9e030f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const useDetached = process.platform !== "win32"; | ||
| // Keep children attached to the parent process group so service managers | ||
| // (systemd/launchd) can reliably terminate the full tree on restart. | ||
| const useDetached = false; |
There was a problem hiding this comment.
Keep POSIX child runs detached for tree-kill correctness
Setting detached to false on POSIX breaks the process-group assumption used by killProcessTree: that helper sends signals to -pid first (src/process/kill-tree.ts), which only reaches descendants when the child is the group leader. With non-detached children, cancel/restart paths will often kill only the direct child and leave grandchildren (for example shell/npm subprocesses) running, so this change can reintroduce orphaned workers.
Useful? React with 👍 / 👎.
| @@ -201,6 +202,14 @@ async function runGatewayCommand(opts: GatewayRunOpts) { | |||
| defaultRuntime.exit(1); | |||
| return; | |||
| } | |||
| if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) { | |||
| const stale = cleanStaleGatewayProcessesSync(port); | |||
There was a problem hiding this comment.
Detect node/bun listeners in service-mode stale PID cleanup
This new service-start cleanup call can miss the stale gateway process it is meant to clear: cleanStaleGatewayProcessesSync only keeps lsof rows whose command contains openclaw (src/infra/restart-stale-pids.ts), but gateway services are commonly launched as node/bun (src/daemon/program-args.ts), so stale listeners appear under those command names and are filtered out. In that common runtime, the startup cleanup still leaves port 18789 occupied and the restart bind race remains.
Useful? React with 👍 / 👎.
Root cause: Detached spawns escape cgroup on restarts, port races. Changes: Timeouts/SuccessExitStatus in unit; detached=false in supervisor; port clear service mode; SIGTERM tree-kill. Added imports for killProcessTree and cleanStaleGatewayProcessesSync. Tests: Manual orphan fixed (no PID 53919); 5x systemd restarts clean (journalctl/ps/ss). Closes openclaw#8069 openclaw#7421 openclaw#18420 openclaw#24151 openclaw#32348 openclaw#36793. Signed-off-by: Jesse Paul <drzin69@gmail.com>
…SIGTERM tree-kill Update child adapter unit test expectations for detached=false and SIGTERM process-tree termination with extended grace, matching current supervisor behavior. Signed-off-by: Jesse Paul <drzin69@gmail.com>
…eanup Use OPENCLAW_SERVICE_MARKER (daemon-provided) instead of OPENCLAW_SERVICE_MODE in gateway run mode, and pass the resolved runtime port into cleanStaleGatewayProcessesSync so --port overrides are respected before bind. Add restart stale-pid unit coverage for explicit port override behavior. Signed-off-by: Jesse Paul <drzin69@gmail.com>
7b2e9e0 to
add0b6e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: add0b6e1f7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const useDetached = process.platform !== "win32"; | ||
| // Keep children attached to the parent process group so service managers | ||
| // (systemd/launchd) can reliably terminate the full tree on restart. | ||
| const useDetached = false; |
There was a problem hiding this comment.
Keep POSIX child spawns detached for subtree cleanup
Setting useDetached = false (and removing the detach fallback) breaks the process-group assumption used by killProcessTree: on Unix it signals -pid first (src/process/kill-tree.ts), which only kills descendants when the child is a group leader. For non-detached child runs (for example shell/npm wrappers that spawn grandchildren), -pid usually does not map to a group, so cancellation/restart paths kill only the direct child and can leave subprocesses orphaned.
Useful? React with 👍 / 👎.
| return; | ||
| } | ||
| if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) { | ||
| const stale = cleanStaleGatewayProcessesSync(port); |
There was a problem hiding this comment.
Match stale-listener detection to node/bun service launches
This new service-start branch calls cleanStaleGatewayProcessesSync(port), but stale PID discovery only keeps lsof records whose command contains openclaw (parsePidsFromLsofOutput in src/infra/restart-stale-pids.ts). Gateway services are commonly launched through node/bun program arguments (src/daemon/program-args.ts), so those stale listeners are filtered out and the startup cleanup can still miss the process holding port 18789.
Useful? React with 👍 / 👎.
|
Deep-review pass complete. Keeping this PR open for now, but it needs a split before landing:
I’m intentionally not landing CI/workflow changes with this runtime batch. |
@spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
@spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
|
Landed on What we changed while landing:
Source commits reviewed:
Thanks @spirittechie. |
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com> (cherry picked from commit cc7e616)
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com> (cherry picked from commit cc7e616)
…, thanks @spirittechie) (cherry picked from commit e20f445)
…, thanks @spirittechie) (#1761) (cherry picked from commit e20f445) Co-authored-by: Peter Steinberger <steipete@gmail.com>
…OSIX The isServiceManagedRuntime() guard (added in e20f445) disabled detached mode when OPENCLAW_SERVICE_MARKER was set. Without detached, the child inherits the gateway PGID, so killProcessTree(-pid) gets ESRCH — only bash is killed while grandchildren hold pipes open, causing agents to hang until gateway restart. Remove the guard: always set detached: true on POSIX. systemd KillMode=control-group is cgroup-based and unaffected by PGID. macOS launchd signals the job's own PGID, but the gateway shutdown handler (abortEmbeddedPiRun) already kills all active runs before exit. Fixes openclaw#44790 Refs openclaw#38463 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
…, thanks @spirittechie) Co-authored-by: Jesse Paul <drzin69@gmail.com>
Root cause: Detached spawns escape cgroup on restarts, port races. Changes: Timeouts/SuccessExitStatus in unit; detached=false in supervisor; port clear service mode; SIGTERM tree-kill. Added imports for killProcessTree and cleanStaleGatewayProcessesSync. Tests: Manual orphan fixed (no PID 53919); 5x systemd restarts clean (journalctl/ps/ss). Closes #8069 #7421 #18420 #24151 #32348 #36793.
Summary
Describe the problem and fix in 2–5 bullets:
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
List user-visible changes (including defaults/config).
If none, write
None.Security Impact (required)
Yes/No)Yes/No)Yes/No)Yes/No)Yes/No)Yes, explain risk + mitigation:Repro + Verification
Environment
Steps
Expected
Actual
Evidence
Attach at least one:
Human Verification (required)
What you personally verified (not just CI), and how:
Compatibility / Migration
Yes/No)Yes/No)Yes/No)Failure Recovery (if this breaks)
Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write
None.