Skip to content

fix(systemd,gateway): resolve orphan subprocesses and port 18789 hangs#38463

Closed
spirittechie wants to merge 4 commits intoopenclaw:mainfrom
spirittechie:fix-systemd-orphans
Closed

fix(systemd,gateway): resolve orphan subprocesses and port 18789 hangs#38463
spirittechie wants to merge 4 commits intoopenclaw:mainfrom
spirittechie:fix-systemd-orphans

Conversation

@spirittechie
Copy link
Copy Markdown
Contributor

Root cause: Detached spawns escape cgroup on restarts, port races. Changes: Timeouts/SuccessExitStatus in unit; detached=false in supervisor; port clear service mode; SIGTERM tree-kill. Added imports for killProcessTree and cleanStaleGatewayProcessesSync. Tests: Manual orphan fixed (no PID 53919); 5x systemd restarts clean (journalctl/ps/ss). Closes #8069 #7421 #18420 #24151 #32348 #36793.

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem:
  • Why it matters:
  • What changed:
  • What did NOT change (scope boundary):

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #

User-visible / Behavior Changes

List user-visible changes (including defaults/config).
If none, write None.

Security Impact (required)

  • New permissions/capabilities? (Yes/No)
  • Secrets/tokens handling changed? (Yes/No)
  • New/changed network calls? (Yes/No)
  • Command/tool execution surface changed? (Yes/No)
  • Data access scope changed? (Yes/No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS:
  • Runtime/container:
  • Model/provider:
  • Integration/channel (if any):
  • Relevant config (redacted):

Steps

Expected

Actual

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
  • Edge cases checked:
  • What you did not verify:

Compatibility / Migration

  • Backward compatible? (Yes/No)
  • Config/env changes? (Yes/No)
  • Migration needed? (Yes/No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly:
  • Files/config to restore:
  • Known bad symptoms reviewers should watch for:

Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write None.

  • Risk:
    • Mitigation:

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: XS labels Mar 7, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9dce43af0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/gateway-cli/run.ts Outdated
@@ -201,6 +202,12 @@ async function runGatewayCommand(opts: GatewayRunOpts) {
defaultRuntime.exit(1);
return;
}
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate stale-pid cleanup on an env var that is actually set

This new branch is effectively dead code because it checks OPENCLAW_SERVICE_MODE, but the service environment builder sets OPENCLAW_SERVICE_MARKER/OPENCLAW_SERVICE_KIND/OPENCLAW_SERVICE_VERSION and never defines OPENCLAW_SERVICE_MODE (repo-wide search only finds this single reference). In systemd/launchd runs, cleanStaleGatewayProcessesSync() therefore never executes, so the restart path still races stale listeners and the intended port-18789 fix does not apply.

Useful? React with 👍 / 👎.

const useDetached = process.platform !== "win32";
// Keep children attached to the parent process group so service managers
// (systemd/launchd) can reliably terminate the full tree on restart.
const useDetached = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve detached process groups for tree-kill semantics

For POSIX child runs, forcing useDetached = false breaks the assumptions in killProcessTree: it signals -pid to terminate a process group and only reaches descendants when the child is group leader. With non-detached children, that group usually does not exist and the fallback kills only the direct child, so grandchildren spawned by supervised commands can survive cancel/restart and become orphaned.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 7, 2026

Greptile Summary

This PR addresses orphan subprocess and port-binding issues caused by detached child processes escaping the systemd cgroup on restarts. The three-file change covers: (1) making all child spawns non-detached so systemd's KillMode=control-group can reliably terminate the full tree, (2) adding proper TimeoutStopSec/TimeoutStartSec/SuccessExitStatus directives to the generated systemd unit, and (3) adding a pre-bind stale gateway process cleanup when running in service mode.

Key issues found:

  • Port mismatch in stale PID cleanup (run.ts lines 205–210): cleanStaleGatewayProcessesSync() resolves the gateway port internally and does not see the --port CLI override resolved a few lines later. If the service is started on a non-default port, the scan will target the wrong port and miss the actual stale process, leaving the EADDRINUSE problem unsolved.
  • Grandchild orphans outside systemd cgroups (child.ts lines 120–131): With detached: false, children are not process group leaders, so killProcessTree's group-signal (kill(-pid, ...)) silently falls back to a single-PID signal. Grandchildren are not reached in environments that lack cgroup management (macOS/launchd, CI, manual process management). This is a regression vs. the previous POSIX detached: true behaviour that the comment does not acknowledge.
  • 10-second grace period vs. multi-child shutdown (child.ts line 122): graceMs: 10_000 is 3× the default and, if children are stopped sequentially, the cumulative grace window can approach or exceed TimeoutStopSec=30, risking mid-cleanup force-kills by systemd.

Confidence Score: 2/5

  • Not safe to merge as-is: the stale PID cleanup scans the wrong port when --port is overridden, which directly undermines the core fix this PR is intended to deliver.
  • The systemd unit changes are correct and low-risk. The child adapter's detached:false change is directionally correct for systemd cgroup management. However, the port mismatch bug in cleanStaleGatewayProcessesSync means the primary service-mode fix may be ineffective when a non-default port is used, and the grandchild-kill regression under non-cgroup environments introduces a new class of orphan that was previously handled correctly.
  • src/cli/gateway-cli/run.ts (port mismatch in stale PID cleanup) and src/process/supervisor/adapters/child.ts (grandchild kill regression, excessive grace period)

Last reviewed commit: d9dce43

Comment thread src/cli/gateway-cli/run.ts Outdated
Comment on lines +205 to +210
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
const stale = cleanStaleGatewayProcessesSync();
if (stale.length > 0) {
gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`);
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale PID cleanup scans wrong port when --port is overridden

cleanStaleGatewayProcessesSync() resolves the gateway port internally via resolveGatewayPort(undefined, process.env) (from config/paths.js), completely independent of the port that will actually be bound. The port used by the gateway is resolved a few lines later as portOverride ?? resolveGatewayPort(cfg), where portOverride comes from the CLI --port option.

If OPENCLAW_SERVICE_MODE is set and the gateway is started on a non-default port (e.g. --port 9999), the stale process scan will run against the default port (e.g. 18789), miss any stale processes actually holding port 9999, and the gateway will still hit EADDRINUSE — exactly the problem this fix is trying to solve.

The port needs to be resolved before calling cleanStaleGatewayProcessesSync, and the resolved port should be passed in rather than re-derived inside the function. For example:

if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
  const stale = cleanStaleGatewayProcessesSync(port);   // pass the already-resolved port
  if (stale.length > 0) {
    gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`);
  }
}

This requires cleanStaleGatewayProcessesSync to accept an optional port parameter, falling back to its own resolution only when not provided.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/cli/gateway-cli/run.ts
Line: 205-210

Comment:
**Stale PID cleanup scans wrong port when `--port` is overridden**

`cleanStaleGatewayProcessesSync()` resolves the gateway port internally via `resolveGatewayPort(undefined, process.env)` (from `config/paths.js`), completely independent of the port that will actually be bound. The port used by the gateway is resolved a few lines later as `portOverride ?? resolveGatewayPort(cfg)`, where `portOverride` comes from the CLI `--port` option.

If `OPENCLAW_SERVICE_MODE` is set and the gateway is started on a non-default port (e.g. `--port 9999`), the stale process scan will run against the default port (e.g. `18789`), miss any stale processes actually holding port `9999`, and the gateway will still hit `EADDRINUSE` — exactly the problem this fix is trying to solve.

The port needs to be resolved before calling `cleanStaleGatewayProcessesSync`, and the resolved port should be passed in rather than re-derived inside the function. For example:

```ts
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
  const stale = cleanStaleGatewayProcessesSync(port);   // pass the already-resolved port
  if (stale.length > 0) {
    gatewayLog.info(`service-mode: cleared ${stale.length} stale gateway pid(s) before bind`);
  }
}
```

This requires `cleanStaleGatewayProcessesSync` to accept an optional port parameter, falling back to its own resolution only when not provided.

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +120 to +131
if (signal === "SIGTERM") {
if (pid) {
killProcessTree(pid, { graceMs: 10_000 });
} else {
try {
child.kill("SIGTERM");
} catch {
// ignore kill errors
}
}
return;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

killProcessTree with detached: false cannot kill grandchildren outside systemd cgroups

With useDetached = false, spawned children are not process group leaders — they inherit the parent's PGID rather than forming a new one. When killProcessTree(pid, { graceMs: 10_000 }) runs on Unix (killProcessTreeUnix), it first tries process.kill(-pid, "SIGTERM") (signal to process group PGID=pid). Because the child is not a group leader, PGID=childPid doesn't exist, so this throws ESRCH and falls through to process.kill(pid, "SIGTERM") — which only signals the direct child.

Ten seconds later the escalation path has the same problem: isProcessAlive(-pid) returns false (ESRCH again), so it falls through to process.kill(pid, "SIGKILL"), which again only reaches the direct child.

Practical impact: Any grandchild processes (children of the supervised child) will be left alive and become true orphans when the supervisor kills a direct child in a non-systemd context (macOS/launchd, manual process management, CI environments). This is a regression compared to the previous detached: true behaviour on POSIX, where kill(-pid, SIGTERM) reliably reached the whole subtree because each child had its own process group.

For the pure systemd case this is acceptable because KillMode=control-group terminates the whole cgroup regardless. However, the adapter is general-purpose code used beyond the systemd path, and the comment only mentions "service managers" without acknowledging this tradeoff. At minimum, the limitation should be documented, and callers in non-cgroup environments should ensure they have an alternative termination strategy for deeper process trees.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/supervisor/adapters/child.ts
Line: 120-131

Comment:
**`killProcessTree` with `detached: false` cannot kill grandchildren outside systemd cgroups**

With `useDetached = false`, spawned children are not process group leaders — they inherit the parent's PGID rather than forming a new one. When `killProcessTree(pid, { graceMs: 10_000 })` runs on Unix (`killProcessTreeUnix`), it first tries `process.kill(-pid, "SIGTERM")` (signal to process group PGID=pid). Because the child is not a group leader, PGID=childPid doesn't exist, so this throws ESRCH and falls through to `process.kill(pid, "SIGTERM")` — which only signals the direct child.

Ten seconds later the escalation path has the same problem: `isProcessAlive(-pid)` returns `false` (ESRCH again), so it falls through to `process.kill(pid, "SIGKILL")`, which again only reaches the direct child.

**Practical impact**: Any grandchild processes (children of the supervised child) will be left alive and become true orphans when the supervisor kills a direct child in a non-systemd context (macOS/launchd, manual process management, CI environments). This is a regression compared to the previous `detached: true` behaviour on POSIX, where `kill(-pid, SIGTERM)` reliably reached the whole subtree because each child had its own process group.

For the pure systemd case this is acceptable because `KillMode=control-group` terminates the whole cgroup regardless. However, the adapter is general-purpose code used beyond the systemd path, and the comment only mentions "service managers" without acknowledging this tradeoff. At minimum, the limitation should be documented, and callers in non-cgroup environments should ensure they have an alternative termination strategy for deeper process trees.

How can I resolve this? If you propose a fix, please make it concise.

@@ -126,6 +117,18 @@ export async function createChildAdapter(params: {

const kill = (signal?: NodeJS.Signals) => {
const pid = child.pid ?? undefined;
if (signal === "SIGTERM") {
if (pid) {
killProcessTree(pid, { graceMs: 10_000 });
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graceMs: 10_000 conflicts with TimeoutStopSec=30 in a multi-child scenario

killProcessTree is called with a 10-second grace period (vs the 3-second default). With TimeoutStopSec=30 in the systemd unit, systemd will force-kill the cgroup 30 seconds after sending SIGTERM. If the supervisor stops N children in sequence (each taking up to 10 seconds to escalate to SIGKILL), the total grace time could approach or exceed systemd's TimeoutStopSec, causing systemd to force-kill mid-cleanup and potentially leaving state inconsistent.

Consider either reducing the grace period to be proportionally safe (e.g. the existing 3-second default), or stopping children in parallel so the maximum blocking time is bounded by a single grace window rather than N × 10 seconds.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/process/supervisor/adapters/child.ts
Line: 122

Comment:
**`graceMs: 10_000` conflicts with `TimeoutStopSec=30` in a multi-child scenario**

`killProcessTree` is called with a 10-second grace period (vs the 3-second default). With `TimeoutStopSec=30` in the systemd unit, systemd will force-kill the cgroup 30 seconds after sending SIGTERM. If the supervisor stops N children in sequence (each taking up to 10 seconds to escalate to SIGKILL), the total grace time could approach or exceed systemd's `TimeoutStopSec`, causing systemd to force-kill mid-cleanup and potentially leaving state inconsistent.

Consider either reducing the grace period to be proportionally safe (e.g. the existing 3-second default), or stopping children in parallel so the maximum blocking time is bounded by a single grace window rather than N × 10 seconds.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 703d0433f1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/gateway-cli/run.ts Outdated
@@ -201,6 +202,12 @@ async function runGatewayCommand(opts: GatewayRunOpts) {
defaultRuntime.exit(1);
return;
}
if (process.env.OPENCLAW_SERVICE_MODE?.trim()) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate stale-PID cleanup on a service env var that exists

This service-mode cleanup branch is effectively unreachable in normal supervised runs because it checks OPENCLAW_SERVICE_MODE, while the service environment wiring populates OPENCLAW_SERVICE_MARKER/OPENCLAW_SERVICE_KIND/OPENCLAW_SERVICE_VERSION and never sets OPENCLAW_SERVICE_MODE (see src/daemon/service-env.ts). In systemd/launchd contexts this condition remains false, so cleanStaleGatewayProcessesSync() is skipped and the port-bind race this change is meant to address can still occur.

Useful? React with 👍 / 👎.

const useDetached = process.platform !== "win32";
// Keep children attached to the parent process group so service managers
// (systemd/launchd) can reliably terminate the full tree on restart.
const useDetached = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep POSIX child runs detached for tree-kill semantics

Setting detached to false on POSIX undermines subtree termination: killProcessTree targets -pid first (process group) and only falls back to direct pid when that group is missing (src/process/kill-tree.ts). Non-detached children are typically not process-group leaders, so cancellation/restart often kills only the direct child and leaves spawned descendants (for example shell/npm grandchildren) running.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4dd3fe840b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const useDetached = process.platform !== "win32";
// Keep children attached to the parent process group so service managers
// (systemd/launchd) can reliably terminate the full tree on restart.
const useDetached = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep POSIX child spawns detached for tree-kill correctness

Setting useDetached = false on POSIX breaks subtree termination: killProcessTree targets -pid first and only falls back to direct pid, so when the child is not a process-group leader the fallback kills only the immediate child and leaves spawned descendants running after cancel/restart. Fresh evidence in this diff is that the old no-detach fallback path was also removed (fallbacks: []), so there is no remaining path that preserves process-group-based cleanup.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ae3971e660

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +205 to +206
if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) {
const stale = cleanStaleGatewayProcessesSync(port);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect node/bun gateway listeners before stale-PID cleanup

This new service-mode branch calls cleanStaleGatewayProcessesSync(port), but that helper only retains lsof records whose command contains openclaw (src/infra/restart-stale-pids.ts:63-74). Our service install path typically starts gateway via node/bun (src/daemon/program-args.ts:169-202), so stale listeners appear as node or bun and are filtered out. In the common systemd/launchd runtime, this startup cleanup will therefore miss the lingering gateway process and the port-bind restart race this commit targets can still reproduce.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b2e9e030f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const useDetached = process.platform !== "win32";
// Keep children attached to the parent process group so service managers
// (systemd/launchd) can reliably terminate the full tree on restart.
const useDetached = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep POSIX child runs detached for tree-kill correctness

Setting detached to false on POSIX breaks the process-group assumption used by killProcessTree: that helper sends signals to -pid first (src/process/kill-tree.ts), which only reaches descendants when the child is the group leader. With non-detached children, cancel/restart paths will often kill only the direct child and leave grandchildren (for example shell/npm subprocesses) running, so this change can reintroduce orphaned workers.

Useful? React with 👍 / 👎.

@@ -201,6 +202,14 @@ async function runGatewayCommand(opts: GatewayRunOpts) {
defaultRuntime.exit(1);
return;
}
if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) {
const stale = cleanStaleGatewayProcessesSync(port);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Detect node/bun listeners in service-mode stale PID cleanup

This new service-start cleanup call can miss the stale gateway process it is meant to clear: cleanStaleGatewayProcessesSync only keeps lsof rows whose command contains openclaw (src/infra/restart-stale-pids.ts), but gateway services are commonly launched as node/bun (src/daemon/program-args.ts), so stale listeners appear under those command names and are filtered out. In that common runtime, the startup cleanup still leaves port 18789 occupied and the restart bind race remains.

Useful? React with 👍 / 👎.

Root cause: Detached spawns escape cgroup on restarts, port races.
Changes: Timeouts/SuccessExitStatus in unit; detached=false in supervisor; port clear service mode; SIGTERM tree-kill. Added imports for killProcessTree and cleanStaleGatewayProcessesSync.
Tests: Manual orphan fixed (no PID 53919); 5x systemd restarts clean (journalctl/ps/ss).
Closes openclaw#8069 openclaw#7421 openclaw#18420 openclaw#24151 openclaw#32348 openclaw#36793.

Signed-off-by: Jesse Paul <drzin69@gmail.com>
…SIGTERM tree-kill

Update child adapter unit test expectations for detached=false and SIGTERM process-tree termination with extended grace, matching current supervisor behavior.

Signed-off-by: Jesse Paul <drzin69@gmail.com>
…eanup

Use OPENCLAW_SERVICE_MARKER (daemon-provided) instead of OPENCLAW_SERVICE_MODE in gateway run mode, and pass the resolved runtime port into cleanStaleGatewayProcessesSync so --port overrides are respected before bind.
Add restart stale-pid unit coverage for explicit port override behavior.

Signed-off-by: Jesse Paul <drzin69@gmail.com>
@spirittechie spirittechie force-pushed the fix-systemd-orphans branch from 7b2e9e0 to add0b6e Compare March 7, 2026 20:25
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: add0b6e1f7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const useDetached = process.platform !== "win32";
// Keep children attached to the parent process group so service managers
// (systemd/launchd) can reliably terminate the full tree on restart.
const useDetached = false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep POSIX child spawns detached for subtree cleanup

Setting useDetached = false (and removing the detach fallback) breaks the process-group assumption used by killProcessTree: on Unix it signals -pid first (src/process/kill-tree.ts), which only kills descendants when the child is a group leader. For non-detached child runs (for example shell/npm wrappers that spawn grandchildren), -pid usually does not map to a group, so cancellation/restart paths kill only the direct child and can leave subprocesses orphaned.

Useful? React with 👍 / 👎.

return;
}
if (process.env.OPENCLAW_SERVICE_MARKER?.trim()) {
const stale = cleanStaleGatewayProcessesSync(port);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match stale-listener detection to node/bun service launches

This new service-start branch calls cleanStaleGatewayProcessesSync(port), but stale PID discovery only keeps lsof records whose command contains openclaw (parsePidsFromLsofOutput in src/infra/restart-stale-pids.ts). Gateway services are commonly launched through node/bun program arguments (src/daemon/program-args.ts), so those stale listeners are filtered out and the startup cleanup can still miss the process holding port 18789.

Useful? React with 👍 / 👎.

@steipete
Copy link
Copy Markdown
Contributor

steipete commented Mar 7, 2026

Deep-review pass complete.

Keeping this PR open for now, but it needs a split before landing:

  1. Runtime/process fixes (candidate to land):
  • pre-bind stale cleanup with explicit port in gateway run
  • cleanStaleGatewayProcessesSync(portOverride) + tests
  • systemd unit lifecycle/timeouts (KillMode=control-group, stop/start timeout tuning)
  1. Separate follow-up PR:
  • CI workflow fetch-depth: 0 changes (not part of this runtime fix)

I’m intentionally not landing CI/workflow changes with this runtime batch.

steipete added a commit that referenced this pull request Mar 7, 2026
steipete added a commit that referenced this pull request Mar 7, 2026
@steipete
Copy link
Copy Markdown
Contributor

steipete commented Mar 7, 2026

Landed on main in two split commits:

What we changed while landing:

  • kept and hardened service-mode stale-process cleanup before gateway bind using the resolved run port
  • added cleanStaleGatewayProcessesSync(portOverride) support + regression coverage
  • added systemd unit lifecycle hardening (TimeoutStartSec, TimeoutStopSec, SuccessExitStatus) with tests
  • rewrote child adapter behavior to preserve existing default POSIX detach behavior, but disable detach specifically in service-managed runtime (OPENCLAW_SERVICE_MARKER) so service cgroup shutdown can clean descendants reliably
  • added changelog entry in 2026.3.7 fixes
  • intentionally excluded unrelated CI workflow changes from this landing

Source commits reviewed:

Thanks @spirittechie.

@steipete steipete closed this Mar 7, 2026
vincentkoc pushed a commit to BryanTegomoh/openclaw-upstream that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
vincentkoc pushed a commit to BryanTegomoh/openclaw-upstream that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
openperf pushed a commit to openperf/moltbot that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
openperf pushed a commit to openperf/moltbot that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
mcaxtr pushed a commit to mcaxtr/openclaw that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
mcaxtr pushed a commit to mcaxtr/openclaw that referenced this pull request Mar 8, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
GordonSH-oss pushed a commit to GordonSH-oss/openclaw that referenced this pull request Mar 9, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
jenawant pushed a commit to jenawant/openclaw that referenced this pull request Mar 10, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
V-Gutierrez pushed a commit to V-Gutierrez/openclaw-vendor that referenced this pull request Mar 17, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 21, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
(cherry picked from commit cc7e616)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 21, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
(cherry picked from commit cc7e616)
alexey-pelykh pushed a commit to remoteclaw/remoteclaw that referenced this pull request Mar 21, 2026
alexey-pelykh added a commit to remoteclaw/remoteclaw that referenced this pull request Mar 21, 2026
…, thanks @spirittechie) (#1761)

(cherry picked from commit e20f445)

Co-authored-by: Peter Steinberger <steipete@gmail.com>
robbyczgw-cla added a commit to robbyczgw-cla/openclaw that referenced this pull request Mar 22, 2026
…OSIX

The isServiceManagedRuntime() guard (added in e20f445) disabled
detached mode when OPENCLAW_SERVICE_MARKER was set. Without detached,
the child inherits the gateway PGID, so killProcessTree(-pid) gets
ESRCH — only bash is killed while grandchildren hold pipes open,
causing agents to hang until gateway restart.

Remove the guard: always set detached: true on POSIX. systemd
KillMode=control-group is cgroup-based and unaffected by PGID. macOS
launchd signals the job's own PGID, but the gateway shutdown handler
(abortEmbeddedPiRun) already kills all active runs before exit.

Fixes openclaw#44790
Refs openclaw#38463

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…, thanks @spirittechie)

Co-authored-by: Jesse Paul <drzin69@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli CLI command changes gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway spawns duplicate orphan process causing port conflict

2 participants