Skip to content

fix(daemon): hand off systemd self-restarts#66735

Open
alexlomt wants to merge 2 commits intoopenclaw:mainfrom
alexlomt:fix/systemd-self-restart-handoff
Open

fix(daemon): hand off systemd self-restarts#66735
alexlomt wants to merge 2 commits intoopenclaw:mainfrom
alexlomt:fix/systemd-self-restart-handoff

Conversation

@alexlomt
Copy link
Copy Markdown
Contributor

@alexlomt alexlomt commented Apr 14, 2026

Summary

On Linux/systemd, restarting the gateway from inside the running gateway service can kill the calling CLI process with SIGTERM even though the service restart succeeds.

This happens because the restart command runs inside the same openclaw-gateway.service cgroup, and the unit uses KillMode=control-group. A plain systemctl --user restart openclaw-gateway.service therefore kills the gateway and the caller together.

This change makes the Linux systemd adapter treat self-restarts differently from external restarts:

  • if the caller is already inside the target systemd service unit, schedule a detached systemd-run --user handoff and return scheduled
  • otherwise keep the existing synchronous restart path and return completed

Why this is safe

  • Linux only
  • systemd only
  • restart path only
  • external restart behaviour is preserved
  • no unit-file semantics changed
  • no change to launchd or macOS behaviour

Implementation

  • detect whether the current process is running inside the target systemd service cgroup
  • if yes, schedule a transient user unit outside that cgroup with systemd-run --user
  • wait for the current process PID to exit, then run systemctl --user restart <unit> from the detached handoff
  • surface handoff failures clearly
  • treat a matching expected SIGTERM during handoff as restart semantics instead of a normal stop

Validation performed locally

  • corepack pnpm exec oxfmt --check src/daemon/systemd.ts src/daemon/systemd.test.ts src/cli/gateway-cli/run-loop.ts src/cli/gateway-cli/run-loop.test.ts src/agents/subagent-registry.steer-restart.test.ts src/gateway/server.sessions.gateway-server-sessions-a.test.ts
  • corepack pnpm exec vitest run src/daemon/systemd.test.ts src/cli/gateway-cli/run-loop.test.ts --reporter=verbose
  • corepack pnpm exec vitest run --config test/vitest/vitest.agents.config.ts src/agents/subagent-registry.steer-restart.test.ts --reporter=verbose
  • corepack pnpm exec vitest run --config test/vitest/vitest.gateway-server.config.ts src/gateway/server.sessions.gateway-server-sessions-a.test.ts --reporter=verbose

Notes

This follows the same scheduled-restart contract already used elsewhere in the daemon lifecycle, rather than changing systemd unit kill behaviour.

Review Focus

  • src/daemon/systemd.ts
  • src/cli/gateway-cli/run-loop.ts
  • src/daemon/systemd.test.ts
  • src/cli/gateway-cli/run-loop.test.ts

Scope Boundary

  • Linux only
  • systemd only
  • restart path only
  • external restart behavior remains synchronous and unchanged

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? Yes
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: self-restarts inside the managed unit now schedule a detached systemd-run --user handoff only for the in-unit restart path. External restarts are unchanged, marker parsing is validated strictly, and tests cover malformed markers, custom unit names, and bounded restart draining.

Human Verification

  • Verified scenarios: self-restart handoff scheduling, marker-based SIGTERM restart handling, malformed marker rejection, custom unit normalization, and bounded supervised restart draining
  • Edge cases checked: detached restart failure cleanup, marker timestamp validation, current-main test harness after rebase
  • What I did not verify: live smoke on a real systemd user service manager

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels Apr 14, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR fixes a SIGTERM-kills-caller issue on Linux/systemd by detecting when restartSystemdService is invoked from within the target service cgroup and scheduling a detached systemd-run --user handoff instead of a synchronous systemctl restart. External restarts keep the existing synchronous path unchanged. Both paths have new regression test coverage and the cgroup-matching logic handles cgroups v1/v2 correctly.

Confidence Score: 5/5

  • Safe to merge; all findings are minor style/robustness suggestions with no correctness impact.
  • The core logic (cgroup detection, argument passing, handoff scheduling, error surfacing) is correct and well-tested. Both remaining comments are P2: an unbounded PID-wait loop that is operationally low-risk given normal SIGTERM handling, and an unnecessarily broad parameter type.
  • No files require special attention.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/daemon/systemd.ts
Line: 125-135

Comment:
**Unbounded PID-wait loop in handoff script**

The `while kill -0 "$wait_pid"` loop has no timeout. If the gateway process ignores `SIGTERM` and never exits, the transient handoff unit will spin indefinitely, and the target service will never restart. Consider adding a deadline (e.g. 30 s) after which the loop falls through or bails with a non-zero exit so the failure is surfaced.

```suggestion
function buildSystemdRestartHandoffScript(): string {
  return `wait_pid="$1"
 target_unit="$2"
 deadline=$(( $(date +%s) + 30 ))
 if [ -n "$wait_pid" ] && [ "$wait_pid" -gt 1 ] 2>/dev/null; then
   while kill -0 "$wait_pid" >/dev/null 2>&1; do
     if [ "$(date +%s)" -ge "$deadline" ]; then
       echo "openclaw-systemd-restart-handoff: timed out waiting for pid $wait_pid to exit" >&2
       exit 1
     fi
     sleep 0.1
   done
 fi
 exec systemctl --user restart "$target_unit"
 `;
}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/daemon/systemd.ts
Line: 80-90

Comment:
**Unnecessary `undefined` in parameter type**

`cgroupPath` is typed as `string | undefined`, but every call site passes a `string` (the elements of `string[]` from `readCurrentProcessCgroupPaths`). The `| undefined` branch is dead code. Narrowing the type keeps the signature honest.

```suggestion
function matchesSystemdCgroupUnit(cgroupPath: string, unitName: string): boolean {
  const normalizedPath = normalizeOptionalString(cgroupPath);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(daemon): hand off systemd self-resta..." | Re-trigger Greptile

Comment thread src/daemon/systemd.ts
Comment thread src/daemon/systemd.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d78d9eb6ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/daemon/systemd.ts Outdated
@alexlomt
Copy link
Copy Markdown
Contributor Author

Pushed f852688b9f with the follow-up fixes.

What changed:

  • self-restart detection now uses current-process evidence instead of merged target env
  • cgroup detection stays primary, with fallback limited to real systemd runtime hints plus matching gateway markers
  • the detached handoff script now has a 30s deadline and warn-and-continue behavior
  • narrowed matchesSystemdCgroupUnit(cgroupPath) to string
  • added regression coverage for merged gateway env staying synchronous and for the runtime-hint fallback

Validation:

  • node scripts/test-projects.mjs src/daemon/systemd.test.ts src/daemon/service.test.ts
  • live external smoke against a disposable user unit: returned {"outcome":"completed"} and restarted immediately
  • live self-restart smoke from inside a disposable user unit: returned {"outcome":"scheduled"} and restarted via detached handoff

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f852688b9f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/daemon/systemd.ts
@alexlomt
Copy link
Copy Markdown
Contributor Author

Unrelated security-fast workflow fix is split out into #66884 so this daemon PR can stay focused. That PR hardens openclaw-cross-os-release-checks-reusable.yml against the current zizmor template-injection findings and adds explicit inputs.mode validation.

@openclaw-barnacle openclaw-barnacle Bot added cli CLI command changes size: L and removed size: M labels Apr 19, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 232b40df87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/gateway-cli/run-loop.ts Outdated
@alexlomt
Copy link
Copy Markdown
Contributor Author

Follow-up added in 232b40df87:

  • write a short-lived systemd restart expectation marker before the detached handoff restarts the unit
  • treat the resulting gateway SIGTERM as restart semantics instead of a normal stop when that marker is present
  • add targeted coverage in src/cli/gateway-cli/run-loop.test.ts alongside the daemon handoff tests

Local verification for the follow-up:

  • corepack pnpm exec oxfmt --check src/daemon/systemd.ts src/daemon/systemd.test.ts src/cli/gateway-cli/run-loop.ts src/cli/gateway-cli/run-loop.test.ts
  • corepack pnpm exec vitest run --config test/vitest/vitest.daemon.config.ts src/daemon/systemd.test.ts
  • corepack pnpm exec vitest run --config test/vitest/vitest.cli.config.ts src/cli/gateway-cli/run-loop.test.ts

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Apr 20, 2026
@alexlomt alexlomt force-pushed the fix/systemd-self-restart-handoff branch from dd5ac56 to cf56098 Compare April 20, 2026 00:12
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cf560980da

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/gateway-cli/run-loop.ts Outdated
Comment thread src/daemon/systemd.ts Outdated
@alexlomt alexlomt force-pushed the fix/systemd-self-restart-handoff branch from cf56098 to f6be80b Compare April 20, 2026 00:56
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f6be80b9b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/cli/gateway-cli/run-loop.ts Outdated
@alexlomt alexlomt force-pushed the fix/systemd-self-restart-handoff branch from f6be80b to d43dff0 Compare April 24, 2026 21:04
@openclaw-barnacle openclaw-barnacle Bot removed the agents Agent runtime and tooling label Apr 24, 2026
@alexlomt
Copy link
Copy Markdown
Contributor Author

High-value ops fix for Linux/systemd installs: restarting from inside the managed unit can SIGTERM the caller because of KillMode=control-group. This PR only changes the self-restart path; external restarts remain unchanged. Review focus: src/daemon/systemd.ts, src/cli/gateway-cli/run-loop.ts, and the matching daemon/run-loop tests. Current branch is green, including custom-unit normalization, malformed marker rejection, and supervisor-bounded restart draining.

@vincentkoc
Copy link
Copy Markdown
Member

ProjectClownfish pushed a narrow repair to this branch so the original contributor path can stay canonical.

Source PR: #66735
Validation: pnpm check:changed
Contributor credit is preserved in the branch history and PR context.

@vincentkoc vincentkoc force-pushed the fix/systemd-self-restart-handoff branch from 7a191b5 to 9eeb526 Compare April 28, 2026 18:43
@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 28, 2026

Codex review: needs maintainer review before merge.

What this changes:

The PR adds Linux/systemd self-restart detection, a detached systemd-run --user restart handoff, restart expectation markers consumed by the gateway SIGTERM path, regression tests, and a changelog entry.

Maintainer follow-up before merge:

This is a focused contributor implementation with maintainer follow-up, but the branch is dirty/unmergeable and the remaining work is maintainer rebase/review plus live systemd validation rather than a separate autonomous replacement fix.

Security review:

Security review cleared: No concrete security or supply-chain regression was found; the new command execution path is limited to local Linux/user-systemd self-restarts and passes dynamic values as argv plus quoted shell variables.

Review details

Best possible solution:

Rebase this implementation onto current main, preserve the split between synchronous external restarts and detached in-unit self-restarts, keep the marker-based SIGTERM path bounded by the systemd stop budget, then land after targeted daemon/CLI tests, changed-gate validation, and a disposable user-systemd smoke for both external and self-restart paths.

Do we have a high-confidence way to reproduce the issue?

Yes. A high-confidence source-level reproduction is to invoke a gateway restart from inside the generated openclaw-gateway.service while it uses KillMode=control-group; current main runs systemctl --user restart synchronously from that process, and systemd’s documented kill behavior can terminate the caller with the service cgroup.

Is this the best way to solve the issue?

Yes as a direction, pending rebase and smoke. The proposed fix is narrower than changing unit kill semantics: it only hands off same-unit systemd restarts to a transient user unit, keeps external restarts synchronous, and adds a marker so the expected restart SIGTERM is not treated as a normal stop.

Acceptance criteria:

  • pnpm test src/daemon/systemd.test.ts src/cli/gateway-cli/run-loop.test.ts
  • pnpm check:changed in Testbox after rebase
  • Live disposable Linux user-systemd smoke: external restart returns completed and in-unit self-restart returns scheduled

What I checked:

  • Current main restart path is synchronous: restartSystemdService still delegates to runSystemdServiceAction with action restart and returns { outcome: "completed" }; there is no self-cgroup detection or detached handoff on current main. (src/daemon/systemd.ts:729, a256745323ab)
  • Generated systemd unit uses control-group kill semantics: The generated service unit sets TimeoutStopSec=30, SuccessExitStatus=0 143, and KillMode=control-group, matching the PR premise that restart stops the whole service cgroup. (src/daemon/systemd-unit.ts:80, a256745323ab)
  • Current main lacks systemd marker SIGTERM handling: The current SIGTERM handler only consumes the generic gateway restart intent and otherwise stops; it does not inspect a systemd restart expectation marker. (src/cli/gateway-cli/run-loop.ts:394, a256745323ab)
  • PR head implements the handoff path: At PR head, restartSystemdService detects whether the current process is in the target unit, schedules a detached systemd-run --user handoff, writes scheduled restart output, and returns { outcome: "scheduled" }; the external path remains synchronous. (src/daemon/systemd.ts:943, 9eeb52686ae5)
  • PR head handles expected systemd restart SIGTERM: At PR head, SIGTERM first consumes consumeSystemdRestartExpectationMarker(process.env) and routes matching signals through supervised-restart before falling back to the existing generic restart intent. (src/cli/gateway-cli/run-loop.ts:440, 9eeb52686ae5)
  • Regression coverage targets the edge cases: The PR adds daemon and run-loop tests for custom unit normalization, self-handoff scheduling, external synchronous restart preservation, malformed marker rejection, and supervisor-bounded marker restart draining. (src/daemon/systemd.test.ts:1223, 9eeb52686ae5)

Likely related people:

  • steipete: Recent main history shows repeated maintenance and refactors in src/daemon/systemd.ts, src/daemon/systemd-unit.ts, src/daemon/service.ts, and src/cli/gateway-cli/run-loop.ts, including the current restart and lifecycle seams touched by this PR. (role: current-main maintainer for daemon and gateway lifecycle paths; confidence: high; commits: dd0f5937d2b5, 67f1266fe8a2, 616f24fd49cd; files: src/daemon/systemd.ts, src/daemon/systemd-unit.ts, src/daemon/service.ts)
  • vincentkoc: Recent main commits cover systemd user-bus, sudo user-systemd restart/install behavior, and gateway restart draining, and the PR timeline says Vincent pushed the current repair commit to this branch. (role: recent systemd/gateway maintainer and PR follow-up owner; confidence: high; commits: 56ca4e2269df, b246c06fa5e5, 2b98cb6d8bfc; files: src/daemon/systemd.ts, src/daemon/systemd.test.ts, src/cli/gateway-cli/run-loop.ts)
  • lml2468: Prior run-loop work introduced and repaired in-process restart behavior, lock release, and startup failure handling around the same restart loop that this PR extends for systemd SIGTERM semantics. (role: adjacent owner for gateway restart loop behavior; confidence: medium; commits: 6740cdf160a2, c79a0dbdb4b4, f84adcbe8826; files: src/cli/gateway-cli/run-loop.ts)

Remaining risk / open question:

  • The PR head is dirty and unmergeable against current main, so conflict resolution may change the touched lifecycle surface before landing.
  • I did not run a live Linux user-systemd smoke in this read-only review; the live proof currently comes from the PR discussion plus source and regression-test inspection.
  • Exact-head broad CI contains failures from the April 28 run and should be rerun after rebase; targeted local tests were reported in the PR body/comments but were not rerun here due the read-only constraint.

Codex review notes: model gpt-5.5, reasoning high; reviewed against a256745323ab.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

clawsweeper Tracked by ClawSweeper automation cli CLI command changes gateway Gateway runtime size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants