
fix(daemon): detect system-scope systemd gateway units on Linux (#87577)#87618

Merged
steipete merged 1 commit into
openclaw:mainfrom
yetval:fix/87577-systemd-system-scope
May 31, 2026


@yetval yetval commented May 28, 2026

Contributor

Summary

Fixes #87577 by teaching the Linux gateway service code to recognize system-scope systemd units in three forms — the canonical user-scope path, the canonical system-scope path (/etc/systemd/system/openclaw-gateway.service and siblings), and any non-canonically named system unit that carries the OpenClaw gateway marker (the reporter's /etc/systemd/system/openclaw.service shape). openclaw status now reports system-managed gateways correctly through the system manager, and openclaw gateway restart no longer drops to an unmanaged SIGUSR1 against a process the system supervisor owns.

Reuses two existing patterns already shipping in main:

  • findSystemGatewayServices in src/daemon/inspect.ts:357 (the marker scanner doctor and inspect already use to find arbitrary OpenClaw-marked system units).
  • The sudo-guidance shape from src/cli/update-cli/restart-helper.ts:98.

Maintainer decision requested (clawsweeper P1)

The non-root stop/restart behavior for a detected system-scope unit is an intentional operator-policy change that clawsweeper flagged as a compatibility risk (root AGENTS.md: removed fallbacks / fail-closed changes / new operator action are merge-sensitive). Requesting explicit maintainer acceptance of this policy:

  • Before: openclaw gateway stop / restart fell through to unmanaged SIGTERM/SIGUSR1 against whatever PID held the gateway port, including a process systemd owns. That signal-an-unmanaged-pid path is the #87577 bug.
  • After (this PR): when a system-scope unit is detected, root runs systemctl <action> <unit> directly; non-root gets actionable sudo systemctl <action> <unit> guidance that names the real installed unit, and no signal is sent.

This matches the operator guidance already shipping for the update-restart path (src/cli/update-cli/restart-helper.ts:98) and AGENTS.md's "one canonical path, no buggy-fallback compat" rule. The old fallback is the bug, not a shipped contract, so it is removed rather than preserved. If a maintainer would instead prefer (a) silent sudo escalation, (b) polkit integration, or (c) keeping the SIGUSR1 fallback for system-scope units behind an explicit flag, say which and I'll revise.

Behavior change

src/daemon/systemd.ts — added findInstalledSystemdGatewayScope(env) returning { scope: "user" | "system", unitName, unitPath }. Three call sites now use it:

  • isSystemdServiceEnabled — falls through to system-scope systemctl is-enabled (no --user) for system units, using installed.unitName so non-canonical names (like openclaw.service) are queried correctly. Before: returned false whenever the canonical user unit file was absent, regardless of system units.
  • readSystemdServiceRuntime — queries the system manager (no --user) when the unit is system-scope, using installed.unitName. Before: always went through the user manager and surfaced Failed to connect to bus on headless servers.
  • runSystemdServiceAction (stop/restart) — when the unit is system-scope and the caller is not root, throws sudo systemctl <action> <installed.unitName> guidance (no silent privilege escalation). When root, runs systemctl <action> directly. Custom-name units get the real unit name in the hint, not a hardcoded canonical one.
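
The root / non-root split for stop and restart can be pictured with a small sketch. This is not the code in src/daemon/systemd.ts: the function name planSystemctlAction, the InstalledUnit type, and passing uid as a parameter are assumptions made for illustration. Only the error wording is taken from the terminal output quoted in this PR.

```typescript
// Hypothetical sketch of the stop/restart decision; not the real runSystemdServiceAction.
type GatewayAction = "stop" | "restart";
type InstalledUnit = {
  scope: "user" | "system";
  unitName: string;
  unitPath: string;
};

// Returns the systemctl argv to run, or throws sudo guidance for non-root callers.
function planSystemctlAction(
  action: GatewayAction,
  installed: InstalledUnit,
  uid: number,
): string[] {
  if (installed.scope === "user") {
    // User-scope units keep going through the user manager.
    return ["systemctl", "--user", action, installed.unitName];
  }
  if (uid !== 0) {
    // Fail closed: no signal is sent and no privilege escalation is attempted.
    throw new Error(
      `${installed.unitName} is a system-scope unit (${installed.unitPath}); ` +
        `run \`sudo systemctl ${action} ${installed.unitName}\` to ${action} it`,
    );
  }
  // Root talks to the system manager directly (no --user).
  return ["systemctl", action, installed.unitName];
}
```

Returning an argv instead of spawning a process keeps the sketch testable; the hint always names the unit that was actually detected, so a custom-name unit never gets a canonical-name suggestion.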

src/cli/daemon-cli/lifecycle.ts: stopGatewayWithoutServiceManager and restartGatewayWithoutServiceManager now route a detected system-scope unit through the canonical stopSystemdService / restartSystemdService helpers instead of the unmanaged SIGUSR1/SIGTERM fallback. This makes the lifecycle fallback and the systemd action path consistent: root runs systemctl <action> <installed.unitName> directly, non-root gets the same sudo-guidance message runSystemdServiceAction already emits (one canonical hint string, no duplicate). Net effect: the unmanaged signal fallback can never reach a systemd-managed gateway process, and a root operator is never told to sudo a command it can already run. (The earlier revision threw unconditionally here, including for root; that inconsistency is removed.)
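
The ordering that matters in the lifecycle fallback is "check for a system-scope unit first, signal only if none is found". A minimal sketch of that ordering, assuming injected dependencies: the names stopGatewayFallback, FallbackDeps, and the three callbacks are hypothetical stand-ins, not the actual helpers in src/cli/daemon-cli/lifecycle.ts.

```typescript
// Hypothetical sketch of the fallback ordering; dependencies are injected stand-ins.
type DetectedUnit = {
  scope: "user" | "system";
  unitName: string;
  unitPath: string;
};

type FallbackDeps = {
  // Stand-in for findInstalledSystemdGatewayScope.
  findInstalledScope: () => DetectedUnit | null;
  // Stand-in for stopSystemdService; expected to throw sudo guidance for non-root callers.
  stopViaSystemd: (unit: DetectedUnit) => void;
  // Stand-in for the unmanaged path that signals gateway listener PIDs.
  signalGatewayPids: (signal: "SIGTERM") => void;
};

function stopGatewayFallback(deps: FallbackDeps): "systemd" | "signal" {
  const installed = deps.findInstalledScope();
  if (installed?.scope === "system") {
    // Delegate to the systemd helper. If it throws, the error propagates
    // and the signal path below is never reached.
    deps.stopViaSystemd(installed);
    return "systemd";
  }
  // No system-scope unit detected: the unmanaged signal fallback still applies.
  deps.signalGatewayPids("SIGTERM");
  return "signal";
}
```

Because the delegate is called before any signal code runs, a thrown guidance error is enough to guarantee the systemd-owned process is left untouched.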

findInstalledSystemdGatewayScope resolution order:

  1. Canonical user-scope unit file ($HOME/.config/systemd/user/openclaw-gateway[-<profile>].service).
  2. Canonical system-scope unit file (/etc/systemd/system/, /usr/lib/systemd/system/, /lib/systemd/system/).
  3. Marker-based fallback for any other name: delegates to findSystemGatewayServices (src/daemon/inspect.ts) which scans the same three system dirs and recognizes any .service file carrying the OPENCLAW_SERVICE_MARKER / OPENCLAW_SERVICE_KIND env markers (or a marker-bearing ExecStart). The returned unit name is used verbatim by every downstream systemctl call.

User-scope behavior is unchanged.
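
The three-step resolution order above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: resolveGatewayScope and the ScopeProbes callbacks are hypothetical names, the filesystem and the marker scanner are injected so the sketch runs without a real systemd host, and the sketch assumes the system-scope canonical file name carries the same optional profile suffix as the user-scope one.

```typescript
// Hypothetical sketch of the resolution order; not the real findInstalledSystemdGatewayScope.
type ResolvedScope = {
  scope: "user" | "system";
  unitName: string;
  unitPath: string;
};

type ScopeProbes = {
  fileExists: (path: string) => boolean;
  // Stand-in for findSystemGatewayServices: marker-owned system units, if any.
  findMarkerUnits: () => Array<{ unitName: string; unitPath: string }>;
};

const SYSTEM_UNIT_DIRS = [
  "/etc/systemd/system",
  "/usr/lib/systemd/system",
  "/lib/systemd/system",
];

function resolveGatewayScope(
  home: string,
  probes: ScopeProbes,
  profile?: string,
): ResolvedScope | null {
  const unitName = profile
    ? `openclaw-gateway-${profile}.service`
    : "openclaw-gateway.service";

  // 1. The canonical user-scope unit wins when it exists.
  const userPath = `${home}/.config/systemd/user/${unitName}`;
  if (probes.fileExists(userPath)) {
    return { scope: "user", unitName, unitPath: userPath };
  }

  // 2. Canonical system-scope unit in any of the three system directories.
  for (const dir of SYSTEM_UNIT_DIRS) {
    const unitPath = `${dir}/${unitName}`;
    if (probes.fileExists(unitPath)) {
      return { scope: "system", unitName, unitPath };
    }
  }

  // 3. Marker fallback: the discovered name is passed through verbatim.
  const [marked] = probes.findMarkerUnits();
  return marked ? { scope: "system", ...marked } : null;
}
```

Taking the first marker-owned unit in step 3 is a simplification; how the real code chooses among several marked units is not stated in this PR.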

Diff size

 src/cli/daemon-cli/lifecycle.test.ts | 104 ++++++++++++++
 src/cli/daemon-cli/lifecycle.ts      |  39 ++++++
 src/cli/daemon-cli/response.ts       |   2 +-
 src/daemon/systemd.test.ts           | 242 +++++++++++++++++++++++++++++++-
 src/daemon/systemd.ts                | 153 +++++++++++++++-----
 5 files changed, 508 insertions(+), 32 deletions(-)

(response.ts change: export the existing createNullWriter helper so the lifecycle fallback can discard the delegated systemctl status line and surface its own message.)

Test plan

  • New unit tests for findInstalledSystemdGatewayScope: user-scope preferred when both exist; canonical /etc/systemd/system/openclaw-gateway.service; canonical /usr/lib/systemd/system/openclaw-gateway.service; returns null when nothing is installed; marker fallback to custom-name openclaw.service; legacy clawdbot.service marker units ignored.
  • isSystemdServiceEnabled against custom-name marker unit.
  • restartSystemdService raises sudo guidance with the marker-owned unit's real name; root path runs systemctl restart <unit> directly.
  • readSystemdServiceRuntime queries the system manager for custom-name marker units.
  • Lifecycle restart/stop tests cover both operator paths for a detected system-scope unit: root delegates to systemctl with no unmanaged signal; non-root surfaces sudo systemctl <action> <installed.unitName> guidance and never signals.
  • Focused acceptance suite (local Linux, Node 22.19): src/daemon/systemd.test.ts (84), src/cli/daemon-cli/lifecycle.test.ts (24), src/daemon/inspect.test.ts, src/cli/daemon-cli/lifecycle-core.test.ts, src/cli/daemon-cli/status.gather.test.ts, src/cli/daemon-cli/response.test.ts, src/commands/doctor-gateway-daemon-flow.test.ts, src/cli/update-cli/restart-helper.test.ts — all green. tsgo -p tsconfig.core.json + core test-types clean; oxlint clean on changed files.
  • CI checks + clawsweeper re-review.

Real behavior proof

Behavior addressed: #87577. On Linux the gateway service code only checked the user-scope systemd unit, so a system-scope unit (including the reporter's custom-named /etc/systemd/system/openclaw.service) was misreported by openclaw status, and openclaw gateway restart/stop dropped to an unmanaged SIGUSR1/SIGTERM against the systemd-owned process. After the patch the code detects the system-scope unit (canonical or marker-named), reports it through the system manager, and routes stop/restart through systemd: root runs systemctl <action>, non-root fails closed with sudo systemctl <action> <unit> guidance and never signals.

Real environment tested: real systemd host — Linux 6.8.0-71-generic, systemd 255 (255.4-1ubuntu8.14), Node v22.19 — against a transient custom-name unit /etc/systemd/system/openclaw.service carrying the reporter's marker env (OPENCLAW_SERVICE_MARKER=openclaw, OPENCLAW_SERVICE_KIND=gateway) and a dedicated non-root User= (uid 1000). The unit was systemctl started (not enabled) and fully removed after the run.
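
For readers unfamiliar with the marker env, a check for the two values named above might look like the sketch below. It is a simplification and an assumption, not the scanner in src/daemon/inspect.ts: hasGatewayMarkers is a hypothetical name, only Environment= lines are inspected (the marker-bearing ExecStart form is ignored), and quoted values that contain spaces are not handled.

```typescript
// Hypothetical marker check over the text of a .service file.
function hasGatewayMarkers(unitText: string): boolean {
  const env = new Map<string, string>();
  for (const rawLine of unitText.split("\n")) {
    const line = rawLine.trim();
    if (!line.startsWith("Environment=")) continue;
    // One Environment= line may carry several space-separated KEY=VALUE pairs.
    for (const token of line.slice("Environment=".length).split(/\s+/)) {
      // Strip one pair of surrounding double quotes, as in Environment="KEY=VALUE".
      const assignment = token.replace(/^"|"$/g, "");
      const eq = assignment.indexOf("=");
      if (eq > 0) env.set(assignment.slice(0, eq), assignment.slice(eq + 1));
    }
  }
  // Both markers must be present with the values used in this PR's test unit.
  return (
    env.get("OPENCLAW_SERVICE_MARKER") === "openclaw" &&
    env.get("OPENCLAW_SERVICE_KIND") === "gateway"
  );
}
```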

Exact steps or command run after this patch: ran the real openclaw CLI straight from the source checkout at PR head e2ec15c, using node scripts/run-node.mjs gateway stop and node scripts/run-node.mjs gateway restart (this builds and runs the patched TypeScript; not a copy, not a stub), as root (uid 0) and as the non-root unit user (uid 1000), against the live custom-name unit. Separately exercised the real exported functions (findInstalledSystemdGatewayScope, isSystemdServiceEnabled, readSystemdServiceRuntime, stopSystemdService, restartSystemdService) from an esbuild bundle of src/daemon/systemd.ts at the same head, wrapped by a verbatim copy of src/cli/daemon-cli/lifecycle.ts:handleSystemScopeSystemdGateway.

Evidence after fix:

Actual patched CLI from the source checkout (node scripts/run-node.mjs gateway <action>):

# root (uid 0)
$ node scripts/run-node.mjs gateway stop
Gateway stopped via system-scope systemd unit openclaw.service.        # unit: active -> inactive

$ node scripts/run-node.mjs gateway restart
# MainPID 179861 -> 179973  (systemd actually restarted the unit via `systemctl restart`)
Timed out after 60s waiting for gateway port 18789 to become healthy.  # expected: stand-in listener is a plain `python -m http.server`, not a real gateway WS endpoint — the restart itself succeeded

# non-root (uid 1000)
$ node scripts/run-node.mjs gateway stop
Gateway stop failed: Error: openclaw.service is a system-scope unit (/etc/systemd/system/openclaw.service); run `sudo systemctl stop openclaw.service` to stop it
# unit stayed active — no signal sent

$ node scripts/run-node.mjs gateway restart
Gateway restart failed: Error: openclaw.service is a system-scope unit (/etc/systemd/system/openclaw.service); run `sudo systemctl restart openclaw.service` to restart it
# MainPID unchanged — no signal sent

Real exported functions from src/daemon/systemd.ts (esbuild bundle of the actual source, same host):

=== findInstalledSystemdGatewayScope ===
{ "scope": "system", "unitName": "openclaw.service", "unitPath": "/etc/systemd/system/openclaw.service" }
=== readSystemdServiceRuntime ===   # queried via the system manager, no --user
{ "status": "running", "state": "active", "subState": "running", "pid": 178463,
  "systemd": { "unit": "openclaw.service", "killMode": "control-group", ... } }

| function-level action | root (uid 0) | non-root (uid 1000) |
| --- | --- | --- |
| restart | { result: "restarted" }; MainPID 178463 -> 178808 | THREW … sudo systemctl restart openclaw.service; MainPID unchanged, no signal |
| stop | { result: "stopped" }; active -> inactive | THREW … sudo systemctl stop openclaw.service; stayed active, no signal |

Observed result after fix: the system-scope custom-name unit is detected via the marker fallback ({ scope: "system", unitName: "openclaw.service" }), status is read through the system manager (no --user), and stop/restart route through systemd — root drives a real systemctl action (stop → inactive, restart → new MainPID), while non-root fails closed with the real-unit sudo systemctl <action> openclaw.service hint and never signals the systemd-owned process. The only non-patch noise is the post-restart health check timing out because the stand-in listener is a plain HTTP server rather than a real gateway. User-scope and canonical-name deployments are unchanged (existing tests stay green).
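
Reading status "through the system manager (no --user)" comes down to which argv is built and how the KEY=VALUE output of systemctl show is parsed. The sketch below illustrates that under assumptions: buildShowArgs and parseShowOutput are hypothetical names, and the property list (ActiveState, SubState, MainPID) is chosen for illustration rather than taken from readSystemdServiceRuntime.

```typescript
// Hypothetical sketch of a scope-aware runtime query.
type UnitScope = "user" | "system";

type ParsedRuntime = {
  state: string | undefined;
  subState: string | undefined;
  pid: number | undefined;
};

// Build the argv; --user is added only for user-scope units.
function buildShowArgs(scope: UnitScope, unitName: string): string[] {
  const scopeFlag = scope === "user" ? ["--user"] : [];
  return [
    "systemctl",
    ...scopeFlag,
    "show",
    unitName,
    "--property=ActiveState,SubState,MainPID",
  ];
}

// Parse the KEY=VALUE lines printed by `systemctl show`.
function parseShowOutput(output: string): ParsedRuntime {
  const props = new Map<string, string>();
  for (const line of output.split("\n")) {
    const eq = line.indexOf("=");
    if (eq > 0) props.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  const pid = Number(props.get("MainPID"));
  return {
    state: props.get("ActiveState"),
    subState: props.get("SubState"),
    // systemd reports MainPID=0 when the unit has no main process.
    pid: Number.isInteger(pid) && pid > 0 ? pid : undefined,
  };
}
```

Keeping argv construction separate from parsing means the user-scope and system-scope paths share one parser and differ only in the scope flag.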

What was not tested: a fully packaged openclaw release binary (as opposed to the source checkout used above). The published 2026.5.26 tarball ships a broken npm-shrinkwrap.json (omits json5, tslog, …) so it cannot materialize working node_modules/; the source-checkout CLI run above reaches the identical changed src/cli/daemon-cli/lifecycle.ts path. Happy to rerun against a maintainer-built tarball.

Refs

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime cli CLI command changes size: M triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Contributor

Codex review: needs maintainer review before merge. Reviewed May 31, 2026, 1:51 PM ET / 17:51 UTC.

Summary
The PR adds Linux systemd gateway scope detection for user units, canonical system units, and marker-owned custom system units, then routes status, stop, and restart through the detected unit with regression tests.

PR surface: Source +144, Tests +332. Total +476 across 5 files.

Reproducibility: yes. source-reproducible: the linked issue describes a system-level systemd unit, and current main only checks the user unit path and uses systemctl --user before falling back to SIGTERM/SIGUSR1. I did not run a live systemd repro in this read-only review.

Review metrics: 1 noteworthy metric.

  • System-scope lifecycle policy: 2 fallback actions changed. Stop and restart now avoid unmanaged signals for detected system units, so maintainers need to notice and accept the operator behavior before merge.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P1] Have a maintainer explicitly accept or revise the non-root system-scope fail-closed behavior before merge.

Risk before merge

  • [P2] Non-root stop/restart for detected system-scope units no longer sends SIGTERM/SIGUSR1 and instead requires sudo systemctl guidance, so existing automation that relied on the fallback may need operator action after upgrade.
  • [P2] The marker-based fallback is intentionally broad enough to pick a custom OpenClaw-marked system unit when no canonical unit exists; maintainers should accept that as the supported discovery rule for custom system units before merge.

Maintainer options:

  1. Accept fail-closed systemd control (recommended)
    Record maintainer acceptance that non-root system-scope stop/restart should emit sudo guidance rather than signaling a supervisor-owned PID.
  2. Revise to a selected operator policy
    If maintainers prefer silent sudo escalation, polkit, or an opt-in signal fallback, require that policy to be implemented with matching root and non-root tests before merge.
  3. Defer to broader system-scope design
    If custom system-unit ownership should be solved in a larger service-scope design, keep this PR unmerged and route the linked issue to that canonical item.

Next step before merge

  • [P2] The remaining action is maintainer acceptance of the compatibility-sensitive operator-policy change, not a narrow automation repair.

Security
Cleared: The diff changes local service-manager detection/control and tests, with no new dependencies, credentials handling, workflow permissions, or downloaded code execution paths; no concrete security or supply-chain concern found.

Review details

Best possible solution:

Land this PR after a maintainer explicitly accepts the fail-closed non-root system-scope policy, then close the linked bug and supersede narrower open attempts that solve only part of the same systemd-scope problem.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: the linked issue describes a system-level systemd unit, and current main only checks the user unit path and uses systemctl --user before falling back to SIGTERM/SIGUSR1. I did not run a live systemd repro in this read-only review.

Is this the best way to solve the issue?

Yes, this looks like the best bounded fix location: systemd ownership belongs in the daemon service adapter and lifecycle fallback, and the PR reuses the existing marker scanner instead of adding a second detector. The unresolved part is the maintainer policy choice for non-root system-scope control.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 058152cf6990.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR includes redacted terminal output from a real Linux/systemd host running the source-checkout CLI against a custom /etc/systemd/system/openclaw.service, and the touched files match the latest head after the maintainer rebase.

Label justifications:

  • P2: This is a normal-priority Linux gateway lifecycle bug fix with limited platform blast radius but real operator impact.
  • merge-risk: 🚨 compatibility: Merging changes existing non-root system-managed gateway stop/restart behavior from unmanaged signals to fail-closed sudo guidance.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR includes redacted terminal output from a real Linux/systemd host running the source-checkout CLI against a custom /etc/systemd/system/openclaw.service, and the touched files match the latest head after the maintainer rebase.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR includes redacted terminal output from a real Linux/systemd host running the source-checkout CLI against a custom /etc/systemd/system/openclaw.service, and the touched files match the latest head after the maintainer rebase.
Evidence reviewed

PR surface:

Source +144, Tests +332. Total +476 across 5 files.

View PR surface stats
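| Area | Files | Added | Removed | Net |
| --- | --- | --- | --- | --- |
| Source | 3 | 169 | 25 | +144 |
| Tests | 2 | 339 | 7 | +332 |
| Docs | 0 | 0 | 0 | 0 |
| Config | 0 | 0 | 0 | 0 |
| Generated | 0 | 0 | 0 | 0 |
| Other | 0 | 0 | 0 | 0 |
| Total | 5 | 508 | 32 | +476 |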

What I checked:

  • Repository policy read: Root AGENTS.md was read fully; its ClawSweeper review policy treats removed fallbacks, fail-closed behavior, and new operator action as compatibility-sensitive merge risk. (AGENTS.md:31)
  • Current main still has the reported user-scope-only behavior: At current main, systemd stop/restart always asserts the user manager and calls systemctl --user for the resolved user unit, while is-enabled only checks the user unit path before running systemctl --user. (src/daemon/systemd.ts:1046, 058152cf6990)
  • Current main unmanaged fallback can signal a supervised process: At current main, the fallback stop path sends SIGTERM to verified gateway listener PIDs and the fallback restart path sends SIGUSR1, which matches the linked issue's systemd-owned-process symptom. (src/cli/daemon-cli/lifecycle.ts:115, 058152cf6990)
  • PR source implements system/user scope detection: The PR adds findInstalledSystemdGatewayScope, preferring user units, then canonical system unit paths, then marker-owned custom system units discovered through findSystemGatewayServices. (src/daemon/systemd.ts:81, b5e81e1abfd3)
  • PR source routes system units through the system manager: For system-scope units, the PR uses systemctl without --user for root callers and throws sudo systemctl guidance for non-root callers; is-enabled and runtime show also use the detected unit name and manager scope. (src/daemon/systemd.ts:1136, b5e81e1abfd3)
  • PR source blocks unmanaged lifecycle signals for detected system units: The lifecycle fallback now checks for a system-scope systemd gateway before unmanaged PID signaling, delegating stop/restart to the canonical systemd helpers instead. (src/cli/daemon-cli/lifecycle.ts:121, b5e81e1abfd3)

Likely related people:

  • steipete: Committed the current PR head and previously refactored systemd service action flow and daemon service start state flow in the affected daemon lifecycle area. (role: recent area contributor and PR-head committer; confidence: high; commits: b5e81e1abfd3, 9fd810e3a6da, 258a214bcbff; files: src/daemon/systemd.ts, src/daemon/service.ts, src/cli/daemon-cli/lifecycle-core.ts)
  • vincentkoc: Recent history shows work on daemon stop fallback port resolution and the unmanaged gateway stop/restart listener path this PR changes. (role: recent lifecycle contributor; confidence: high; commits: 604a5e07d0f7, bf9c362129e2; files: src/cli/daemon-cli/lifecycle.ts)
  • gregretkowski: Prior systemd duplicate gateway detection work touched the inspect marker-scanning path that this PR reuses for custom system units. (role: adjacent systemd detection contributor; confidence: medium; commits: 14430ade573a; files: src/daemon/inspect.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 28, 2026
@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal backlog priority with limited blast radius. merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. labels May 28, 2026
@clawsweeper

clawsweeper Bot commented May 28, 2026

Contributor

ClawSweeper PR egg: ✨ hatched 🥚 common Neon Review Wisp. Rarity: 🥚 common. Trait: sparkles near resolved comments.

Details

Hatchability:

  • Merged PRs are hatchable.
  • Open PRs are hatchable when they are status: 👀 ready for maintainer look, status: 🚀 automerge armed, or labeled clawsweeper:automerge.
  • Closed unmerged PRs are hatchable only when one of those hatchable labels is still present in the durable record.

About:

  • Eggs appear after real-behavior proof passes. They are collectible flavor only.
  • Review momentum changes the shell state: follow-up work warms it, re-review makes it wobble, and a clean final review lets it hatch.
  • The hatch is seeded from this repository and PR number, so the same PR keeps the same creature; the reviewed head SHA can only change safe visual details.
  • Rarity is just collectible sparkle: 🥚 common, 🌱 uncommon, 💎 rare, ✨ glimmer, and 🌈 legendary.

yetval added a commit to yetval/openclaw that referenced this pull request May 28, 2026
…nclaw#87577)

Extends findInstalledSystemdGatewayScope to fall back to marker-based
discovery via findSystemGatewayServices when the canonical unit file is
absent, so operators who install the gateway under a non-canonical name
(e.g. /etc/systemd/system/openclaw.service in the linked reproducer) get
the same system-scope routing.

InstalledSystemdGatewayScope now carries the discovered unitName so
is-enabled / show / restart / stop all target the actual installed unit
instead of the canonical name.

Addresses clawsweeper review on openclaw#87618.
@yetval

yetval commented May 28, 2026

Contributor Author

@clawsweeper re-review

Addressed:

  • Custom-name system unit shape now detected via marker fallback (findSystemGatewayServices from src/daemon/inspect.ts); InstalledSystemdGatewayScope carries the discovered unitName so is-enabled / show / restart / stop target the real installed unit.
  • New tests cover the linked openclaw.service shape end-to-end.
  • Live droplet proof rerun against /etc/systemd/system/openclaw.service (matches issue reproducer): baseline still SIGUSR1s the systemd-managed PID, post-fix detects { scope: 'system', unitName: 'openclaw.service', unitPath: '/etc/systemd/system/openclaw.service' }, refuses unmanaged signal, surfaces correct sudo hint. Full artifacts in PR body.

@clawsweeper

clawsweeper Bot commented May 28, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@yetval

yetval commented May 28, 2026

Contributor Author

Note: check-guards failed with V8 "JavaScript heap out of memory" inside tsgo -p tsconfig.plugin-sdk.dts.json --declaration true — unrelated to this PR's surface (no plugin-sdk dts changes; diff is daemon/systemd + cli/daemon-cli/lifecycle only). Looks like a CI runner heap flake. As a contributor I can't gh run rerun; a maintainer rerun should clear it.

@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels May 28, 2026
@yetval

yetval commented May 28, 2026

Contributor Author

Rank-up actions:

[P2] check-guards failure inspection: the failing job ran node scripts/run-tsgo.mjs -p tsconfig.plugin-sdk.dts.json --declaration true and hit FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory (log). tsconfig.plugin-sdk.dts.json only includes src/plugin-sdk/**/*.ts, packages/memory-host-sdk/src/**/*.ts, src/video-generation/{dashscope-compatible,types}.ts, and src/types/**/*.d.ts. This PR touches only src/daemon/systemd.{ts,test.ts} and src/cli/daemon-cli/lifecycle.{ts,test.ts} — none of which are in the plugin-sdk dts graph. The OOM is a CI runner heap flake in the typecheck infra, not a regression from this diff. A maintainer gh run rerun --failed 26574771172 should clear it; I do not have write access.

Non-root sudo-guidance behavior (maintainer acceptance request): this PR intentionally changes Linux system-scope stop/restart from "fall through to unmanaged SIGUSR1/SIGTERM against whatever pid is on the gateway port" to "throw sudo systemctl <action> <installed.unitName> guidance for non-root callers and run systemctl directly when the caller is root". No silent privilege escalation, no polkit integration. If a maintainer would prefer (a) silent sudo escalation, (b) polkit, or (c) preserving the SIGUSR1 fallback for system-scope units, I'll revise; happy to land whichever shape the maintainer wants. The default chosen here matches the operator-facing guidance already shipping for the update-restart path (src/cli/update-cli/restart-helper.ts:98).

yetval added a commit to yetval/openclaw that referenced this pull request May 29, 2026
…nclaw#87577)

@yetval yetval force-pushed the fix/87577-systemd-system-scope branch from 57901d3 to 8d43ff3 Compare May 29, 2026 02:28
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 29, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. labels May 29, 2026
@yetval yetval force-pushed the fix/87577-systemd-system-scope branch from 8d43ff3 to 7bd4c57 Compare May 29, 2026 04:42
@yetval yetval force-pushed the fix/87577-systemd-system-scope branch from db286a6 to 5555f08 Compare May 31, 2026 13:28
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 31, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 31, 2026
@yetval yetval force-pushed the fix/87577-systemd-system-scope branch from 5555f08 to e2ec15c Compare May 31, 2026 14:52
@openclaw-barnacle openclaw-barnacle Bot added size: L and removed size: M proof: sufficient ClawSweeper judged the real behavior proof convincing. labels May 31, 2026
@yetval

yetval commented May 31, 2026

Contributor Author

@clawsweeper re-review

Addressed the P1 (gate the non-root system-scope fail-closed policy at src/cli/daemon-cli/lifecycle.ts):

  • Made the two code paths consistent. The unmanaged lifecycle fallback (stopGatewayWithoutServiceManager / restartGatewayWithoutServiceManager) now delegates a detected system-scope unit to the canonical stopSystemdService / restartSystemdService path instead of throwing unconditionally. Result: root runs systemctl <action> <unit> directly; non-root gets the single canonical sudo systemctl <action> <unit> guidance (one hint string, no duplicate). A root operator is no longer told to sudo a command it can already run — the previous revision threw for everyone, including root.
  • Policy made explicit for maintainer acceptance. The non-root fail-closed behavior is intentional and now called out under "Maintainer decision requested" in the PR body, with before/after and three alternative shapes (silent sudo escalation, polkit, or a flagged SIGUSR1 fallback) if a maintainer prefers one. The removed fallback is the #87577 bug, not a shipped contract, which is why it's deleted rather than preserved (per root AGENTS.md).
  • Hardened lifecycle tests: both operator paths now covered for stop and restart — root delegates to systemctl with no unmanaged signal; non-root surfaces sudo systemctl <action> <installed.unitName> guidance and never signals.
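
For readers skimming the thread, the root / non-root policy described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in src/cli/daemon-cli/lifecycle.ts: the names `planSystemScopeAction` and `ActionPlan` are hypothetical, and the hint text is a simplified version of the message quoted later in this thread.

```typescript
// Hypothetical sketch of the policy only; names and message wording are
// assumptions for illustration, not the PR's actual implementation.
type SystemdAction = "stop" | "restart";

type ActionPlan =
  | { kind: "run"; argv: string[] }
  | { kind: "fail-closed"; message: string };

// Decide what to do for a detected system-scope unit.
// Neither branch ever signals the gateway PID directly.
function planSystemScopeAction(
  action: SystemdAction,
  unitName: string,
  isRoot: boolean,
): ActionPlan {
  if (isRoot) {
    // Root talks to the system manager directly (no --user, no sudo hint).
    return { kind: "run", argv: ["systemctl", action, unitName] };
  }
  // Non-root fails closed with a single hint that names the real unit.
  return {
    kind: "fail-closed",
    message: `${unitName} is a system-scope unit; run \`sudo systemctl ${action} ${unitName}\` to ${action} it`,
  };
}

// Example: what a non-root caller would be told for a restart.
console.log(planSystemScopeAction("restart", "openclaw.service", false));
```

Returning a plan instead of executing it keeps the decision testable without a systemd host; a real caller would presumably derive `isRoot` from the effective UID and then either spawn the argv or throw the message.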

Local proof (Linux, Node 22.19): src/daemon/systemd.test.ts (84) and src/cli/daemon-cli/lifecycle.test.ts (24) plus inspect / lifecycle-core / status.gather / response / doctor-gateway-daemon-flow / restart-helper suites all green; tsgo -p tsconfig.core.json and the core test-types lane both clean; oxlint clean on the changed files. Squashed to one commit (e2ec15c).

Maintainer-owned items unchanged: explicit acceptance of the non-root fail-closed policy, plus the packaged-CLI proof gap and the check-guards heap-OOM flake (both noted in the body; I can't gh run rerun as a contributor).

@clawsweeper

clawsweeper Bot commented May 31, 2026

Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish (Thin PR readiness signal; proof, validation, or implementation needs work.), status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.) and removed rating: 🦐 gold shrimp (Decent PR readiness signal, but merge confidence is limited.), status: ⏳ waiting on author (ClawSweeper has contributor-facing work open and is waiting for author action.) labels May 31, 2026
@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.) and removed proof: supplied (External PR includes structured after-fix real behavior proof.) labels May 31, 2026
@yetval

yetval commented May 31, 2026

Contributor Author

@clawsweeper re-review

Added the real-source terminal proof the last review asked for. The previous driver used copied helpers; this run imports the actual patched source.

On a real systemd host (Linux 6.8.0-71-generic, systemd 255, Node v22.19), against a transient custom-name unit /etc/systemd/system/openclaw.service (reporter's marker shape), I exercised the real exported functions from an esbuild bundle of src/daemon/systemd.ts at PR head e2ec15c — not a hand-copy — wrapped only by a verbatim copy of the 24-line src/cli/daemon-cli/lifecycle.ts:handleSystemScopeSystemdGateway (the one non-exported piece), which calls those real bundled functions underneath:

  • restart, root: { result: "restarted" }; MainPID 178463 → 178808 (systemd actually restarted it via systemctl restart).
  • restart, non-root: THREW openclaw.service is a system-scope unit (…); run `sudo systemctl restart openclaw.service` to restart it; MainPID unchanged, no signal.
  • stop, root: { result: "stopped" }; Active active → inactive.
  • stop, non-root: THREW … run `sudo systemctl stop openclaw.service` to stop it; stayed active, no signal.
  • detection/status: findInstalledSystemdGatewayScope → { scope: "system", unitName: "openclaw.service", … }; readSystemdServiceRuntime reports active/running via the system manager (no --user).
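
To make the "via the system manager (no --user)" point concrete, here is a small sketch of how a detected scope could map onto a systemctl invocation. The `InstalledUnit` shape mirrors only the two fields shown in the detection output above, and `buildIsActiveArgv` is a hypothetical helper, not a function from src/daemon/systemd.ts.

```typescript
// Hypothetical sketch; the interface and helper name are assumptions.
type SystemdScope = "user" | "system";

interface InstalledUnit {
  scope: SystemdScope;
  unitName: string;
}

// Build the argv for an is-active query against the manager that owns the unit.
// User-scope units need --user; system-scope units must not pass it.
function buildIsActiveArgv(unit: InstalledUnit): string[] {
  const scopeFlags = unit.scope === "user" ? ["--user"] : [];
  return ["systemctl", ...scopeFlags, "is-active", unit.unitName];
}

// Example with the reporter's custom-name system unit.
console.log(buildIsActiveArgv({ scope: "system", unitName: "openclaw.service" }).join(" "));
```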

Full redacted terminal output is in the PR body under "Live re-review proof — lifecycle delegation at PR head e2ec15c". The unit was systemctl started (not enabled) and removed after the run.

Still maintainer-owned (no author repair): explicit acceptance of the non-root fail-closed policy, and the failed outbound-actions check rerun + packaged-CLI integration gap (I can't gh run rerun as a contributor).

@clawsweeper

clawsweeper Bot commented May 31, 2026

Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient (ClawSweeper judged the real behavior proof convincing.), rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.) and removed rating: 🦪 silver shellfish (Thin PR readiness signal; proof, validation, or implementation needs work.), status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.) labels May 31, 2026
@yetval

yetval commented May 31, 2026

Contributor Author

@clawsweeper re-review

The last review's remaining proof note was: "exercises actual patched systemd source functions and a copied non-exported lifecycle helper, not a packaged openclaw CLI binary" (option 3: a source-checkout CLI run that reaches the changed lifecycle path on a real systemd host).

That's now in the PR body under "Actual patched CLI proof" — I ran the real openclaw CLI straight from the source checkout at e2ec15c (node scripts/run-node.mjs gateway <action>, which builds + runs the patched TS, no copy, no stub) against a transient custom-name /etc/systemd/system/openclaw.service on a real systemd host (Linux 6.8.0-71-generic, systemd 255, Node v22.19). This reaches the genuine src/cli/daemon-cli/lifecycle.ts delegation end-to-end:

  • root gateway stop → Gateway stopped via system-scope systemd unit openclaw.service.; unit active → inactive.
  • root gateway restart → MainPID 179861 → 179973 (systemd restarted it via systemctl restart). (The 60s health-check timeout afterward is only because the stand-in listener is a plain python -m http.server, not a real gateway WS endpoint — the restart itself succeeded.)
  • non-root gateway stop → Gateway stop failed: … run `sudo systemctl stop openclaw.service` to stop it; unit stayed active, no signal.
  • non-root gateway restart → Gateway restart failed: … run `sudo systemctl restart openclaw.service` to restart it; MainPID unchanged, no signal.
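
The "MainPID changed" and "MainPID unchanged" observations above can be checked mechanically. The sketch below assumes the before/after values were captured with `systemctl show --property=MainPID <unit>`, which prints a line such as `MainPID=179861`; the helper names `parseMainPid` and `wasRestarted` are hypothetical and not part of the PR.

```typescript
// Hypothetical helpers for comparing `systemctl show --property=MainPID` output.
function parseMainPid(showOutput: string): number | null {
  const match = /^MainPID=(\d+)$/m.exec(showOutput.trim());
  if (!match) return null;
  const pid = Number(match[1]);
  // systemd reports MainPID=0 when the unit has no main process.
  return pid > 0 ? pid : null;
}

// A restart is only considered proven when both samples hold a live PID
// and the PID actually changed.
function wasRestarted(before: string, after: string): boolean {
  const a = parseMainPid(before);
  const b = parseMainPid(after);
  return a !== null && b !== null && a !== b;
}

// Example using the PIDs reported for the root restart above.
console.log(wasRestarted("MainPID=179861\n", "MainPID=179973\n"));
```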

So the changed lifecycle path is proven via the actual source-checkout CLI, not just imported functions. Please re-evaluate / clear the lingering triage: needs-real-behavior-proof label.

Still maintainer-owned: explicit acceptance of the non-root fail-closed policy, and the failed outbound-actions check rerun (no contributor gh run rerun).

@clawsweeper

clawsweeper Bot commented May 31, 2026

Contributor

🦞👀
ClawSweeper picked this up.

Command router queued. I will update this comment with the next step.

Re-review progress:

…claw#87577)

Teach the Linux gateway daemon to recognize system-scope systemd units in
addition to user-scope: the canonical user path, the canonical system path,
and any non-canonically named system unit carrying the OpenClaw gateway
marker (the reporter's /etc/systemd/system/openclaw.service shape). status,
is-enabled, restart, and stop now route through the detected scope and unit
name, querying the system manager (no --user) for system units.

For a detected system-scope unit, root callers run systemctl <action>
directly via the canonical service action; non-root callers fail closed with
sudo systemctl guidance naming the real unit instead of signaling a
supervisor-owned process. The unmanaged lifecycle fallback now delegates
system-scope units to that same canonical path (root -> systemctl, non-root
-> sudo guidance) rather than throwing unconditionally, so both code paths
share one policy and one hint string and a root operator is never told to
sudo a command it can already run.

Adds regression coverage for detection, routing, and both root/non-root
operator paths in the lifecycle fallback.
@steipete

Contributor

Land-ready verification for b5e81e1abfd3ceac835ef2ad395dbcf019db4cf8:

  • Rebased the PR onto current origin/main and force-pushed the maintainer-editable contributor branch.
  • Local proof: node scripts/run-vitest.mjs src/daemon/systemd.test.ts src/cli/daemon-cli/lifecycle.test.ts src/daemon/inspect.test.ts src/cli/daemon-cli/lifecycle-core.test.ts src/cli/daemon-cli/status.gather.test.ts src/cli/daemon-cli/response.test.ts src/commands/doctor-gateway-daemon-flow.test.ts src/cli/update-cli/restart-helper.test.ts src/infra/outbound/message-action-runner.core-send.test.ts passed.
  • Crabbox proof: AWS lease cbx_69f97dff5e5c, run run_a68431b3dad6, checked out exact pushed SHA, passed the same focused Vitest shards, then created a real /etc/systemd/system/openclaw.service and verified status detection, non-root sudo guidance, root restart, and root stop.
  • GitHub checks are green and merge state is clean.


Labels

  • cli (CLI command changes)
  • gateway (Gateway runtime)
  • merge-risk: 🚨 compatibility 🚨 (May break existing users, config, migrations, defaults, or upgrade paths.)
  • P2 (Normal backlog priority with limited blast radius.)
  • proof: sufficient (ClawSweeper judged the real behavior proof convincing.)
  • proof: supplied (External PR includes structured after-fix real behavior proof.)
  • rating: 🐚 platinum hermit (Good normal PR readiness with ordinary maintainer review expected.)
  • size: L
  • status: 👀 ready for maintainer look (ClawSweeper has no concrete contributor-facing blocker left for this PR.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: openclaw gateway restart and openclaw status do not detect system-level systemd service

3 participants