Skip to content

fix(daemon): correct systemd availability checks for degraded user sessions#34551

Closed
jnMetaCode wants to merge 3 commits intoopenclaw:mainfrom
jnMetaCode:fix/systemd-degraded-fallthrough
Closed

fix(daemon): correct systemd availability checks for degraded user sessions#34551
jnMetaCode wants to merge 3 commits intoopenclaw:mainfrom
jnMetaCode:fix/systemd-degraded-fallthrough

Conversation

@jnMetaCode
Copy link
Contributor

Summary

isSystemdUserServiceAvailable() and assertSystemdAvailable() both incorrectly treat a degraded systemd user session as unavailable.

systemctl --user status returns a non-zero exit code when any user unit has failed, which is common on real Linux systems. The current code checks for specific unavailability indicators ("not found", "failed to connect", etc.) and correctly returns false for those, but the fall-through also returns false / throws, even though none of the unavailability patterns matched — meaning systemd IS available.

Compare with isSystemctlAvailable() (line 410) in the same file, which handles this correctly:

return !isSystemctlMissing(readSystemctlDetail(res));

Root cause

  • isSystemdUserServiceAvailable() line 202: fall-through returns false instead of true
  • assertSystemdAvailable() line 214: throws instead of returning normally

Impact

On Linux systems with any failed systemd user units:

  • Onboarding wizard silently skips daemon installation
  • openclaw gateway install/stop/restart throws "systemctl --user unavailable"
  • Doctor reports systemd as unavailable

Related to #33633, #34237, #34464.

Note: This fixes different functions from the existing is-enabled PRs — those address isSystemdServiceEnabled() / readSystemctlDetail(), while this PR fixes isSystemdUserServiceAvailable() and assertSystemdAvailable().

Tests

  • Added: degraded systemd returns true from isSystemdUserServiceAvailable
  • Added: stopSystemdService succeeds when systemd is degraded
  • All 21 systemd tests pass.

@openclaw-barnacle openclaw-barnacle bot added gateway Gateway runtime size: XS labels Mar 4, 2026
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c5a735897

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +216 to +217
// Non-zero exit without missing indicators means systemd is available
// but has degraded/failed units — acceptable for service operations.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve unavailability errors in assertSystemdAvailable

assertSystemdAvailable() now returns on any non-zero systemctl --user status result unless isSystemctlMissing() matches, so errors like Failed to connect to bus are treated as available. That change is user-visible because callers proceed as if systemd is usable; for example, uninstallSystemdService does not check the disable --now result and can report success after deleting the unit file even though systemd never processed the disable/stop request. Please keep throwing for non-degraded unavailability cases instead of falling through for all non-missing failures.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — assertSystemdAvailable() now checks for bus-unavailability patterns (failed to connect, not been booted, no such file or directory, not supported) and throws with a descriptive message. See latest commit.

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR fixes isSystemdUserServiceAvailable() by correctly returning true for degraded systemd sessions (non-zero exit with no unavailability indicators), addressing silent onboarding failures and false "systemctl --user unavailable" errors.

Verified finding:
The fix to assertSystemdAvailable() correctly stops throwing on degraded sessions, but introduces a regression: it now only checks isSystemctlMissing(), which catches binary-absence errors but not bus-unavailability patterns like "failed to connect to bus", "not been booted with systemd", and "not supported". These patterns are explicitly checked in isSystemdUserServiceAvailable() (lines 190, 193, 199) but missing from assertSystemdAvailable().

When the systemd user bus is unreachable, assertSystemdAvailable() silently returns, allowing downstream systemctl --user stop/restart calls to fail with opaque errors instead of the early helpful error message. A similar check is missing in tests—there's coverage for degraded sessions but not for bus-unavailable scenarios in the service control tests.

Impact: The isSystemdUserServiceAvailable() fix is safe and correct. The assertSystemdAvailable() fix solves the degraded-session issue but requires additional checks for bus-unavailability patterns before merging.

Confidence Score: 2/5

  • The PR safely fixes degraded-session handling in isSystemdUserServiceAvailable(), but assertSystemdAvailable() under-guards against bus-unavailability, risking opaque failures downstream.
  • The fix to isSystemdUserServiceAvailable() is correct and verified by tests. However, assertSystemdAvailable() removes the catch-all throw without adding explicit checks for bus-unavailability patterns. This creates a regression where systems with unreachable D-Bus (remote SSH, etc.) will silently pass the assertion and fail later with confusing errors. The suggested fix (reusing unavailability checks from isSystemdUserServiceAvailable()) is straightforward and low-risk, but it must be implemented before merge.
  • src/daemon/systemd.ts — the assertSystemdAvailable() function (lines 207–218) needs to add explicit checks for bus-unavailability patterns before the final implicit return.

Last reviewed commit: 9c5a735

Comment on lines 207 to 218
@@ -211,7 +213,8 @@ async function assertSystemdAvailable() {
if (isSystemctlMissing(detail)) {
throw new Error("systemctl not available; systemd user services are required on Linux.");
}
throw new Error(`systemctl --user unavailable: ${detail || "unknown error"}`.trim());
// Non-zero exit without missing indicators means systemd is available
// but has degraded/failed units — acceptable for service operations.
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertSystemdAvailable() does not catch bus-unavailability errors after this fix.

Compare with isSystemdUserServiceAvailable() (lines 178–205): it explicitly checks for unavailability patterns like "failed to connect" (line 190), "not been booted" (line 193), and "not supported" (line 199), and returns false if any match.

The current implementation only checks isSystemctlMissing(), which catches binary-absence errors (lines 149–160) but NOT bus-unavailability. This means if a system's D-Bus is unreachable (e.g., remote SSH without lingering D-Bus socket), assertSystemdAvailable() silently returns, and subsequent systemctl --user stop/restart calls fail downstream with opaque errors instead of the early, helpful "systemctl not available" message.

Suggested fix: Apply the same unavailability checks:

async function assertSystemdAvailable() {
  const res = await execSystemctl(["--user", "status"]);
  if (res.code === 0) {
    return;
  }
  const detail = readSystemctlDetail(res);
  const detailLower = `${res.stderr} ${res.stdout}`.toLowerCase();
  if (
    isSystemctlMissing(detail) ||
    detailLower.includes("failed to connect") ||
    detailLower.includes("not been booted") ||
    detailLower.includes("not supported")
  ) {
    throw new Error(`systemctl --user unavailable: ${detail || "unknown error"}`.trim());
  }
  // Non-zero exit with none of the above patterns → degraded session, still usable.
}

Also add a regression test (e.g., in the "systemd service control" suite) that verifies stopSystemdService throws when the bus is unavailable.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/systemd.ts
Line: 207-218

Comment:
`assertSystemdAvailable()` does not catch bus-unavailability errors after this fix. 

Compare with `isSystemdUserServiceAvailable()` (lines 178–205): it explicitly checks for unavailability patterns like `"failed to connect"` (line 190), `"not been booted"` (line 193), and `"not supported"` (line 199), and returns `false` if any match.

The current implementation only checks `isSystemctlMissing()`, which catches binary-absence errors (lines 149–160) but NOT bus-unavailability. This means if a system's D-Bus is unreachable (e.g., remote SSH without lingering D-Bus socket), `assertSystemdAvailable()` silently returns, and subsequent `systemctl --user stop/restart` calls fail downstream with opaque errors instead of the early, helpful `"systemctl not available"` message.

**Suggested fix**: Apply the same unavailability checks:

```ts
async function assertSystemdAvailable() {
  const res = await execSystemctl(["--user", "status"]);
  if (res.code === 0) {
    return;
  }
  const detail = readSystemctlDetail(res);
  const detailLower = `${res.stderr} ${res.stdout}`.toLowerCase();
  if (
    isSystemctlMissing(detail) ||
    detailLower.includes("failed to connect") ||
    detailLower.includes("not been booted") ||
    detailLower.includes("not supported")
  ) {
    throw new Error(`systemctl --user unavailable: ${detail || "unknown error"}`.trim());
  }
  // Non-zero exit with none of the above patterns → degraded session, still usable.
}
```

Also add a regression test (e.g., in the "systemd service control" suite) that verifies `stopSystemdService` throws when the bus is unavailable.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — assertSystemdAvailable() now mirrors the unavailability checks from isSystemdUserServiceAvailable(). Both functions check the same set of patterns.

@jnMetaCode
Copy link
Contributor Author

Thanks for the review! Good catch — I've added bus-unavailability checks (failed to connect, not been booted, no such file or directory, not supported) to assertSystemdAvailable() so it now correctly throws for those cases while still allowing degraded-but-available sessions through. Added a test covering the bus-unavailable path as well.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ec1023f60

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

return false;
// Non-zero exit without any unavailability indicator means systemd is
// present but has degraded/failed units — still available for our purposes.
return true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat ENOENT/EACCES as unavailable in user service check

isSystemdUserServiceAvailable() now returns true for any non-zero systemctl --user status result that does not match the hardcoded substrings, but execFileUtf8 reports missing/blocked binaries as messages like spawn systemctl ENOENT / spawn systemctl EACCES (see isSystemctlMissing), which do not match this list. In those environments, availability probes used by onboarding/doctor will incorrectly report systemd as available and proceed down the wrong flow. Reuse isSystemctlMissing (or equivalent checks) here before falling through to true.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — added isSystemctlMissing(detail) check before the return true fallthrough in isSystemdUserServiceAvailable(), so ENOENT/EACCES now correctly returns false. Added a test covering the ENOENT case.

@jnMetaCode jnMetaCode force-pushed the fix/systemd-degraded-fallthrough branch from 2ec1023 to 518ee79 Compare March 5, 2026 19:53
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 518ee79cf7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +278 to +280
// Non-zero exit without any unavailability indicator means systemd is
// present but has degraded/failed units — still available for our purposes.
return true;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Treat spawn ENOENT/EACCES as unavailable

isSystemdUserServiceAvailable() now falls through to true for any non-zero systemctl --user status result that does not match the hardcoded substrings, but execFileUtf8 reports missing or blocked binaries as messages like spawn systemctl ENOENT/spawn systemctl EACCES (see isSystemctlMissing in the same file). On hosts where systemctl is absent or not executable, onboarding/doctor paths that rely on this probe will incorrectly report systemd as available and proceed down the wrong flow.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the latest commit — same fix as above. isSystemctlMissing() catches both ENOENT and EACCES patterns.

Comment on lines +302 to 304
// Non-zero exit without missing/unavailability indicators means systemd is
// available but has degraded/failed units — acceptable for service operations.
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail fast on unknown status errors before service actions

The new fall-through in assertSystemdAvailable() treats all non-matching systemctl --user status failures as acceptable, which means non-degraded errors (for example permission/policy failures that don't contain the four checked substrings) now proceed into service operations; uninstallSystemdService() then ignores the disable --now result and can print a successful removal after deleting the unit file even though systemd never accepted the disable/stop request. This regression is introduced by returning success on unknown status failures here.

Useful? React with 👍 / 👎.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed — assertSystemdAvailable() now checks isSystemctlMissing() (throws) and bus-unavailability patterns (throws) before silently accepting the degraded state. Only a genuine degraded-but-available systemd falls through without throwing.

@jnMetaCode jnMetaCode force-pushed the fix/systemd-degraded-fallthrough branch 3 times, most recently from 6c94d77 to 3a77074 Compare March 7, 2026 00:56
@jnMetaCode jnMetaCode force-pushed the fix/systemd-degraded-fallthrough branch from 3a77074 to da90474 Compare March 7, 2026 12:13
@vincentkoc vincentkoc self-assigned this Mar 8, 2026
@vincentkoc
Copy link
Member

Thanks for the work here.

Closing this as superseded by #39325, which rebases the degraded-session fix onto current main and carries the related status-surface work for externally managed systemd services in the same pass.

The core degraded-session problem you identified is real and still present on main; the new PR keeps that fix path current and mergeable.

@vincentkoc vincentkoc closed this Mar 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants