Skip to content

Daemon: harden launchd plist writes and abortable service restarts#46493

Closed
shichangs wants to merge 1 commit intoopenclaw:mainfrom
shichangs:codex/rescue-watchdog-daemon
Closed

Daemon: harden launchd plist writes and abortable service restarts#46493
shichangs wants to merge 1 commit intoopenclaw:mainfrom
shichangs:codex/rescue-watchdog-daemon

Conversation

@shichangs
Copy link
Copy Markdown
Contributor

Summary

  • Problem: launchd plist handling and managed-service restart paths were too trusting about filesystem targets and could not propagate abort signals down to child service-control commands.
  • Why it matters: LaunchAgent install/restart touches user-owned plist paths and supervisor commands; symlink-safe writes and abortable restarts reduce both security risk and stuck watchdog behavior.
  • What changed: hardens launchd home/plist resolution and atomic writes, rejects symlinked LaunchAgent paths, tightens plist permissions to 0600, and threads AbortSignal through launchd/systemd/schtasks restart helpers via runCommandWithTimeout.
  • What did NOT change (scope boundary): no rescue watchdog engine, no onboarding flow, no cron payload/schema changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • LaunchAgent install now refuses symlinked target paths and writes plist files with 0600 via a temp-file + rename flow.
  • Managed service restart helpers now accept abort signals so higher-level watchdog code can cancel in-flight restart attempts cleanly.

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): Yes
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation:
    This change only narrows existing service-control behavior. Child restart commands now honor AbortSignal, and launchd plist writes reject symlink targets and use trusted homedir resolution plus atomic writes to reduce path-hijack risk.

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm workspace
  • Model/provider: N/A
  • Integration/channel (if any): N/A
  • Relevant config (redacted): local managed gateway service only

Steps

  1. Run the daemon/service unit tests covering launchd, systemd, schtasks, and process exec timeout behavior.
  2. Build the repo to confirm the daemon and exec changes compile cleanly.
  3. Inspect the launchd code path to confirm plist writes now resolve a trusted home, reject symlink targets, and use atomic temp-file writes.

Expected

  • Launchd plist writes fail closed on symlinked paths and persist with restrictive permissions.
  • Abort signals reach child service-control commands on launchd/systemd/schtasks restart paths.

Actual

  • Verified by targeted tests and local build; behavior matches the expected outcomes above.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • launchd plist path resolution ignores a conflicting HOME override and uses trusted user home lookup.
    • launchd install rejects symlinked LaunchAgents directories and plist targets.
    • launchd plist writes use restrictive 0600 mode.
    • AbortSignal now propagates through launchd/systemd/schtasks restart helpers into the process exec layer.
  • Edge cases checked:
    • os.userInfo() failure falls back to os.homedir().
    • unresolved trusted home throws instead of silently writing to an unsafe path.
    • Windows scheduled-task wrapper keeps existing timeout behavior while adding abort support.
  • What you did not verify:
    • manual end-to-end launchd/systemd restart behavior on a real long-running host.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR.
  • Files/config to restore: src/daemon/launchd.ts, src/daemon/systemd.ts, src/daemon/schtasks.ts, src/process/exec.ts
  • Known bad symptoms reviewers should watch for: LaunchAgent install unexpectedly failing on valid paths, or service restart paths no longer responding to cancellation.

Risks and Mitigations

  • Risk: launchd path hardening could reject previously tolerated but unsafe filesystem layouts.
    • Mitigation: the checks are limited to symlinked/non-directory targets in the LaunchAgents path and have focused test coverage.
  • Risk: abort propagation could alter restart timing on platforms that previously ignored cancellation.
    • Mitigation: the change is additive, targeted tests pass, and the underlying timeout behavior is preserved.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime size: M labels Mar 14, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 14, 2026

Greptile Summary

This PR hardens launchd plist handling and service restart paths by: (1) resolving the user home directory through os.userInfo() instead of the potentially attacker-controlled HOME environment variable, (2) rejecting symlinked LaunchAgents directories and plist targets to prevent path-hijack attacks, (3) writing plists atomically via O_NOFOLLOW | O_CREAT | O_TRUNC temp-file + rename with 0600 permissions (down from 0644), and (4) threading AbortSignal through all three service-control backends (launchd, systemd, schtasks) via runCommandWithTimeout. The changes are well-scoped, well-tested, and backward-compatible.

  • The env parameter of resolveLaunchAgentPlistPathForLabel is now unused after the trusted-home refactor and should be renamed to _env or removed to avoid misleading future readers.
  • writeLaunchAgentPlistSecure does not clean up the temp file if fs.rename() fails after a successful handle.close() — a try/finally around the rename would prevent orphaned temp files.
  • AbortSignal wiring is clean and consistent across all three backends; the signal variable shadowing in the exec.ts close callback is handled correctly by accessing options.signal inside that scope.

Confidence Score: 4/5

  • Safe to merge; the hardening is well-scoped with solid test coverage and only minor cleanup items remaining.
  • All security-sensitive changes (symlink rejection, trusted home resolution, atomic writes, permission tightening, abort propagation) are covered by targeted unit tests. The two flagged items are style/best-practice issues rather than correctness bugs, and neither creates a security regression relative to the previous code.
  • src/daemon/launchd.ts — temp file cleanup on rename failure and unused env parameter in resolveLaunchAgentPlistPathForLabel.

Comments Outside Diff (1)

  1. src/daemon/launchd.ts, line 62-68 (link)

    Unused env parameter

    The env parameter is declared but never used in the function body — resolveTrustedLaunchAgentHome() is called unconditionally and ignores the environment entirely. This could mislead future maintainers into believing the environment still influences the resolved path. Consider either removing the parameter (and updating callers) or adding a lint-suppression comment to make the intentional ignore explicit.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: src/daemon/launchd.ts
    Line: 62-68
    
    Comment:
    **Unused `env` parameter**
    
    The `env` parameter is declared but never used in the function body — `resolveTrustedLaunchAgentHome()` is called unconditionally and ignores the environment entirely. This could mislead future maintainers into believing the environment still influences the resolved path. Consider either removing the parameter (and updating callers) or adding a lint-suppression comment to make the intentional ignore explicit.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/daemon/launchd.ts
Line: 248-249

Comment:
**Temp file orphaned on rename failure**

If `fs.rename(tempPath, plistPath)` fails (e.g. due to a permissions error or an unexpected race), the temp file at `tempPath` will be permanently orphaned in the `LaunchAgents` directory. Even though the file has `0600` permissions and is in the same directory as the final target (making a cross-device failure unlikely), it's still best practice to clean it up on failure:

```suggestion
  await handle.close();
  try {
    await fs.rename(tempPath, plistPath);
  } catch (error) {
    await fs.unlink(tempPath).catch(() => undefined);
    throw error;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/daemon/launchd.ts
Line: 62-68

Comment:
**Unused `env` parameter**

The `env` parameter is declared but never used in the function body — `resolveTrustedLaunchAgentHome()` is called unconditionally and ignores the environment entirely. This could mislead future maintainers into believing the environment still influences the resolved path. Consider either removing the parameter (and updating callers) or adding a lint-suppression comment to make the intentional ignore explicit.

```suggestion
function resolveLaunchAgentPlistPathForLabel(
  _env: Record<string, string | undefined>,
  label: string,
): string {
  const home = resolveTrustedLaunchAgentHome();
  return path.posix.join(home, "Library", "LaunchAgents", `${label}.plist`);
}
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: c83aca2

Comment thread src/daemon/launchd.ts
Comment on lines +248 to +249
await handle.close();
await fs.rename(tempPath, plistPath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp file orphaned on rename failure

If fs.rename(tempPath, plistPath) fails (e.g. due to a permissions error or an unexpected race), the temp file at tempPath will be permanently orphaned in the LaunchAgents directory. Even though the file has 0600 permissions and is in the same directory as the final target (making a cross-device failure unlikely), it's still best practice to clean it up on failure:

Suggested change
await handle.close();
await fs.rename(tempPath, plistPath);
await handle.close();
try {
await fs.rename(tempPath, plistPath);
} catch (error) {
await fs.unlink(tempPath).catch(() => undefined);
throw error;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/daemon/launchd.ts
Line: 248-249

Comment:
**Temp file orphaned on rename failure**

If `fs.rename(tempPath, plistPath)` fails (e.g. due to a permissions error or an unexpected race), the temp file at `tempPath` will be permanently orphaned in the `LaunchAgents` directory. Even though the file has `0600` permissions and is in the same directory as the final target (making a cross-device failure unlikely), it's still best practice to clean it up on failure:

```suggestion
  await handle.close();
  try {
    await fs.rename(tempPath, plistPath);
  } catch (error) {
    await fs.unlink(tempPath).catch(() => undefined);
    throw error;
  }
```

How can I resolve this? If you propose a fix, please make it concise.

@shichangs
Copy link
Copy Markdown
Contributor Author

Superseded by #46499.

I split the original rescue-watchdog work again and folded the daemon hardening into the new core PR so reviewers can see the actual watchdog runtime value in one place instead of starting from a standalone low-level daemon PR.

@shichangs shichangs closed this Mar 14, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c83aca214e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/daemon/schtasks.ts
const taskName = resolveTaskName(effectiveEnv);
await execSchtasks(["/End", "/TN", taskName]);
const sig = signal ? { signal } : undefined;
await execSchtasks(["/End", "/TN", taskName], sig);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Abort Windows restart immediately when /End is canceled

After adding signal support, execSchtasks(["/End", ...], sig) can now return an aborted/non-zero result, but that result is ignored and the restart flow continues into listener termination and port-wait logic before failing later. In watchdog cancellation scenarios this defeats prompt cancelation and can still tear down runtime state after the caller has already aborted. Handle the /End result the same way stopScheduledTask does (allow only "not running"), and abort the restart path immediately on other non-zero outcomes.

Useful? React with 👍 / 👎.

Comment thread src/daemon/launchd.ts
Comment on lines +238 to +239
fsConstants.O_WRONLY | fsConstants.O_CREAT | fsConstants.O_TRUNC | fsConstants.O_NOFOLLOW,
LAUNCH_AGENT_PLIST_MODE,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guarantee 0600 when writing launchd plist atomically

The secure writer opens the temp file with O_CREAT | O_TRUNC and then renames it, but never chmods the final plist. If that temp path already exists, POSIX ignores the passed mode on open, so the file can keep broader preexisting permissions and the resulting plist may not actually be 0600. Using O_EXCL (or a random unique temp filename) and/or enforcing mode after rename avoids this permission regression.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gateway Gateway runtime size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant