Skip to content

feat(desktop): launchd attach, partial attach, dev isolation & hot reload#519

Merged
lefarcen merged 79 commits intomainfrom
refactor/launchd-process-architecture
Mar 24, 2026
Merged

feat(desktop): launchd attach, partial attach, dev isolation & hot reload#519
lefarcen merged 79 commits intomainfrom
refactor/launchd-process-architecture

Conversation

@lefarcen
Copy link
Copy Markdown
Collaborator

@lefarcen lefarcen commented Mar 24, 2026

Summary

Incremental improvements to the launchd-based process architecture (#405):

Service Attach Mechanism

  • Full attach: Detect already-running launchd services and reuse them (~200ms vs ~7s cold start)
  • Partial attach: If only one service survived (e.g. OpenClaw), reuse it and only cold-start the missing one (~2-3s)
  • Port recovery via runtime-ports.json persisted between sessions
  • Environment validation: dev (.dev) never attaches to packaged services and vice versa
  • NEXU_HOME validation prevents cross-environment attach

Dev/Packaged State Isolation

  • Dev state fully repo-scoped under .tmp/desktop/nexu-home/ (config, OpenClaw state, logs)
  • Packaged state unchanged at ~/.nexu/
  • OPENCLAW_STATE_DIR now derives from NEXU_HOME instead of hardcoded ~/.nexu
  • Dev logs go to NEXU_HOME/logs/ not ~/.nexu/logs/

Hot Reload

  • pnpm start auto-watches controller (tsc --watch + launchctl kickstart) and web (polling + rebuild)
  • Controller changes apply in ~2-3s without restarting Electron or OpenClaw

Port Architecture

  • OPENCLAW_GATEWAY_PORT passed to controller plist for consistent port allocation
  • Config compiler uses env.openclawGatewayPort (not config store) to avoid stale port persistence
  • Auto-increment ports when defaults are occupied

Reliability

  • OpenClaw sidecar extraction retries up to 3x with rm -rf fallback (macOS ENOTEMPTY fix)
  • Port occupier detection kills rogue processes on openclaw port during cleanup
  • Dev mode uses auth=none (no gateway token) to avoid token mismatch on hot reload

Documentation

  • Comprehensive English startup flow guide at specs/guides/desktop-startup-flow.md
  • Updated AGENTS.md with correct directory layout

Test plan

  • pnpm typecheck passes
  • pnpm test passes
  • Dev: pnpm start → cold start → stop → start (attach)
  • Dev: hot reload (change controller file → auto-restart)
  • Dev: all default ports occupied → auto-increment works
  • Packaged: cold start with ports occupied → works
  • Packaged: kill controller, keep OpenClaw → partial attach on relaunch
  • Packaged: full attach after "Run in Background" → ~200ms
  • Dev state isolated in .tmp/desktop/nexu-home/

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added service attach flow for faster startup when services are already running.
    • Added automatic watchers for controller and web changes during development.
  • Bug Fixes

    • Improved port conflict resolution by proactively clearing blocked ports.
    • Enhanced sidecar extraction with retry logic for reliability.
  • Documentation

    • New comprehensive desktop startup flow guide.
    • Updated development guidance with state and configuration locations.
    • Removed duplicate npm script aliases.

lefarcen and others added 30 commits March 23, 2026 16:10
Replace reserved close code 1008 (Policy Violation) with private code
4008 when closing WebSocket connections on error. Code 1008 is reserved
for server use and Node.js 22's native WebSocket throws
DOMException [InvalidAccessError] when clients attempt to use it,
causing the controller process to crash on authentication failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Cherry-pick WebSocket close code fix from PR #365
- Change launchd namespace from com.nexu.* to io.nexu.*
- Add progress tracking directory with STATUS, DECISIONS, ISSUES

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1-3 of launchd architecture refactor:

- LaunchdManager: wrapper for launchctl commands (install, start,
  stop, status, graceful shutdown)
- PlistGenerator: generates launchd plist XML for Controller and
  OpenClaw services with proper env vars and dependencies
- EmbeddedWebServer: serves static files and proxies API requests
  to Controller, replacing the web sidecar process

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- launchd-bootstrap.ts: complete bootstrap flow for launchd-based
  startup (install services, start controller, start openclaw,
  start embedded web server)
- Feature flag NEXU_USE_LAUNCHD=1 for gradual rollout
- Unified log directory at ~/.nexu/logs/
- Path resolution for packaged vs dev environments
- Index file exporting all services

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- quit-handler.ts: handles before-quit event with dialog
- Options: Quit Completely (stop services), Run in Background, Cancel
- Graceful shutdown of launchd services
- Exported via services/index.ts

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
scripts/dev-launchd.sh provides:
- start: generate plists, bootstrap and start services
- stop: gracefully stop services
- restart: stop then start
- status: show launchd service status
- logs: tail all log files

Uses io.nexu.*.dev labels and ~/.nexu/logs/ for logging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add launchd service imports
- Add runLaunchdColdStart function that uses bootstrapWithLaunchd
- Check NEXU_USE_LAUNCHD=1 flag to choose bootstrap mode
- Install launchd quit handler after successful launchd bootstrap
- Modify before-quit handler to skip orchestrator cleanup in launchd mode
- Derive openclaw paths from nexuHome config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- launchd-manager.test.ts: tests for LaunchdManager class and SERVICE_LABELS
- plist-generator.test.ts: tests for generatePlist function

Tests cover:
- Platform check (darwin only)
- Default and custom plist directories
- UID-based domain construction
- Dev vs prod label generation
- Plist XML generation with correct structure
- XML character escaping
- Log path configuration
- Service dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix OpenClaw config paths to match controller defaults in env.ts
  (OPENCLAW_STATE_DIR=~/.nexu/runtime/openclaw/state)
- Add `gateway` subcommand to OpenClaw plist generation
- Use OPENCLAW_CONFIG_PATH env var instead of --config argument
- Add --auth none for dev mode to simplify local development
- Update tests to verify OPENCLAW_CONFIG_PATH env var presence

Tested with ./scripts/dev-launchd.sh - Controller and OpenClaw
WebSocket connection verified working.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… workflow

- Add "starting" RuntimeStatus: when OpenClaw gateway is unreachable but
  process is alive, show "启动中" instead of "已离线"
- Parallelize launchd service install/start + web server (Promise.all)
- Use adaptive readiness polling (50ms→250ms) instead of fixed 250ms
- Fix dev-launchd.sh stop: use bootout directly instead of SIGTERM+bootout
  race with KeepAlive; use SIGKILL for Electron to bypass quit handler
- Dev quit handler keeps services running (run-in-background) so vite HMR
  restarts don't kill launchd services
- Add tool progress prompt to nexu-platform-bootstrap plugin
- Disable humanDelay in config compiler
- Cold start time reduced from ~5s to ~2s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Stop: wait for ports to free after bootout, SIGKILL orphans including
  chrome_crashpad_handler
- Fix resolveLaunchdPaths for packaged mode: OpenClaw is at
  runtime/openclaw/node_modules/openclaw/openclaw.mjs, not
  runtime/openclaw-runtime/openclaw.mjs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tartup

When the WebSocket to OpenClaw gateway isn't connected yet (during
startup), channels were shown as "disconnected" (red). Now they show
as "connecting" (yellow pulse) when the runtime is still starting,
giving users a much less alarming startup experience.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add explicit "booting" → "ready" lifecycle to ControllerRuntimeState.
During boot, gateway-unreachable is always treated as "starting" (not
"unhealthy"), regardless of whether the process manager owns the
OpenClaw process (fixes launchd mode where processManager.isAlive()
returns false). Channel live status also uses bootPhase to show
"connecting" during startup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The embedded web server serves static files from apps/web/dist, so
code changes to the web app require a build step before starting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bootPhase was set to "ready" immediately after wsClient.connect(),
but the WS handshake hadn't completed yet. Health loop then saw
gateway-unreachable + bootPhase=ready → "unhealthy" → UI showed
"已离线" during startup.

Now bootPhase transitions to "ready" inside the onConnected callback,
so the entire startup shows "starting" → "active" cleanly.

Also adds temporary debug logs to home.tsx for startup diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Channel error status now shows translated lastError (e.g. "会话已过期"
  instead of generic "错误")
- Controller maps WeChat "not configured" + not running to
  "session expired" for better UX
- Add i18n keys for common channel errors (session expired, not
  configured, disabled)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use warning color (orange) instead of danger (red) for known
recoverable errors like session expired, with actionable label
"请重新连接" / "Reconnect required".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The exponential backoff for OpenClaw WebSocket reconnection could
reach 16s+ during startup, causing the UI to stay in "starting"
state for 20+ seconds. Cap at 4s so retry sequence is 1→2→4→4→4s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…way up

When the health loop detects the gateway HTTP endpoint becomes
reachable, it calls wsClient.retryNow() to cancel the backoff timer
and connect immediately. This eliminates the 4-16s gap between
gateway ready and WS connected during startup.

Also replaces the ugly "Starting local services..." loading screen
with a minimal Nexu logo pulse animation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- main.tsx had a duplicate SurfaceFrame with old loading text; replaced
  with Nexu logo pulse animation matching surface-frame.tsx
- dev-launchd.sh now checks for dist/index.html and rebuilds desktop
  if missing, preventing blank screen after accidental dist deletion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace plain loading screen with animated Nexu logo matching the
design system prototype (NexuLoader.tsx). Four quadrants light up
sequentially in brand colors: orange, green, pink, gold.
Pure CSS animation, no framer-motion dependency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
main.tsx had its own SurfaceFrame copy with the old loading screen.
Now imports from components/surface-frame.tsx so both Runtime Console
and Desktop Shell views use the same 4-color Nexu loader.

Background updated to dark radial gradient matching desktop theme.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loader now overlays on top of the webview instead of replacing it.
The webview loads silently in the background while the Nexu logo
animation plays. When the webview fires dom-ready, the loader
disappears — no blank frames, no intermediate Loader2 spinner.

Background uses warm radial gradient for polished appearance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The spinning circle was briefly visible between the Nexu splash
loader and the actual UI. Replace with an empty div since the
desktop splash overlay already covers the loading period.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Run openclawProcess.prepare(), ensureRuntimeModelPlugin(), and
  prepareDesktopCloudModelsForBootstrap() in parallel (were sequential)
- Remove redundant compileCurrentConfig() call for preSeedConfigHash —
  doSync() already seeds the hash via noteConfigWritten()
- Reduce WS initial backoff from 1000ms to 500ms (sequence: 500→1000→2000→4000→4000...)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quit dialog now shows Chinese or English based on app.getLocale().
Chinese users see "完全退出 / 后台运行 / 取消".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Packaged macOS builds now use launchd mode by default (no env var
needed). This enables the quit dialog, crash recovery, and background
service support in production. Can be explicitly disabled with
NEXU_USE_LAUNCHD=0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix path traversal vulnerability in embedded-web-server (sanitize
  URL pathname, reject paths outside webRoot)
- Install launchd quit handler after try/catch so it works even if
  auth bootstrap fails
- Add error handling to quitWithDecision (was missing try/catch)
- Fix isAlive() handling undefined pid from failed spawn
- Rename NODE_PATH to NODE_BIN in dev script to avoid Node.js
  env var conflict

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- openclaw-config-writer: additional config writing logic
- plist-generator: updated plist generation + tests
- package.json: script updates for launchd dev workflow
- openclaw-weixin accounts.ts: prior session changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
lefarcen and others added 6 commits March 24, 2026 20:27
When Electron allocates a non-default port (e.g. 18790 because 18789
is occupied), the controller needs to know this port for both the
config compiler and WS/health connections.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Node.js rmSync with recursive+force can fail with ENOTEMPTY on macOS.
Fall back to execFileSync rm -rf which handles this reliably.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Node.js rmSync can silently fail on macOS (ENOTEMPTY race). Now uses
rm -rf exclusively with existence check, and retries up to 3 times
with 1s pause between attempts. This ensures first-launch sidecar
extraction succeeds reliably.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace binary attach-or-cold-start with unified per-service flow:
- Each service independently checked: running+healthy → keep, else restart
- Ports recovered from runtime-ports.json when any service is still running
- NEXU_HOME validated to prevent cross-environment attach
- Missing services started with correct recovered ports
- Unhealthy running services torn down and restarted

This enables partial attach: if only OpenClaw survived a crash, the
next launch reuses its port and only cold-starts the controller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
getLogDir() now accepts nexuHome param so dev mode writes launchd
service logs to .tmp/desktop/nexu-home/logs/ instead of ~/.nexu/logs/.
Also updates AGENTS.md with correct directory layout for dev vs packaged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrites desktop-startup-flow.md with full implementation details:
- Directory layout for dev vs packaged (with tree diagrams)
- Label isolation between dev (.dev) and packaged modes
- Unified bootstrap flow (attach + cold start per-service)
- Port architecture and auto-allocation
- Attach mechanism (full, partial, fallback)
- Status display timeline
- File watch hot reload
- Exit behavior
- OpenClaw sidecar extraction
- Complete key files reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Mar 24, 2026

Deploying nexu-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: cdda66f
Status: ✅  Deploy successful!
Preview URL: https://b1950477.nexu-docs.pages.dev
Branch Preview URL: https://refactor-launchd-process-arc.nexu-docs.pages.dev

View logs

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 24, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 60f2af45-c20e-4455-b059-d2d499bd6ed4

📥 Commits

Reviewing files that changed from the base of the PR and between c5da48e and af8cb5e.

📒 Files selected for processing (7)
  • apps/controller/src/app/env.ts
  • apps/desktop/main/index.ts
  • apps/desktop/main/runtime/manifests.ts
  • apps/desktop/main/services/launchd-bootstrap.ts
  • apps/desktop/main/services/quit-handler.ts
  • scripts/dev-launchd.sh
  • specs/guides/desktop-startup-flow.md

📝 Walkthrough

Walkthrough

This PR enhances desktop startup flow by implementing launchd service attach/port-recovery, dev/packaged mode isolation with separate NEXU_HOME directories, runtime-ports persistence, and improved service health checking. It includes sidecar extraction retries, dev-mode watchers for controller/web rebuilds, and comprehensive startup flow documentation.

Changes

Cohort / File(s) Summary
Documentation
AGENTS.md, specs/guides/desktop-startup-flow.md
Updated dev guidance on runtime state storage locations (dev: .tmp/desktop/nexu-home/, packaged: ~/.nexu/), clarified pnpm reset-state scope, and added detailed 311-line desktop startup flow specification covering dev vs packaged modes, attach/cold-start logic, port persistence, and service architecture.
Environment & Configuration
apps/controller/src/app/env.ts, apps/controller/src/lib/openclaw-config-compiler.ts
Changed OPENCLAW_STATE_DIR from required fixed default to optional with fallback resolution, and sourced OpenClaw gateway port from env instead of runtime config.
Launchd Bootstrap & Service Management
apps/desktop/main/services/launchd-bootstrap.ts, apps/desktop/main/services/launchd-manager.ts, apps/desktop/main/services/plist-generator.ts, apps/desktop/main/services/quit-handler.ts
Implemented attach/port-recovery flow with runtime-ports.json persistence, health probing, and conditional port reuse. Added environment variable extraction in service status, optional nexuHome in plist env, and runtime-ports cleanup on quit. Extended interfaces with optional nexuHome, attachedPorts, and plistDir fields.
Electron Startup & Sidecar
apps/desktop/main/index.ts, apps/desktop/main/runtime/manifests.ts
Modified cold-start flow to handle recovered ports and reattach to running services. Added 3-attempt retry loop with exponential backoff for sidecar extraction cleanup.
Development Scripts & Configuration
scripts/dev-launchd.sh, package.json
Enhanced dev startup with port-conflict resolution, controller/web source watchers for hot reload, repo-scoped dev state storage, and removed duplicate *:launchd npm script aliases.

Sequence Diagram(s)

sequenceDiagram
    participant Electron as Electron App
    participant Runtime as runtime-ports.json
    participant LC as launchctl
    participant Health as Health Checks
    participant Web as Web Server
    
    Electron->>Runtime: Read persisted ports (if exist)
    alt Ports found & dev mode match
        Runtime-->>Electron: Return recovered ports
        Electron->>LC: Check controller/openclaw status
        LC-->>Electron: Service running status
        alt Services still running
            Electron->>Health: Probe controller /health
            Health-->>Electron: Port validity + env check
            alt Health OK & NEXU_HOME match
                Electron->>Web: Start with recovered ports
                Web-->>Electron: attachedPorts result
            else Health check failed
                Electron->>Electron: Fall through to cold-start
            end
        else Services stopped
            Electron->>Electron: Fall through to cold-start
        end
    else No ports or mismatch
        Electron->>Electron: Proceed to cold-start
    end
    
    alt Cold-start flow
        Electron->>LC: Install & start controller/openclaw
        LC-->>Electron: Services initializing
        Electron->>Health: Poll controller readiness
        Health-->>Electron: Controller ready
        Electron->>Web: Start with fresh ports
        Web-->>Electron: Persist ports to runtime-ports.json
    end
    
    Electron->>Electron: Render UI (gated by controller ready)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • PR #519: Implements the same launchd attach/port-recovery, dev/packaged NEXU_HOME isolation, runtime-ports persistence, and service bootstrap enhancements across identical modified files.
  • PR #356: Establishes the foundational launchd-based desktop architecture, runtime/plist conventions, and service abstractions that this PR extends with attach/reuse and port recovery.
  • PR #405: Introduces the initial launchd service modules (launchd-bootstrap, launchd-manager, plist-generator, quit-handler) that this PR significantly modifies with optional env persistence and port management.

Suggested reviewers

  • PerishCode
  • mrcfps

Poem

🐰 Hoppy startup flows, with ports held tight,
Dev and packaged homes, both shining bright,
Attach to running services with speed,
Retry the sidecar when extraction's in need,
Hot watchers rebuild while the app takes flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.91% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main changes: launchd attach/partial attach, dev isolation, and hot reload features implemented across the desktop codebase.
Description check ✅ Passed The description covers all required template sections: What (service attach, state isolation, hot reload), Why (improved startup performance and development experience), How (implementation details and architecture), Affected areas (Desktop app selected), test plan with checkmarks, and notes on reliability improvements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/launchd-process-architecture

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@lefarcen lefarcen marked this pull request as draft March 24, 2026 14:26
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c5da48e829

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread apps/desktop/main/services/launchd-bootstrap.ts
Comment thread apps/desktop/main/services/launchd-bootstrap.ts Outdated
Comment thread scripts/dev-launchd.sh Outdated
Resolve conflicts keeping our attach/isolation/hot-reload improvements
while incorporating main's changes:
- Brand name: "Nexu" → "nexu" (lowercase, from bb1ddcc)
- UI polish from 2fb384c
- surface-frame: white background from main's design update

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lefarcen lefarcen force-pushed the refactor/launchd-process-architecture branch from c5da48e to cdda66f Compare March 24, 2026 14:32
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/desktop/main/services/launchd-bootstrap.ts (1)

473-481: ⚠️ Potential issue | 🟠 Major

Update plist when environment variables change, not just on first installation.

The ensureService function only generates and installs the plist when isServiceInstalled(label) returns false. Once the service is installed and registered, the condition is never met again, so the plist is never regenerated even if effectivePorts, nexuHome, or gatewayToken change. This causes the running service to use stale values for PORT, OPENCLAW_GATEWAY_PORT, NEXU_HOME, and OPENCLAW_GATEWAY_TOKEN from the original plist, while the bootstrap logic operates with new values, leading to mismatches on cold starts.

Additionally, installService only calls launchctl bootstrap if the service is not already registered. If it were called again with an updated plist, the service would still not pick up the changes because launchctl kickstart (used in ensureRunning and startService) starts the already-loaded config in memory rather than reloading from disk.

Fix by regenerating the plist when environment values change, or by using bootout followed by bootstrap to reload the configuration from disk.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/desktop/main/services/launchd-bootstrap.ts` around lines 473 - 481,
ensureService currently only generates and installs a plist when
launchd.isServiceInstalled(label) is false, causing stale env vars; update
ensureService to detect when generatePlist(type, plistEnv) produces different
content than the installed plist and on change either (a) call
launchd.bootout(label) then launchd.installService(label, newPlist) to bootstrap
the updated plist, or (b) overwrite the on-disk plist and call launchctl
bootstrap of the service so the running instance picks up new
PORT/OPENCLAW_GATEWAY_PORT/NEXU_HOME/OPENCLAW_GATEWAY_TOKEN values; use
generatePlist, installService, and the launchd bootout/bootstrap/ensureRunning
or startService flows to perform the replace-and-restart sequence when plist
content differs.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/desktop/main/index.ts`:
- Around line 499-507: RuntimeOrchestrator was built earlier from
createRuntimeUnitManifests(..., runtimeConfig) but this branch only mutates
runtimeConfig when launchdResult.attachedPorts is present, so the existing
RuntimeOrchestrator still has old port endpoints; update/resync the orchestrator
after updating runtimeConfig by regenerating the manifests and rebuilding or
calling the appropriate refresh method on RuntimeOrchestrator (e.g., recreate or
call RuntimeOrchestrator.sync/update with createRuntimeUnitManifests(...,
runtimeConfig)) so all components use the attachedPorts values from
launchdResult.

In `@apps/desktop/main/runtime/manifests.ts`:
- Around line 191-206: The retry loop in ensurePackagedOpenclawSidecar() shells
out to PATH-dependent binaries ("rm", "tar", "sleep") which can fail in packaged
apps; replace those execFileSync calls so the extraction is self-contained: use
fs.rmSync or fs.rmdirSync with { recursive: true, force: true } to remove
extractedSidecarRoot, use a Node-based tar extractor (e.g.
require('tar').extract or another bundled tar library resolved via
require.resolve) to unpack archivePath into extractedSidecarRoot, and replace
the execFileSync("sleep", ["1"]) pause with a programmatic delay (convert the
function to async and use await new Promise(r => setTimeout(r, 1000)) or a
synchronous Atomics.wait alternative). Update MAX_RETRIES loop and
writeFileSync(stampPath, archiveStamp) usage accordingly so no PATH-dependent
binaries are invoked.

In `@apps/desktop/main/services/launchd-bootstrap.ts`:
- Around line 516-529: The fallback start for startEmbeddedWebServer uses port:
0 but never updates effectivePorts.webPort afterwards, so capture the actual
bound port from webServer (e.g. via the server object's listening address/port)
immediately after each successful start (both the try/catch block around
webServer creation and the similar block at the later occurrence) and assign it
to effectivePorts.webPort before any logging, attachedPorts updates, or writing
runtime-ports.json so persistence reflects the real listening port.
- Around line 497-505: Currently the code unconditionally SIGKILLs the PID
returned by detectPortOccupier for effectivePorts.openclawPort; instead, call a
verification step (e.g. inspect the occupier's command/args or process name via
the same detectPortOccupier result or a helper like getProcessInfo) and only
call process.kill(occupier.pid, "SIGKILL") when the command string or binary
matches known Nexu/OpenClaw signatures; if it does not match, do not kill—either
pick a new free port (update effectivePorts.openclawPort) or throw/return a
clear error indicating the port is occupied by a non-OpenClaw process so the
caller can decide. Ensure you reference detectPortOccupier, occupier.pid,
effectivePorts.openclawPort and process.kill in the change so reviewers can find
and update the kill path.
- Around line 323-337: detectPortOccupier currently calls execFileAsync("lsof",
...) which depends on PATH and swallows errors; change the call in
detectPortOccupier to use the absolute binary path "/usr/sbin/lsof" and update
the catch to surface or log the error (rather than silently returning null) so
missing-binary failures are visible; also scan related launchd service methods
that call launchctl or other system binaries and switch to absolute paths (e.g.,
"/bin/launchctl" or "/usr/bin/launchctl") and ensure their error handlers log or
propagate execution errors instead of swallowing them.

In `@apps/desktop/main/services/plist-generator.ts`:
- Around line 30-31: The OpenClaw plist generation does not include the nexuHome
-> NEXU_HOME environment variable, so when launchd-bootstrap.ts falls back to
openclawStatus.env?.NEXU_HOME the value is missing; update the OpenClaw plist
generation code to thread the nexuHome property into the plist's environment
(set NEXU_HOME to the provided nexuHome) the same way the controller plist does,
and make the same change in the second OpenClaw plist-generation block (the
other spot referenced around 155-183) so both generated OpenClaw plists include
NEXU_HOME for runtime partial-attach checks.
- Around line 81-89: The plist is missing a WEB_URL entry so launchd falls back
to http://localhost:5173; update the plist generation in plist-generator.ts (the
block that writes OPENCLAW_GATEWAY_PORT) to also emit a
<key>WEB_URL</key><string>...</string> when env.webUrl is present (use the same
escapeXml(env.webUrl) helper), placing it alongside OPENCLAW_GATEWAY_PORT so
compileOpenClawConfig() and the launchd controller receive the same origin
value.

In `@apps/desktop/main/services/quit-handler.ts`:
- Around line 146-149: Make the runtime-ports cleanup best-effort so it cannot
abort the quit sequence: call deleteRuntimePorts(opts.plistDir) in a try/catch
or fire-and-forget (Promise.catch) so any thrown error is swallowed/logged but
does not prevent setting __nexuForceQuit and calling app.quit(); ensure the
branch that currently checks opts.plistDir still runs the quit logic
unconditionally after scheduling or catching the deleteRuntimePorts call so
__nexuForceQuit and app.quit() always run even if deleteRuntimePorts fails.

In `@scripts/dev-launchd.sh`:
- Around line 156-157: The EXIT/INT/TERM trap currently references
CONTROLLER_WATCH_PID and WEB_WATCH_PID while they may be unset under set -euo
pipefail; update the trap to guard those variables (e.g. use parameter expansion
with defaults like ${CONTROLLER_WATCH_PID:-} and ${WEB_WATCH_PID:-} or build a
local pid string only if set) so kill is never given an unbound parameter, and
ensure stop_services is still invoked; locate the trap line that installs the
handler and replace the direct $CONTROLLER_WATCH_PID/$WEB_WATCH_PID references
with safe expansions or a conditional check before calling kill.
- Around line 175-183: The web watcher background subshell is inheriting set
-euo pipefail so a failed build command (the pipeline containing "pnpm --filter
`@nexu/web` build 2>&1 | tail -1 | sed 's/^/[web] /'") will terminate the loop; to
fix it, ensure the subshell does not exit on a non-zero build status by either
disabling errexit at the start of that subshell (e.g., run "set +e" inside the
subshell that contains the while loop) or by making the build pipeline tolerant
to failure (e.g., append "|| true" to the pnpm...|tail...|sed pipeline), and
preserve the existing echo messages so the loop continues watching for future
changes.

In `@specs/guides/desktop-startup-flow.md`:
- Around line 161-163: The startup docs are outdated: update the description of
waitForControllerReadiness() to state it polls /api/auth/get-session (not
/health) and note that the embedded web server port is allocated directly in
Electron via startEmbeddedWebServer() rather than via the controller config
chain; edit the text that references waitForControllerReadiness(), the polling
cadence, and the bootstrap flow (including lines referenced around 239–241) to
point readers to apps/desktop/main/services/launchd-bootstrap.ts for readiness
behavior and to the Electron startEmbeddedWebServer() implementation for port
allocation and debugging.

---

Outside diff comments:
In `@apps/desktop/main/services/launchd-bootstrap.ts`:
- Around line 473-481: ensureService currently only generates and installs a
plist when launchd.isServiceInstalled(label) is false, causing stale env vars;
update ensureService to detect when generatePlist(type, plistEnv) produces
different content than the installed plist and on change either (a) call
launchd.bootout(label) then launchd.installService(label, newPlist) to bootstrap
the updated plist, or (b) overwrite the on-disk plist and call launchctl
bootstrap of the service so the running instance picks up new
PORT/OPENCLAW_GATEWAY_PORT/NEXU_HOME/OPENCLAW_GATEWAY_TOKEN values; use
generatePlist, installService, and the launchd bootout/bootstrap/ensureRunning
or startService flows to perform the replace-and-restart sequence when plist
content differs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a1159815-06ea-4eca-9350-cbd4ee32f343

📥 Commits

Reviewing files that changed from the base of the PR and between 7f06028 and c5da48e.

📒 Files selected for processing (13)
  • AGENTS.md
  • apps/controller/src/app/env.ts
  • apps/controller/src/lib/openclaw-config-compiler.ts
  • apps/desktop/main/index.ts
  • apps/desktop/main/runtime/manifests.ts
  • apps/desktop/main/services/launchd-bootstrap.ts
  • apps/desktop/main/services/launchd-manager.ts
  • apps/desktop/main/services/plist-generator.ts
  • apps/desktop/main/services/quit-handler.ts
  • apps/desktop/src/components/surface-frame.tsx
  • package.json
  • scripts/dev-launchd.sh
  • specs/guides/desktop-startup-flow.md
💤 Files with no reviewable changes (1)
  • package.json

Comment thread apps/desktop/main/index.ts
Comment thread apps/desktop/main/runtime/manifests.ts
Comment thread apps/desktop/main/services/launchd-bootstrap.ts
Comment thread apps/desktop/main/services/launchd-bootstrap.ts
Comment thread apps/desktop/main/services/launchd-bootstrap.ts
Comment thread apps/desktop/main/services/plist-generator.ts
Comment thread apps/desktop/main/services/quit-handler.ts
Comment thread scripts/dev-launchd.sh Outdated
Comment thread scripts/dev-launchd.sh
Comment thread specs/guides/desktop-startup-flow.md
- Fix web port fallback: increment port instead of port 0, record
  actual port in effectivePorts for runtime-ports.json
- Fix quit handler: catch deleteRuntimePorts errors to prevent
  blocking quit
- Fix dev script: initialize watcher PIDs before trap to avoid
  set -u errors
- Fix docs: probe endpoint is /api/auth/get-session not /health

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lefarcen lefarcen marked this pull request as ready for review March 24, 2026 15:03
@lefarcen lefarcen merged commit 1a1c6f6 into main Mar 24, 2026
7 checks passed
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: af8cb5e616

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +500 to +504
console.warn(
`Port ${effectivePorts.openclawPort} occupied by PID ${occupier.pid}, killing...`,
);
try {
process.kill(occupier.pid, "SIGKILL");
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid killing arbitrary process on OpenClaw port

When OpenClaw is not healthy, startup now force-kills whatever PID is listening on effectivePorts.openclawPort without verifying ownership. In packaged or dev runs, this can terminate an unrelated local service that happens to use that port, causing unexpected outages/data loss outside Nexu. This should only kill known stale Nexu/OpenClaw processes (or fail with a clear conflict) instead of unconditional SIGKILL on the first lsof result.

Useful? React with 👍 / 👎.

Comment on lines +385 to +387
!expectedNexuHome ||
!runningNexuHome ||
runningNexuHome === expectedNexuHome
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require validated NEXU_HOME before reusing recovered ports

The attach gate treats a missing runningNexuHome as acceptable (!runningNexuHome), so recovered ports are trusted even when environment identity cannot be verified. In partial-survival cases where controller is down, runningNexuHome is typically unavailable, which bypasses the cross-environment guard and can attach to stale launchd state from another dev workspace sharing the same labels. The check should fail closed when expectedNexuHome is set but no matching runtime value is available.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants