Skip to content

fix(onboard): clean up stale openclaw-gateway dashboard listeners#3405

Merged
ericksoa merged 8 commits into
mainfrom
fix/3397-3398-stale-gateway-cleanup
May 13, 2026
Merged

fix(onboard): clean up stale openclaw-gateway dashboard listeners#3405
ericksoa merged 8 commits into
mainfrom
fix/3397-3398-stale-gateway-cleanup

Conversation

@laitingsheng

@laitingsheng laitingsheng commented May 12, 2026

Copy link
Copy Markdown
Contributor

Summary

When openshell forward start is killed unexpectedly (failed onboard, container crash mid-build, upgrade across versions), the host-side process that holds the NemoClaw dashboard port can survive. The next nemoclaw onboard detects the conflict, falls back to a different port, and bakes the wrong port into CHAT_UI_URL — the new sandbox builds but is never reachable. This PR adds a conservative, ownership- and cmdline-checked sweep of the dashboard port range and wires it into the three lifecycle entry points where it can recover safely, with a protectedPorts opt-out so live forwards of other sandboxes are never disrupted.

Related Issue

Fixes #3397.
Fixes #3398.

Changes

  • New module src/lib/onboard/stale-gateway-cleanup.ts: ownership- and cmdline-checked scan of the dashboard port range (18789-18799) with two-phase TERM-then-KILL and bounded waits. Mirrors the proven tryStopOllamaProxyPid pattern from the uninstaller. Exposes a protectedPorts option so callers can opt-out specific ports.
  • The lsof probe is scoped to listening sockets via -sTCP:LISTEN to avoid signalling processes that merely have an in-flight client connection to the port.
  • Unexpected lsof exit statuses (anything other than 0/1) now surface through deps.warn so cleanup failures are visible instead of being swallowed.
  • Wired into cleanupGatewayAfterLastSandbox in src/lib/actions/sandbox/destroy.ts after the cooperative openshell forward stop, into the Stopping services step in src/lib/actions/uninstall/run-plan.ts between stopMatchingPids and stopOrphanedOpenShell, and into the --fresh branch of src/lib/onboard.ts before preflight so a fresh install never has to fight a leftover process from the previous attempt.
  • The --fresh call passes the dashboard ports of OTHER registered sandboxes as protectedPorts, only unprotecting the explicit target when the operator passes --name (or NEMOCLAW_SANDBOX_NAME). Without an explicit target name we cannot tell which sandbox the user means to rebuild, so all registered ports stay protected and the upgrade/same-name recovery becomes opt-in via --name <existing>.
  • Extend isDockerDriverGatewayPid in src/lib/actions/sandbox/destroy.ts to match openclaw-gateway cmdlines as well as openshell-gateway, so the pid-file cleanup path covers the re-exec'd process name when lsof is unavailable.
  • Conservative by design: only PIDs the current user can signal are considered; only PIDs whose cmdline matches openclaw-gateway / openshell forward markers are killed; the scan degrades silently when lsof is unavailable.
  • Behavioural test cases in src/lib/onboard/stale-gateway-cleanup.test.ts: lsof missing -> no-op; no listeners in range -> no-op; user-owned openclaw-gateway -> killed; SIGTERM ignored -> SIGKILL escalation; foreign-owned -> skipped; cmdline mismatch -> skipped; same PID on multiple ports -> looked up once; protected port -> skipped without probing; protected PID on unprotected port -> still skipped.
  • Documented the automatic stale-gateway cleanup in the Port already in use section of docs/reference/troubleshooting.md.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • `npx prek run --all-files` passes
  • `npm test` passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • `make docs` builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai tinsonl@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • Automatically sweep and stop stale host processes that can block the dashboard port during uninstall, sandbox destruction, and fresh onboarding; respects ports owned by active sandboxes and escalates termination if needed.
    • Improved detection of gateway processes so more stale listeners are identified and cleared.
  • Tests

    • Added comprehensive tests covering stale-listener detection, safe termination behavior, and fresh-onboard ordering.
  • Documentation

    • Updated troubleshooting notes for dashboard port conflicts.

Review Change Stack

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 12, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR adds a stale gateway cleanup module that detects and terminates orphaned openclaw-gateway/openshell-gateway host listeners and integrates it into onboard --fresh (with protected ports), sandbox destroy (after stopping forwards), and uninstall (stopping services). Troubleshooting docs and tests were added; Docker gateway PID detection now recognizes openclaw-gateway.

Changes

Stale Gateway Cleanup

Layer / File(s) Summary
Stale Gateway Cleanup Module
src/lib/onboard/stale-gateway-cleanup.ts
Exports DI types and defaults (RunResult, StaleGatewayDeps, CleanupResult, StaleGatewayOptions), helpers for PID discovery/ownership/cmdline checks, and stopStaleDashboardListeners which scans dashboard ports via lsof, deduplicates PIDs, filters by owner/cmdline/protected status, and performs bounded SIGTERM->SIGKILL termination.
Stale Gateway Cleanup Tests
src/lib/onboard/stale-gateway-cleanup.test.ts
Vitest suite stubbing run to verify missing lsof, empty scans, SIGTERM success, SIGKILL escalation, foreign-owner/protected/cmdline skips, PID deduplication, and signal ordering.
Integration into Cleanup Flows
src/lib/onboard.ts, src/lib/actions/sandbox/destroy.ts, src/lib/actions/uninstall/run-plan.ts, docs/reference/troubleshooting.md, test/onboard.test.ts
Calls cleanup in onboard --fresh preflight (computes protected ports from registry excluding target sandbox), invokes it after dashboard forward stop in sandbox destroy, and during uninstall stopping-services. Extends Docker PID detection to match openclaw-gateway and updates troubleshooting docs and onboarding tests.

Sequence Diagram(s)

sequenceDiagram
  participant CLI
  participant Stopper
  participant LSOF
  participant PS
  participant Kernel
  CLI->>Stopper: invoke stopStaleDashboardListeners({protectedPorts})
  Stopper->>LSOF: run "lsof -iTCP:PORT_RANGE -sTCP:LISTEN -n -P"
  LSOF-->>Stopper: return ports and PIDs
  Stopper->>PS: run "ps -o user= -o args= -p PID"
  PS-->>Stopper: return user and cmdline
  Stopper->>Kernel: send SIGTERM to PID
  Kernel-->>Stopper: signal delivered
  Stopper->>Kernel: send SIGKILL if still alive
  Kernel-->>Stopper: final exit status
  Stopper-->>CLI: return CleanupResult
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A gateway left orphaned upon the port,
Now captured and cleanly escorted—away!
Fresh onboard, destroy, uninstall all report,
Protected ports kept safe throughout the day.
The rabbit hums: "Goodbye stale PID, hooray!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main change: adding cleanup of stale openclaw-gateway dashboard listeners in the onboard flow.
Linked Issues check ✅ Passed All requirements from #3397 and #3398 are met: stale openclaw-gateway processes are detected and killed in uninstall, onboard --fresh, and destroy flows using lsof scanning with SIGTERM/SIGKILL escalation.
Out of Scope Changes check ✅ Passed All changes are directly related to resolving the stale gateway cleanup objectives. No unrelated modifications or feature creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/3397-3398-stale-gateway-cleanup

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: openshell-gateway-upgrade-e2e, upgrade-stale-sandbox-e2e, double-onboard-e2e, sandbox-operations-e2e, deployment-services-e2e
Optional E2E: onboard-repair-e2e, onboard-resume-e2e, sandbox-survival-e2e, device-auth-health-e2e, cloud-onboard-e2e, gpu-double-onboard-e2e

Dispatch hint: openshell-gateway-upgrade-e2e,upgrade-stale-sandbox-e2e,double-onboard-e2e,sandbox-operations-e2e,deployment-services-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • openshell-gateway-upgrade-e2e: Directly exercises the 'stale Linux Docker-driver gateway process must be restarted' scenario that this PR's new openclaw-gateway cmdline matcher and destroy-path sweep are designed for. Closest existing coverage of the [DGX Spark][Upgrade] In-place upgrade leaves stale openclaw-gateway on port 18789 — new sandbox creation fails with 180s timeout #3397/[DGX Spark][Install] Express setup sandbox creation fails — port 18789 occupied by stale openclaw-gateway from previous failed onboard #3398 regression surface.
  • upgrade-stale-sandbox-e2e: Reproduces the upgrade-time scenario (old install → create sandbox → upgrade → re-onboard) which is the canonical setup for stale host-side gateway-forward processes that this PR sweeps.
  • double-onboard-e2e: Covers re-onboard lifecycle recovery, the primary user flow affected by the new --fresh sweep. A bug in protectedPorts seeding or the TERM/KILL loop would surface here as a failed second onboard.
  • sandbox-operations-e2e: Exercises multi-sandbox isolation and per-sandbox dashboard ports (18789 checks for SANDBOX_A/SANDBOX_B). The new protectedPorts logic must not disrupt live forwards of registered sibling sandboxes — this is the only existing E2E that creates multiple sandboxes concurrently.
  • deployment-services-e2e: Includes the uninstall lifecycle (TC-DEPLOY-03), which is the exact path now calling stopStaleDashboardListeners from run-plan.ts during 'Stopping services'. Regression in the uninstall sweep would break uninstall cleanup.

Optional E2E

  • onboard-repair-e2e: Covers repair flows adjacent to --fresh; confirms stale-listener sweep does not interfere with repair.
  • onboard-resume-e2e: Resumes an interrupted onboard (which the PR comment cites as a common source of orphaned gateway-forward processes).
  • sandbox-survival-e2e: Validates sandbox survival across gateway restarts; ensures the destroy-path sweep is not triggered unexpectedly when a sandbox remains.
  • device-auth-health-e2e: Exercises dashboard port reachability / port-forward liveness after onboard — secondary confidence that the port range (18789-18799) is correctly freed/owned after the sweep.
  • cloud-onboard-e2e: General onboard + inference.local probe sanity check on the public installer path.
  • gpu-double-onboard-e2e: Second double-onboard confirmation on a fresh VM (Ollama provider) — re-exercises the --fresh/re-onboard seam.

New E2E recommendations

  • onboard/stale-dashboard-port-recovery (high): No existing E2E specifically plants a stale process listening on the dashboard port (mimicking argv 'openclaw-gateway') before running nemoclaw onboard --fresh and asserts the new sweep (a) kills it, (b) lets the new sandbox bind the ORIGINAL dashboard port, and (c) does NOT touch a fake foreign-user or non-matching process also bound to the range. openshell-gateway-upgrade-e2e covers the Docker-driver PID-file path but not the new lsof-based port range sweep in stopStaleDashboardListeners.
    • Suggested test: test/e2e/test-stale-dashboard-port-sweep.sh: spawn a tester-owned shim bound to 127.0.0.1:18789 whose argv contains 'openclaw-gateway', then run nemoclaw onboard --fresh --name e2e-stale and assert the new sandbox's CHAT_UI_URL is http://127.0.0.1:18789 (port preserved, not fallback). Negative case: spawn a 'python -m http.server 18789' shim and assert onboard --fresh does NOT kill it (cmdline-marker guard) and instead falls back cleanly.
  • onboard/multi-sandbox-fresh-protection (medium): The protectedPorts seeding from registry.listSandboxes() in onboard.ts is new and safety-critical: a regression could kill the port-forward of an unrelated live sandbox during nemoclaw onboard --fresh --name <new>. sandbox-operations-e2e covers multi-sandbox but does not run a --fresh-for-different-name while another sandbox's forward is live.
    • Suggested test: test/e2e/test-fresh-onboard-protects-siblings.sh: onboard sandbox A (port 18789), onboard sandbox B (port 18790), then nemoclaw onboard --fresh --name C and assert A and B's forwards remain healthy (openshell forward list, HTTP 200/401 probes).

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: openshell-gateway-upgrade-e2e,upgrade-stale-sandbox-e2e,double-onboard-e2e,sandbox-operations-e2e,deployment-services-e2e

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/actions/sandbox/destroy.ts`:
- Around line 116-123: The sweep currently bails out when lsof is missing and
the subsequent pid-file cleanup only recognizes "openshell-gateway", which can
leave "openclaw-gateway" processes running; update the cleanup logic (the code
path after stopStaleDashboardListeners and stopDockerDriverGatewayProcess that
calls runOpenshell(["gateway","remove", NEMOCLAW_GATEWAY_NAME], ...)) to also
detect and remove openclaw-gateway PID files and/or attempt to stop any running
openclaw-gateway process when lsof is unavailable, ensuring both gateway names
("openshell-gateway" and "openclaw-gateway") are handled in the pid-file cleanup
routine so re-exec’d openclaw-gateway PIDs cannot survive and hold dashboard
ports.

In `@src/lib/onboard/stale-gateway-cleanup.ts`:
- Around line 161-164: The code currently swallows unexpected lsof exit codes by
returning [] when result.status is not 0/1; change this to surface the failure
by throwing or logging a clear error instead. Update the branch that checks
result.status (referencing result.status) to throw a descriptive Error (or call
the module's error logger) that includes result.status plus result.stderr and/or
result.stdout so unexpected lsof failures are not silently ignored and callers
can react per the function's contract.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 879fe512-a171-4e68-bbda-140fdc05532f

📥 Commits

Reviewing files that changed from the base of the PR and between edb7478 and 4faaac2.

📒 Files selected for processing (5)
  • src/lib/actions/sandbox/destroy.ts
  • src/lib/actions/uninstall/run-plan.ts
  • src/lib/onboard.ts
  • src/lib/onboard/stale-gateway-cleanup.test.ts
  • src/lib/onboard/stale-gateway-cleanup.ts

Comment thread src/lib/actions/sandbox/destroy.ts
Comment thread src/lib/onboard/stale-gateway-cleanup.ts
…tion

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… errors

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
… listening sockets

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@laitingsheng laitingsheng marked this pull request as ready for review May 12, 2026 15:29
@github-actions

Copy link
Copy Markdown
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25747630830
Branch: fix/3397-3398-stale-gateway-cleanup
Requested jobs: openshell-gateway-upgrade-e2e,sandbox-operations-e2e,double-onboard-e2e,deployment-services-e2e
Summary: 4 passed, 0 failed, 0 skipped

Job Result
deployment-services-e2e ✅ success
double-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
sandbox-operations-e2e ✅ success

laitingsheng added a commit that referenced this pull request May 12, 2026
…te PRs

Status exit-code work (#3386) is now tracked by #3402; stale-gateway cleanup
(#3397, #3398) by #3405. This commit reverts those changes here so that
this PR retains only the dashboard port + rollback work for #3260.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@jyaunches

Copy link
Copy Markdown
Contributor

Targeted E2E validation requested by the E2E Advisor has passed.

Run: https://github.com/NVIDIA/NemoClaw/actions/runs/25747630830

Passed targeted jobs: openshell-gateway-upgrade-e2e, sandbox-operations-e2e, double-onboard-e2e, deployment-services-e2e

PR review can proceed with that E2E signal in mind.

ericksoa added 2 commits May 12, 2026 16:59
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

@ericksoa ericksoa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the current head. Required checks are green, CodeRabbit follow-ups are resolved in code, and the remaining cancelled Pi semantic recommendation check is non-required. Approving for squash merge.

@ericksoa ericksoa merged commit 3aa9a59 into main May 13, 2026
27 of 28 checks passed
@ericksoa ericksoa deleted the fix/3397-3398-stale-gateway-cleanup branch May 13, 2026 00:24
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression and removed NemoClaw CLI labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression

Projects

None yet

4 participants