Skip to content

fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950)#1951

Merged
cv merged 3 commits into
mainfrom
fix/cleanup-orphaned-ssh-forward-on-reonboard-1950
Apr 16, 2026
Merged

fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950)#1951
cv merged 3 commits into
mainfrom
fix/cleanup-orphaned-ssh-forward-on-reonboard-1950

Conversation

@yanyunl1991

@yanyunl1991 yanyunl1991 commented Apr 16, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #1950

After destroying a sandbox and gateway, the SSH port-forward process for the dashboard (port 18789) is left running. Re-onboard fails at preflight with "Port 18789 is not available. Blocked by: ssh".

Fix

In the preflight port check, when port 18789 is blocked by an ssh process, automatically kill it and retry — similar to the orphaned gateway container cleanup in #1582.

Reproduction & Verification

Before fix:
[1/8] Preflight checks
✓ Docker is running
✓ Container runtime: docker
✓ openshell CLI: openshell 0.0.26
✓ Port 8080 available (OpenShell gateway)

!! Port 18789 is not available.
NemoClaw dashboard needs this port.
Blocked by: ssh (PID 3595441)

After fix:
[1/8] Preflight checks
✓ Docker is running
✓ Container runtime: docker
✓ openshell CLI: openshell 0.0.26
✓ Port 8080 available (OpenShell gateway)
Cleaning up orphaned SSH port-forward on port 18789 (PID 3595441)...
✓ Port 18789 available after orphaned forward cleanup (NemoClaw dashboard)

Test plan

  • Reproduced on Ubuntu 24.04 (NemoClaw v0.1.0, OpenShell 0.0.26)
  • Verified auto-cleanup of orphaned SSH forward and successful re-onboard
  • Verified non-SSH processes on port 18789 still trigger the original error
  • Existing test suite passes

Summary by CodeRabbit

  • Bug Fixes
    • Improved dashboard port conflict handling during onboarding: the installer now attempts to detect and clean up stale processes that block the required port, waits briefly, and retries.
    • If automatic recovery succeeds, setup continues; otherwise previous error reporting and exit behavior remain.

Signed-off-by: Yanyun Liao yanyunl@nvidia.com

…1950)

When a previous NemoClaw session is destroyed, the SSH port-forward
process for the dashboard (port 18789) may be left running. Re-onboard
then fails with "Port 18789 is not available" blocked by ssh.

In the preflight port check, detect when port 18789 is held by an ssh
process and automatically kill it before failing, similar to the
orphaned gateway container cleanup in #1582.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
@coderabbitai

coderabbitai Bot commented Apr 16, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

preflight() in src/lib/onboard.ts now attempts automatic cleanup of orphaned SSH port-forward processes that block the dashboard port: when the DASHBOARD_PORT check fails with an ssh PID, it inspects the process command via ps -p <pid> -o args=, kills the process only if the command contains openshell, sleeps 1s, and rechecks the port before exiting.

Changes

Cohort / File(s) Summary
Orphaned SSH cleanup
src/lib/onboard.ts
Changed portCheck from const to let to allow revalidation after remediation. When DASHBOARD_PORT is blocked by an ssh PID, run ps -p <pid> -o args= and, if the args include openshell, log cleanup, kill <pid> (errors ignored), wait 1s, and re-run checkPortAvailable. If port becomes available continue; otherwise fall back to existing error + process.exit(1) flow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I found a stray ssh in the night,

peeked at its args by process light,
"openshell" whispered in the line—
a gentle nudge, a one-second rest,
then dashboard woke up, feeling its best. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The code changes fully address #1950 objectives: detecting SSH blocking port 18789, automatically killing the orphaned SSH process, retrying port checks, and preserving non-SSH error behavior.
Out of Scope Changes check ✅ Passed All changes are directly scoped to the port validation loop in preflight(), implementing only the orphaned SSH cleanup logic without introducing unrelated modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: automatic cleanup of orphaned SSH port-forwards during onboarding, which is the core fix implemented in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cleanup-orphaned-ssh-forward-on-reonboard-1950

Comment @coderabbitai help to get the list of available commands and usage tips.

…1950)

Check /proc/<PID>/cmdline for "openshell" before killing an SSH
process on port 18789. Avoids accidentally killing unrelated SSH
tunnels the user may have set up on the same port.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2038-2055: Extract this recovery branch out of preflight().

This adds another decision path to a function that is already complexity-suppressed. Moving the SSH-forward cleanup/retry into a small helper would make the port-check loop easier to reason about and test. As per coding guidelines, "Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2038 - 2055, The port-orphan recovery logic
inside preflight() (the branch that detects DASHBOARD_PORT held by an "ssh"
process, kills the pid via run(...), sleeps, and re-checks via
checkPortAvailable) should be extracted to a small helper like
recoverOrphanedSshForward(port: number): Promise<boolean>; move the conditional
that checks port === DASHBOARD_PORT && portCheck.process === "ssh" &&
portCheck.pid into that helper, perform run(...) and sleep(1) there, re-run
checkPortAvailable(port) and return true if the port becomes available (false
otherwise), and replace the inline block in preflight() with a single await call
to recoverOrphanedSshForward(port) to keep preflight()’s loop simpler and lower
cyclomatic complexity; keep references to GATEWAY_PORT, DASHBOARD_PORT,
gatewayReuseState, checkPortAvailable, run, and sleep unchanged so callers still
behave the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2044-2055: The current cleanup kills any SSH process on
DASHBOARD_PORT which may terminate legitimate tunnels; modify the logic in
onboard.ts so you only kill when you can positively identify the orphaned
NemoClaw forward or when the gateway is known to be gone: before calling
run(`kill ...`) re-check the gateway state (the same state used elsewhere in
onboarding) to ensure NemoClaw is terminated OR inspect the SSH listener process
details (use ps -o pid,comm,args or lsof for the pid returned by
checkPortAvailable) and only proceed if the command/args or owner match the
known NemoClaw forward pattern (e.g., expected ssh -L args, label or user);
otherwise skip killing and log a warning. Use the existing checkPortAvailable,
portCheck.pid, and run(...) helpers and keep the sleep(1) + re-check via
checkPortAvailable after a confirmed kill.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2038-2055: The port-orphan recovery logic inside preflight() (the
branch that detects DASHBOARD_PORT held by an "ssh" process, kills the pid via
run(...), sleeps, and re-checks via checkPortAvailable) should be extracted to a
small helper like recoverOrphanedSshForward(port: number): Promise<boolean>;
move the conditional that checks port === DASHBOARD_PORT && portCheck.process
=== "ssh" && portCheck.pid into that helper, perform run(...) and sleep(1)
there, re-run checkPortAvailable(port) and return true if the port becomes
available (false otherwise), and replace the inline block in preflight() with a
single await call to recoverOrphanedSshForward(port) to keep preflight()’s loop
simpler and lower cyclomatic complexity; keep references to GATEWAY_PORT,
DASHBOARD_PORT, gatewayReuseState, checkPortAvailable, run, and sleep unchanged
so callers still behave the same.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: c76de41b-6fde-441a-aba9-2fb48a19cbb5

📥 Commits

Reviewing files that changed from the base of the PR and between 56256a2 and e30c4b9.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts
Replace Linux-only /proc/<PID>/cmdline with `ps -p <PID> -o args=`
which works on Linux, macOS, and WSL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

2048-2056: ⚠️ Potential issue | 🟠 Major

Don't clean up OpenShell SSH forwards unless the NemoClaw gateway is known gone.

This still terminates a foreign active OpenShell dashboard forward: when gatewayReuseState === "foreign-active", the ssh args will also contain openshell, but the surrounding preflight logic says those conflicts should be surfaced rather than remediated. Please gate the cleanup to states that prove the NemoClaw gateway is missing/stale before calling kill.

💡 Suggested guard
-      if (port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid) {
+      const canCleanupDashboardForward =
+        gatewayReuseState === "missing" ||
+        gatewayReuseState === "stale" ||
+        gatewayReuseState === "active-unnamed";
+      if (
+        port === DASHBOARD_PORT &&
+        canCleanupDashboardForward &&
+        portCheck.process === "ssh" &&
+        portCheck.pid
+      ) {
         // Use `ps` to get the command line — works on Linux, macOS, and WSL.
         const cmdline = runCapture(
           `ps -p ${portCheck.pid} -o args= 2>/dev/null`,
           { ignoreError: true },
         ).trim();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2048 - 2056, The cleanup currently kills
processes whose args include "openshell" regardless of gatewayReuseState; change
the condition so the kill is only run when gatewayReuseState indicates the
NemoClaw gateway is actually missing/stale (i.e., not "foreign-active"). In the
block that checks port === DASHBOARD_PORT && portCheck.process === "ssh" &&
portCheck.pid and computes cmdline, add a guard using the gatewayReuseState
variable (e.g., only proceed with console.log and run(`kill ...`) when
gatewayReuseState !== "foreign-active" and/or equals the specific
"missing"/"stale" values your preflight logic uses) so foreign-active openshell
forwards are not terminated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2048-2056: The cleanup currently kills processes whose args
include "openshell" regardless of gatewayReuseState; change the condition so the
kill is only run when gatewayReuseState indicates the NemoClaw gateway is
actually missing/stale (i.e., not "foreign-active"). In the block that checks
port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid and
computes cmdline, add a guard using the gatewayReuseState variable (e.g., only
proceed with console.log and run(`kill ...`) when gatewayReuseState !==
"foreign-active" and/or equals the specific "missing"/"stale" values your
preflight logic uses) so foreign-active openshell forwards are not terminated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a4995e28-ccf3-470d-9a23-a7eafa87e86c

📥 Commits

Reviewing files that changed from the base of the PR and between 1f30aab and 6b1c1d0.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

@yanyunl1991 yanyunl1991 changed the title fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#… fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950) Apr 16, 2026
@yanyunl1991 yanyunl1991 changed the title fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950) fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950) Apr 16, 2026
@cv cv enabled auto-merge (squash) April 16, 2026 07:28
@cv cv merged commit ec4d058 into main Apr 16, 2026
19 of 25 checks passed
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[All platforms] Re-onboard does not clean up orphaned SSH port-forward from previous session

3 participants