fix(onboard): auto-cleanup orphaned SSH port-forward on re-onboard (#1950)#1951
Conversation
…1950) When a previous NemoClaw session is destroyed, the SSH port-forward process for the dashboard (port 18789) may be left running. Re-onboard then fails with "Port 18789 is not available" blocked by ssh. In the preflight port check, detect when port 18789 is held by an ssh process and automatically kill it before failing, similar to the orphaned gateway container cleanup in #1582. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
📝 WalkthroughWalkthroughpreflight() in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
…1950) Check /proc/<PID>/cmdline for "openshell" before killing an SSH process on port 18789. Avoids accidentally killing unrelated SSH tunnels the user may have set up on the same port. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2038-2055: Extract this recovery branch out ofpreflight().This adds another decision path to a function that is already complexity-suppressed. Moving the SSH-forward cleanup/retry into a small helper would make the port-check loop easier to reason about and test. As per coding guidelines, "Limit cyclomatic complexity to 20 in JavaScript/TypeScript files, with target of 15"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2038 - 2055, The port-orphan recovery logic inside preflight() (the branch that detects DASHBOARD_PORT held by an "ssh" process, kills the pid via run(...), sleeps, and re-checks via checkPortAvailable) should be extracted to a small helper like recoverOrphanedSshForward(port: number): Promise<boolean>; move the conditional that checks port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid into that helper, perform run(...) and sleep(1) there, re-run checkPortAvailable(port) and return true if the port becomes available (false otherwise), and replace the inline block in preflight() with a single await call to recoverOrphanedSshForward(port) to keep preflight()’s loop simpler and lower cyclomatic complexity; keep references to GATEWAY_PORT, DASHBOARD_PORT, gatewayReuseState, checkPortAvailable, run, and sleep unchanged so callers still behave the same.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2044-2055: The current cleanup kills any SSH process on
DASHBOARD_PORT which may terminate legitimate tunnels; modify the logic in
onboard.ts so you only kill when you can positively identify the orphaned
NemoClaw forward or when the gateway is known to be gone: before calling
run(`kill ...`) re-check the gateway state (the same state used elsewhere in
onboarding) to ensure NemoClaw is terminated OR inspect the SSH listener process
details (use ps -o pid,comm,args or lsof for the pid returned by
checkPortAvailable) and only proceed if the command/args or owner match the
known NemoClaw forward pattern (e.g., expected ssh -L args, label or user);
otherwise skip killing and log a warning. Use the existing checkPortAvailable,
portCheck.pid, and run(...) helpers and keep the sleep(1) + re-check via
checkPortAvailable after a confirmed kill.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2038-2055: The port-orphan recovery logic inside preflight() (the
branch that detects DASHBOARD_PORT held by an "ssh" process, kills the pid via
run(...), sleeps, and re-checks via checkPortAvailable) should be extracted to a
small helper like recoverOrphanedSshForward(port: number): Promise<boolean>;
move the conditional that checks port === DASHBOARD_PORT && portCheck.process
=== "ssh" && portCheck.pid into that helper, perform run(...) and sleep(1)
there, re-run checkPortAvailable(port) and return true if the port becomes
available (false otherwise), and replace the inline block in preflight() with a
single await call to recoverOrphanedSshForward(port) to keep preflight()’s loop
simpler and lower cyclomatic complexity; keep references to GATEWAY_PORT,
DASHBOARD_PORT, gatewayReuseState, checkPortAvailable, run, and sleep unchanged
so callers still behave the same.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: c76de41b-6fde-441a-aba9-2fb48a19cbb5
📒 Files selected for processing (1)
src/lib/onboard.ts
Replace Linux-only /proc/<PID>/cmdline with `ps -p <PID> -o args=` which works on Linux, macOS, and WSL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Yanyun Liao <yanyunl@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
2048-2056:⚠️ Potential issue | 🟠 MajorDon't clean up OpenShell SSH forwards unless the NemoClaw gateway is known gone.
This still terminates a foreign active OpenShell dashboard forward: when
gatewayReuseState === "foreign-active", thesshargs will also containopenshell, but the surrounding preflight logic says those conflicts should be surfaced rather than remediated. Please gate the cleanup to states that prove the NemoClaw gateway is missing/stale before callingkill.💡 Suggested guard
- if (port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid) { + const canCleanupDashboardForward = + gatewayReuseState === "missing" || + gatewayReuseState === "stale" || + gatewayReuseState === "active-unnamed"; + if ( + port === DASHBOARD_PORT && + canCleanupDashboardForward && + portCheck.process === "ssh" && + portCheck.pid + ) { // Use `ps` to get the command line — works on Linux, macOS, and WSL. const cmdline = runCapture( `ps -p ${portCheck.pid} -o args= 2>/dev/null`, { ignoreError: true }, ).trim();🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2048 - 2056, The cleanup currently kills processes whose args include "openshell" regardless of gatewayReuseState; change the condition so the kill is only run when gatewayReuseState indicates the NemoClaw gateway is actually missing/stale (i.e., not "foreign-active"). In the block that checks port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid and computes cmdline, add a guard using the gatewayReuseState variable (e.g., only proceed with console.log and run(`kill ...`) when gatewayReuseState !== "foreign-active" and/or equals the specific "missing"/"stale" values your preflight logic uses) so foreign-active openshell forwards are not terminated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2048-2056: The cleanup currently kills processes whose args
include "openshell" regardless of gatewayReuseState; change the condition so the
kill is only run when gatewayReuseState indicates the NemoClaw gateway is
actually missing/stale (i.e., not "foreign-active"). In the block that checks
port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid and
computes cmdline, add a guard using the gatewayReuseState variable (e.g., only
proceed with console.log and run(`kill ...`) when gatewayReuseState !==
"foreign-active" and/or equals the specific "missing"/"stale" values your
preflight logic uses) so foreign-active openshell forwards are not terminated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a4995e28-ccf3-470d-9a23-a7eafa87e86c
📒 Files selected for processing (1)
src/lib/onboard.ts
Summary
Fixes #1950
After destroying a sandbox and gateway, the SSH port-forward process for the dashboard (port 18789) is left running. Re-onboard fails at preflight with "Port 18789 is not available. Blocked by: ssh".
Fix
In the preflight port check, when port 18789 is blocked by an
sshprocess, automatically kill it and retry — similar to the orphaned gateway container cleanup in #1582.Reproduction & Verification
Before fix:
[1/8] Preflight checks
✓ Docker is running
✓ Container runtime: docker
✓ openshell CLI: openshell 0.0.26
✓ Port 8080 available (OpenShell gateway)
!! Port 18789 is not available.
NemoClaw dashboard needs this port.
Blocked by: ssh (PID 3595441)
After fix:
[1/8] Preflight checks
✓ Docker is running
✓ Container runtime: docker
✓ openshell CLI: openshell 0.0.26
✓ Port 8080 available (OpenShell gateway)
Cleaning up orphaned SSH port-forward on port 18789 (PID 3595441)...
✓ Port 18789 available after orphaned forward cleanup (NemoClaw dashboard)
Test plan
Summary by CodeRabbit
Signed-off-by: Yanyun Liao yanyunl@nvidia.com