fix(onboard): auto-allocate dashboard port on multi-sandbox conflict by cjagwani · Pull Request #2454 · NVIDIA/NemoClaw

cjagwani · 2026-04-24T20:57:41Z

Summary

Second nemoclaw onboard previously crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. This PR auto-allocates the next free port on loopback conflict, persists it, surfaces it via list / status, and adds a --control-ui-port flag for pinning. Also closes the "leaks ghost sandbox" path from the issue title — unrecoverable forward-setup failures now roll back the openshell sandbox before exiting.

Supersedes #2444 (clean history, same changes).

Related Issue

Fixes #2174

Changes

src/lib/ports.ts — findFreeDashboardPort picker with 10-port window, injectable probe seam, and 1024-65535 range clamp. 6 unit tests.
src/lib/onboard.ts — ensureDashboardForward auto-allocates on loopback conflict (real lsof bind probe to avoid ports held by non-openshell processes) and returns the chosen port. On user-pinned CHAT_UI_URL or exhaustion it rolls back the openshell sandbox before exiting. Reuse paths seed chatUiUrl from the sandbox's stored dashboardPort via new reuseChatUiUrlFor helper so re-onboarding an auto-allocated sandbox doesn't ratchet its port upward on each run.
src/lib/registry.ts — dashboardPort?: number on SandboxEntry, persisted by registerSandbox.
src/lib/inventory-commands.ts — dashboard URL in nemoclaw list and nemoclaw status. 2 unit tests.
src/nemoclaw.ts — dashboard URL in nemoclaw <name> status.
src/lib/onboard-command.ts — --control-ui-port <n> flag (precedence: flag > CHAT_UI_URL env > stored port > default). 2 unit tests.
src/lib/onboard.ts (banner) — getDashboardAccessInfo and getDashboardGuidanceLines read stored dashboardPort as a precedence tier so the completion banner shows the port actually allocated.

Type of Change

Code change (feature, bug fix, or refactor)

Verification

npx prek run --all-files passes (pre-commit hooks green — prettier, ESLint, shellcheck, hadolint, gitleaks, markdownlint, Test (plugin), YAML)
npm test passes (10 new unit tests; 1829+ total passing locally)
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed

Manual repro for reviewer: two back-to-back nemoclaw onboard --non-interactive — first takes 18789, second logs Dashboard port 18789 in use by 'alpha'; allocated port 18790 for 'beta', nemoclaw <name> status prints each port, both dashboards reachable.

Notes for reviewer

The picker's default 10-port window (18789–18798) matches the issue's "18790 or higher" hint. Exhaustion rolls back the openshell sandbox and prints an actionable error listing which sandbox holds each port.
ensureDashboardForward exits (matching refactor(cli): extract dashboard delivery chain into contract/health/recover modules #2398's pattern post-[Jetson][Onboard] nemoclaw onboard second sandbox: port 18789 conflict shows raw JS stack trace instead of user-friendly error #2169) rather than throws. The ghost-rollback happens inside before the exit for the unrecoverable cases.
Picker uses lsof for the bind probe so auto-alloc avoids ports bound by non-openshell processes (Docker, etc.). Falls back gracefully if lsof is missing.
End-to-end two-sandbox test and nightly-e2e job are split into a separate follow-up PR to keep this change focused on [All Platform][Onboard][Regression] Second onboard crashes with unhandled exception and leaks ghost sandbox when dashboard port 18789 is already forwarded #2174's minimum close.

AI Disclosure

AI-assisted — tool: Claude Code

Signed-off-by: Charan Jagwani cjagwani@nvidia.com

Summary by CodeRabbit

New Features
- Added a CLI flag to set a custom control UI port.
- Sandboxes now allocate and persist dashboard ports; local dashboard URLs are shown in inventory and status when available.
Tests
- Added coverage for dashboard URL emission and port-allocation behavior.
Chores
- End-to-end install script now captures diagnostics on success and exits with a non-zero code.

Second nemoclaw onboard crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. The v0.0.21 guard stopped silent forward-stealing but left no recovery path — users had to set CHAT_UI_URL manually and the assigned port was not discoverable post-hoc. Changes: - src/lib/ports.ts: findFreeDashboardPort picker with 10-port window, injectable probe seam, and 1024-65535 range clamp. 6 unit tests. - src/lib/onboard.ts: ensureDashboardForward auto-allocates on loopback conflict (with a real lsof bind probe to avoid ports bound by other processes) and returns the chosen port. On user-pinned CHAT_UI_URL or window exhaustion it rolls back the openshell sandbox before exiting, closing the "leaks ghost sandbox" path from the issue title. Reuse paths seed chatUiUrl from the sandbox's stored dashboardPort via new reuseChatUiUrlFor helper so re-onboarding an auto-allocated sandbox doesn't ratchet its port upward. - src/lib/registry.ts: dashboardPort?: number on SandboxEntry, persisted by registerSandbox. - src/lib/inventory-commands.ts: dashboard URL in nemoclaw list and nemoclaw status. 2 unit tests. - src/nemoclaw.ts: dashboard URL in nemoclaw <name> status. - src/lib/onboard-command.ts: --control-ui-port <n> flag (precedence: flag > CHAT_UI_URL env > stored port > default). 2 unit tests. - src/lib/onboard.ts (banner): getDashboardAccessInfo and getDashboardGuidanceLines read stored dashboardPort as a precedence tier so the completion banner shows the port actually allocated. Fixes #2174 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

coderabbitai · 2026-04-24T20:57:55Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds optional dashboard port allocation and propagation: a new control-ui-port CLI flag, automatic free-port discovery on conflicts, persistence of allocated ports to the sandbox registry, and emission of per-sandbox local dashboard URLs in list/status outputs. Tests for ports, onboarding parsing, and inventory output were added.

Changes

Cohort / File(s)	Summary
Port allocation core `src/lib/ports.ts`, `src/lib/ports.test.ts`	Adds `PortProbe` and `findFreeDashboardPort(preferred, {probe?})` which validates preferred port (1024–65535) and scans up to 10 consecutive ports, skipping forwarded/unavailable ports; tests cover probing, bounds, and exhaustion.
Onboard core & forwarding `src/lib/onboard.ts`	`ensureDashboardForward` now returns the chosen port, accepts rollback/cleanup options, probes for an alternate free port on conflict, persists the allocated `dashboardPort` to registry, and uses reuse/rewind flows to avoid port escalation.
Registry persistence `src/lib/registry.ts`	Adds optional `dashboardPort?: number` to `SandboxEntry`; `registerSandbox` validates and persists only numeric ports within 1024–65535.
CLI flag & command parsing `src/lib/onboard-command.ts`, `src/lib/onboard-command.test.ts`	Adds `--control-ui-port <n>` to `OnboardCommandOptions` (`number
Inventory / status output `src/lib/inventory-commands.ts`, `src/lib/inventory-commands.test.ts`, `src/nemoclaw.ts`	Adds optional `dashboardPort` handling to list/status commands and prints `dashboard: http://127.0.0.1:<port>` when present; tests verify output for multiple sandboxes.
E2E script change `test/e2e/test-sandbox-survival.sh`	Alters success path to emit diagnostic dump and exit with code 99 instead of exiting 0 (script-only change).

Sequence Diagram(s)

sequenceDiagram
  participant CLI as "nemoclaw onboard"
  participant Onboard as "onboard (ensureDashboardForward)"
  participant Probe as "PortProbe / openshell forward list"
  participant Registry as "sandbox registry"
  participant Run as "sandbox runtime (chat UI forward)"

  CLI->>Onboard: parse args (--control-ui-port?)
  Onboard->>Probe: check preferred port ownership & list forwarded ports
  alt preferred free
    Onboard->>Run: start forward at preferred
    Run-->>Onboard: returns allocated port (preferred)
  else conflict
    Onboard->>Probe: findFreeDashboardPort(preferred)
    alt found port
      Onboard->>Run: start forward at chosen port
      Run-->>Onboard: returns allocated port
    else none found
      Onboard-->>CLI: error / exit(1)
    end
  end
  Onboard->>Registry: updateSandbox(..., { dashboardPort: allocatedPort })
  Registry-->>Onboard: persisted
  Onboard-->>CLI: finish (CHAT_UI_URL temporarily set/restored)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through ports and found a spare,
Turned conflicts into numbers with careful care.
A flag to steer the dashboard's door,
Saved in the registry, shown once more.
Hooray—no more ghostly ports to scare!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Out of Scope Changes check	⚠️ Warning	The test-e2e script modification that changes success exit code from 0 to 99 and adds file descriptor diagnostics appears out of scope to the core dashboard port allocation fix.	Clarify whether the e2e script changes are necessary for dashboard port verification or move to a separate diagnostic PR.
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and specifically describes the main change: implementing auto-allocation of the dashboard port on multi-sandbox conflicts, which directly addresses the core issue from `#2174`.
Linked Issues check	✅ Passed	The PR implements all primary objectives from `#2174`: auto-allocation of dashboard port [ports.ts, onboard.ts], CLI override via --control-ui-port [onboard-command.ts], port persistence in registry [registry.ts, inventory-commands.ts], and rollback on unrecoverable failures [onboard.ts].

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/2174-dashboard-port-autoalloc-v2

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-command.ts`:
- Around line 143-149: The code sets process.env.CHAT_UI_URL when
options.controlUiPort is provided but never restores the previous value, causing
cross-invocation leakage; modify the block around deps.runOnboard(options) to
save the original process.env.CHAT_UI_URL (e.g., const _orig =
process.env.CHAT_UI_URL), set it when options.controlUiPort !== null, await
deps.runOnboard(options), and then restore the original value afterwards
(reassign or delete process.env.CHAT_UI_URL if it was undefined) so repeated
calls to runOnboardCommand / deps.runOnboard do not retain the pinned URL.

In `@src/lib/onboard.ts`:
- Around line 7093-7102: printDashboard() is reconstructing
chatUiUrl/chain/guidance from process.env.CHAT_UI_URL and CONTROL_UI_PORT
instead of reusing the stored-port precedence logic, so change printDashboard()
to derive its UI URL and guidance via the same helpers/values used earlier (use
registry.getSandbox(sandboxName)?.dashboardPort, the storedUrl/ storedPort
logic, and the computed chatUiUrl variable or the same precedence:
options.chatUiUrl || process.env.CHAT_UI_URL || storedUrl ||
`http://127.0.0.1:${CONTROL_UI_PORT}`) so the post-onboard banner reflects an
auto-allocated dashboardPort; update any inline uses at printDashboard and the
similar block around lines 7135-7142 to reference those symbols instead of
rebuilding from env/defaults.
- Around line 6793-6827: ensureDashboardForward currently unconditionally
deletes the sandbox on port-forward failures; change its signature to accept an
options flag (e.g., rollbackSandboxOnFailure: boolean = false) and only call
runOpenshell(["sandbox","delete", sandboxName], ...) when
rollbackSandboxOnFailure is true (both in the userPinned/isLoopback early-exit
branch and the chosen===null branch). Leave default behavior false so reuse
paths (the callers at the earlier reuse flows) keep existing sandboxes intact,
and update only the new-sandbox creation path to call
ensureDashboardForward(..., { rollbackSandboxOnFailure: true }) after the
sandbox is freshly created.

🪄 Autofix (Beta)

✅ Autofix completed

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2459036f-c0d1-49c6-ae11-67924d315e1c

📥 Commits

Reviewing files that changed from the base of the PR and between 260b237 and 196ada1.

📒 Files selected for processing (9)

src/lib/inventory-commands.test.ts
src/lib/inventory-commands.ts
src/lib/onboard-command.test.ts
src/lib/onboard-command.ts
src/lib/onboard.ts
src/lib/ports.test.ts
src/lib/ports.ts
src/lib/registry.ts
src/nemoclaw.ts

coderabbitai · 2026-04-24T21:24:02Z

Note

Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it.

Fixes Applied Successfully

Fixed 3 file(s) based on 3 unresolved review comments.

Files modified:

nemoclaw-blueprint/scripts/ws-proxy-fix.js
src/lib/onboard-command.ts
src/lib/onboard.ts

Commit: f13c77602ee703db5682b30f5255d1ef9a9e4a94

The changes have been pushed to the fix/2174-dashboard-port-autoalloc-v2 branch.

Time taken: 3m 55s

Fixed 3 file(s) based on 3 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

copy-pr-bot · 2026-04-24T21:28:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)
6786-6873: ⚠️ Potential issue | 🟠 Major

Auto-allocation still misses non-OpenShell port conflicts.

findFreeDashboardPort() only runs when forward list already reports another sandbox owner. If the requested loopback port is held by any other local listener, portOwner stays null, forward start fails, and this code only warns. On the create path at Lines 4496-4520, that still persists dashboardPort and prints a broken URL even though nothing is forwarded. Probe the requested port for local binds before forward start, and treat a rejected start as a hard failure for fresh creates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6786 - 6873, When creating a sandbox, probe
the requested loopback port for local listeners before attempting openshell
forwarding and treat a start rejection as a hard failure for fresh creates: add
the same local-bind probe used in this block (isPortBoundLocally using
runCapture + lsof) earlier when portOwner === null so that if requestedPort is
already bound you call findFreeDashboardPort (or allocate a new effectivePort
and forwardTarget using buildChain) instead of proceeding to start;
additionally, if runOpenshell(...) /
runCaptureOpenshell(["forward","start",...]) (fwdResult) returns non-zero during
the create path, perform the same rollback (runOpenshell(["sandbox","delete",
sandboxName], { ignoreError: true })) and exit with an error message rather than
merely warning — reference the symbols requestedPort, portOwner,
isPortBoundLocally, findFreeDashboardPort, effectivePort, forwardTarget,
buildChain, fwdResult, and sandboxName to locate the insertion points.

♻️ Duplicate comments (2)

src/lib/onboard.ts (2)
6805-6839: ⚠️ Potential issue | 🔴 Critical

Scope rollback to newly created sandboxes only.

ensureDashboardForward() is also used by the reuse paths at Lines 3971-3975, 4004-4008, 4052-4056, and 4072-4076. The unconditional sandbox delete here turns a port-forward setup error into destructive data loss for an already-existing sandbox. Gate rollback behind an explicit "fresh create" flag and leave reuse flows intact.
🔧 Suggested scope guard
 function ensureDashboardForward(
   sandboxName: string,
   chatUiUrl = `http://127.0.0.1:${CONTROL_UI_PORT}`,
+  options: { rollbackSandboxOnFailure?: boolean } = {},
 ): number {
+  const rollback = () => {
+    if (options.rollbackSandboxOnFailure !== true) return;
+    runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
+  };
   const chain = buildChain({ chatUiUrl, isWsl: isWsl() });
   ...
     if (userPinnedEnv || !isLoopback) {
-      runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
+      rollback();
       console.error(`  Port ${requestedPort} is already forwarded for sandbox '${portOwner}'.`);
       ...
     }
   ...
     if (chosen === null) {
-      runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true });
+      rollback();
       console.error(`  All ports ${requestedPort}-${requestedPort + 9} are forwarded.`);
       ...
     }
Pass { rollbackSandboxOnFailure: true } only from the new-sandbox path after creation succeeds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6805 - 6839, The current unconditional
rollback in ensureDashboardForward (the runOpenshell(["sandbox","delete",
sandboxName]) call) will delete already-existing sandboxes during reuse flows;
change ensureDashboardForward to accept an options flag (e.g.,
rollbackSandboxOnFailure) and only perform runOpenshell(...sandbox delete...)
when that flag is true, keep the existing error logging and process.exit
behavior otherwise, and update the new-sandbox call site to pass {
rollbackSandboxOnFailure: true } while leaving reuse-path callers untouched.
7189-7206: ⚠️ Potential issue | 🟠 Major

Reuse the stored-port helpers in the completion banner.

printDashboard() still rebuilds chatUiUrl, chain, access URLs, and guidance from process.env.CHAT_UI_URL || CONTROL_UI_PORT instead of using the stored-port precedence added in getDashboardAccessInfo() and getDashboardGuidanceLines(). After auto-allocating 18790, the onboarding banner will still tell the user to open 18789. Route this block through the helpers, or derive the same stored-port-aware chatUiUrl here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 7189 - 7206, The code rebuilds chatUiUrl,
chain, dashboardAccess and guidanceLines manually and can return stale port
info; replace this block with the stored-port-aware helpers: call
getDashboardAccessInfo(...) to obtain the dashboardAccess array (pass token and
sandboxName or the same params those helpers expect) and call
getDashboardGuidanceLines(...) to obtain guidanceLines; remove the manual
buildChain/buildControlUiUrls logic and ensure token is passed through so any
WSL/port-precedence logic remains consistent with getDashboardAccessInfo and
getDashboardGuidanceLines (references: getDashboardAccessInfo,
getDashboardGuidanceLines, buildChain, buildControlUiUrls).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6786-6873: When creating a sandbox, probe the requested loopback
port for local listeners before attempting openshell forwarding and treat a
start rejection as a hard failure for fresh creates: add the same local-bind
probe used in this block (isPortBoundLocally using runCapture + lsof) earlier
when portOwner === null so that if requestedPort is already bound you call
findFreeDashboardPort (or allocate a new effectivePort and forwardTarget using
buildChain) instead of proceeding to start; additionally, if runOpenshell(...) /
runCaptureOpenshell(["forward","start",...]) (fwdResult) returns non-zero during
the create path, perform the same rollback (runOpenshell(["sandbox","delete",
sandboxName], { ignoreError: true })) and exit with an error message rather than
merely warning — reference the symbols requestedPort, portOwner,
isPortBoundLocally, findFreeDashboardPort, effectivePort, forwardTarget,
buildChain, fwdResult, and sandboxName to locate the insertion points.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 6805-6839: The current unconditional rollback in
ensureDashboardForward (the runOpenshell(["sandbox","delete", sandboxName])
call) will delete already-existing sandboxes during reuse flows; change
ensureDashboardForward to accept an options flag (e.g.,
rollbackSandboxOnFailure) and only perform runOpenshell(...sandbox delete...)
when that flag is true, keep the existing error logging and process.exit
behavior otherwise, and update the new-sandbox call site to pass {
rollbackSandboxOnFailure: true } while leaving reuse-path callers untouched.
- Around line 7189-7206: The code rebuilds chatUiUrl, chain, dashboardAccess and
guidanceLines manually and can return stale port info; replace this block with
the stored-port-aware helpers: call getDashboardAccessInfo(...) to obtain the
dashboardAccess array (pass token and sandboxName or the same params those
helpers expect) and call getDashboardGuidanceLines(...) to obtain guidanceLines;
remove the manual buildChain/buildControlUiUrls logic and ensure token is passed
through so any WSL/port-precedence logic remains consistent with
getDashboardAccessInfo and getDashboardGuidanceLines (references:
getDashboardAccessInfo, getDashboardGuidanceLines, buildChain,
buildControlUiUrls).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8185b0f9-9583-4ca7-844c-11ec32f2218c

📥 Commits

Reviewing files that changed from the base of the PR and between 196ada1 and d56382e.

📒 Files selected for processing (2)

src/lib/onboard.ts
src/lib/registry.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/registry.ts

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)

6790-6881: ⚠️ Potential issue | 🟠 Major

Run the port picker for local-bind conflicts too.

The fallback only triggers when forward list already owns requestedPort. If the requested port is held by any other local process, portOwner is null, we skip findFreeDashboardPort(), and then persist/print a dead dashboard port after forward start fails. This misses the exact lsof-backed conflict case the new allocator is meant to handle.

🔧 Suggested change

   const existingForwards = runCaptureOpenshell(["forward", "list"], { ignoreError: true });
   const portOwner = findDashboardForwardOwner(existingForwards, String(requestedPort));
+  const heldPorts = (existingForwards ?? "")
+    .split("\n")
+    .map((l) => l.trim())
+    .filter(Boolean)
+    .map((l) => Number(l.split(/\s+/)[2]))
+    .filter((p) => Number.isFinite(p) && p > 0);
+  const isPortBoundLocally = (p: number): boolean => {
+    const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], {
+      ignoreError: true,
+    });
+    return typeof out === "string" && out.trim().length > 0;
+  };
+  const requestedPortUnavailable =
+    heldPorts.includes(requestedPort) || isPortBoundLocally(requestedPort);
 
   let effectivePort = requestedPort;
   let forwardTarget = chain.forwardTarget;
-  if (portOwner !== null && portOwner !== sandboxName) {
+  if (requestedPortUnavailable && portOwner !== sandboxName) {
      ...
-    const heldPorts = (existingForwards ?? "")
-      .split("\n")
-      .map((l) => l.trim())
-      .filter(Boolean)
-      .map((l) => Number(l.split(/\s+/)[2]))
-      .filter((p) => Number.isFinite(p) && p > 0);
-    const isPortBoundLocally = (p: number): boolean => {
-      const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], {
-        ignoreError: true,
-      });
-      return typeof out === "string" && out.trim().length > 0;
-    };
     const chosen = findFreeDashboardPort(requestedPort, {
       probe: {
         listForwardPorts: () => heldPorts,
         probePortFree: (p: number) => !isPortBoundLocally(p),
       },

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6790 - 6881, The current port-allocation
fallback only runs when findDashboardForwardOwner(...) returns a different
sandbox, but misses the case where portOwner === null yet the requestedPort is
bound locally; move or hoist the local-probe logic so the code also runs
findFreeDashboardPort when portOwner === null but
isPortBoundLocally(requestedPort) is true. Concretely: compute
isLoopback/userPinned as done now, then either (a) merge the condition to enter
the allocator when (portOwner !== null && portOwner !== sandboxName) OR
(portOwner === null && isPortBoundLocally(requestedPort)), or factor out the
allocator block (the heldPorts mapping, isPortBoundLocally probe, and call to
findFreeDashboardPort) into a helper used in both cases; ensure you still
perform rollbackSandboxOnFailure, update forwardTarget via buildChain(...), and
print the same user-facing messages when a free port is chosen or when no ports
are available (preserve use of runCaptureOpenshell, runOpenshell,
findFreeDashboardPort, and buildChain identifiers).

🧹 Nitpick comments (1)

src/lib/onboard-command.ts (1)

146-159: Scope CHAT_UI_URL restoration to the override path only.

The finally block always rewrites process.env.CHAT_UI_URL (Line 154-Line 158), even when --control-ui-port was not provided. Guarding restoration with options.controlUiPort !== null keeps global env mutation strictly tied to this feature path.

♻️ Suggested patch

   const _origChatUiUrl = process.env.CHAT_UI_URL;
   try {
     if (options.controlUiPort !== null) {
       process.env.CHAT_UI_URL = `http://127.0.0.1:${options.controlUiPort}`;
     }
     await deps.runOnboard(options);
   } finally {
-    // Restore original value to prevent cross-invocation leakage
-    if (_origChatUiUrl === undefined) {
-      delete process.env.CHAT_UI_URL;
-    } else {
-      process.env.CHAT_UI_URL = _origChatUiUrl;
+    // Restore original value only if we overrode it in this invocation
+    if (options.controlUiPort !== null) {
+      if (_origChatUiUrl === undefined) {
+        delete process.env.CHAT_UI_URL;
+      } else {
+        process.env.CHAT_UI_URL = _origChatUiUrl;
+      }
     }
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-command.ts` around lines 146 - 159, The finally block
currently always restores process.env.CHAT_UI_URL even when we never overrode
it; change the restore logic to only run when we previously set the env (i.e.,
when options.controlUiPort !== null). In the finally block around
deps.runOnboard, wrap the existing restore/delete logic in a guard checking
options.controlUiPort !== null so that _origChatUiUrl is only applied back if we
actually changed process.env.CHAT_UI_URL.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard-command.ts`:
- Around line 82-89: The validation for --control-ui-port currently logs an
error and exits in two places (the /^\d+$/ check and the 1024–65535 range check)
but doesn't print the CLI usage like other argument validators; update both
failure branches that call error(...) and exit(1) to also invoke the parser's
usage printing function (the same helper used elsewhere in this file, e.g.,
printUsage() or usage()) before exiting so that when raw or parsed fail
validation (variables raw and parsed, and the branches that call error(...) and
exit(1)) the usage/help text is shown consistently.

---

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6790-6881: The current port-allocation fallback only runs when
findDashboardForwardOwner(...) returns a different sandbox, but misses the case
where portOwner === null yet the requestedPort is bound locally; move or hoist
the local-probe logic so the code also runs findFreeDashboardPort when portOwner
=== null but isPortBoundLocally(requestedPort) is true. Concretely: compute
isLoopback/userPinned as done now, then either (a) merge the condition to enter
the allocator when (portOwner !== null && portOwner !== sandboxName) OR
(portOwner === null && isPortBoundLocally(requestedPort)), or factor out the
allocator block (the heldPorts mapping, isPortBoundLocally probe, and call to
findFreeDashboardPort) into a helper used in both cases; ensure you still
perform rollbackSandboxOnFailure, update forwardTarget via buildChain(...), and
print the same user-facing messages when a free port is chosen or when no ports
are available (preserve use of runCaptureOpenshell, runOpenshell,
findFreeDashboardPort, and buildChain identifiers).

---

Nitpick comments:
In `@src/lib/onboard-command.ts`:
- Around line 146-159: The finally block currently always restores
process.env.CHAT_UI_URL even when we never overrode it; change the restore logic
to only run when we previously set the env (i.e., when options.controlUiPort !==
null). In the finally block around deps.runOnboard, wrap the existing
restore/delete logic in a guard checking options.controlUiPort !== null so that
_origChatUiUrl is only applied back if we actually changed
process.env.CHAT_UI_URL.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7421a1da-48bb-42bf-8d4c-35bcef286488

📥 Commits

Reviewing files that changed from the base of the PR and between d56382e and f13c776.

📒 Files selected for processing (3)

nemoclaw-blueprint/scripts/ws-proxy-fix.js
src/lib/onboard-command.ts
src/lib/onboard.ts

✅ Files skipped from review due to trivial changes (1)

nemoclaw-blueprint/scripts/ws-proxy-fix.js

coderabbitai · 2026-04-24T21:33:22Z

+    if (!/^\d+$/.test(raw)) {
+      error(`  --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+      exit(1);
+    }
+    const parsed = Number(raw);
+    if (parsed < 1024 || parsed > 65535) {
+      error(`  --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+      exit(1);


⚠️ Potential issue | 🟡 Minor

Print usage text for all --control-ui-port validation failures.

Line 83 and Line 88 exit with an error but skip usage output, while other argument-validation failures in this parser print usage. Keeping this consistent improves CLI guidance.

💡 Suggested patch

if (!/^\d+$/.test(raw)) { error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); + printOnboardUsage(error, noticeAcceptFlag); exit(1); } const parsed = Number(raw); if (parsed < 1024 || parsed > 65535) { error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); + printOnboardUsage(error, noticeAcceptFlag); exit(1); }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-command.ts` around lines 82 - 89, The validation for --control-ui-port currently logs an error and exits in two places (the /^\d+$/ check and the 1024–65535 range check) but doesn't print the CLI usage like other argument validators; update both failure branches that call error(...) and exit(1) to also invoke the parser's usage printing function (the same helper used elsewhere in this file, e.g., printUsage() or usage()) before exiting so that when raw or parsed fail validation (variables raw and parsed, and the branches that call error(...) and exit(1)) the usage/help text is shown consistently.

…2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by #2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by #2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. #2454). ## Related Follows from #2130 + #2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>  ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations.  Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves conflicts in src/lib/onboard-command.ts and onboard-command.test.ts: combine main's --resume | --fresh mutual-exclusion handling with this branch's --control-ui-port flag and tests.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/lib/onboard.ts (1)

6069-6160: ⚠️ Potential issue | 🟠 Major

Handle non-OpenShell listeners before attempting the forward.

findFreeDashboardPort() only runs when openshell forward list already shows another sandbox on requestedPort. If the port is occupied by any other local process, portOwner stays null, the picker is skipped, and Lines 6143-6159 only warn after forward start fails. In the new-sandbox path that still leaves onboarding succeeding with an unreachable dashboard, which misses the auto-allocation/rollback goal.

💡 Suggested fix

   const existingForwards = runCaptureOpenshell(["forward", "list"], { ignoreError: true });
   const portOwner = findDashboardForwardOwner(existingForwards, String(requestedPort));
+  const heldPorts = (existingForwards ?? "")
+    .split("\n")
+    .map((l) => l.trim())
+    .filter(Boolean)
+    .map((l) => Number(l.split(/\s+/)[2]))
+    .filter((p) => Number.isFinite(p) && p > 0);
+  const isPortBoundLocally = (p: number): boolean => {
+    const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], {
+      ignoreError: true,
+    });
+    return typeof out === "string" && out.trim().length > 0;
+  };
+  const portBusyLocally = isPortBoundLocally(requestedPort);

   let effectivePort = requestedPort;
   let forwardTarget = chain.forwardTarget;
-  if (portOwner !== null && portOwner !== sandboxName) {
+  if ((portOwner !== null && portOwner !== sandboxName) || portBusyLocally) {
     const userPinnedEnv = !!process.env.CHAT_UI_URL;
     let isLoopback = true;
     try {
       const parsed = new URL(
         /^[a-z]+:\/\//i.test(chatUiUrl) ? chatUiUrl : `http://${chatUiUrl}`,
@@
     const chosen = findFreeDashboardPort(requestedPort, {
       probe: {
         listForwardPorts: () => heldPorts,
         probePortFree: (p: number) => !isPortBoundLocally(p),
       },
     });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6069 - 6160, The code currently only probes
for local listeners when an openshell forward exists; to fix, detect
non-OpenShell listeners on the requested port before attempting forward and
treat that like a conflicting owner: call the existing isPortBoundLocally logic
(reuse the runCapture(["lsof", ...]) pattern) when portOwner === null and if it
returns true then run the same allocation/rollback flow that uses
findFreeDashboardPort, buildChain, and runOpenshell(["sandbox","delete",...]) as
in the portOwner !== sandboxName branch; ensure you reference and reuse the same
heldPorts/probe implementation and then set effectivePort/forwardTarget before
calling runOpenshell(["forward","start",...]) so onboarding fails over to an
allocated loopback port instead of silently producing an unreachable dashboard.

♻️ Duplicate comments (1)

src/lib/onboard-command.ts (1)

84-91: ⚠️ Potential issue | 🟡 Minor

Show usage for invalid --control-ui-port values.

Line 84 and Line 89 still exit without the usage banner, unlike the other parser errors in this function.

💡 Suggested patch

     if (!/^\d+$/.test(raw)) {
       error(`  --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+      printOnboardUsage(error, noticeAcceptFlag);
       exit(1);
     }
     const parsed = Number(raw);
     if (parsed < 1024 || parsed > 65535) {
       error(`  --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+      printOnboardUsage(error, noticeAcceptFlag);
       exit(1);
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard-command.ts` around lines 84 - 91, The invalid
--control-ui-port branches currently call error(...) and exit(1) without showing
the usage banner; modify both validation failure branches (the /^\d+$/.test(raw)
check and the range check for parsed) to invoke the same usage display used by
other parser errors (e.g. call showUsage() or the existing usage/display
function in this module) before calling exit(1) so the usage banner is printed
when raw or parsed is invalid.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6069-6160: The code currently only probes for local listeners when
an openshell forward exists; to fix, detect non-OpenShell listeners on the
requested port before attempting forward and treat that like a conflicting
owner: call the existing isPortBoundLocally logic (reuse the runCapture(["lsof",
...]) pattern) when portOwner === null and if it returns true then run the same
allocation/rollback flow that uses findFreeDashboardPort, buildChain, and
runOpenshell(["sandbox","delete",...]) as in the portOwner !== sandboxName
branch; ensure you reference and reuse the same heldPorts/probe implementation
and then set effectivePort/forwardTarget before calling
runOpenshell(["forward","start",...]) so onboarding fails over to an allocated
loopback port instead of silently producing an unreachable dashboard.

---

Duplicate comments:
In `@src/lib/onboard-command.ts`:
- Around line 84-91: The invalid --control-ui-port branches currently call
error(...) and exit(1) without showing the usage banner; modify both validation
failure branches (the /^\d+$/.test(raw) check and the range check for parsed) to
invoke the same usage display used by other parser errors (e.g. call showUsage()
or the existing usage/display function in this module) before calling exit(1) so
the usage banner is printed when raw or parsed is invalid.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: df140fe6-2930-40ea-aba3-837bda2e3718

📥 Commits

Reviewing files that changed from the base of the PR and between f13c776 and 20ef745.

📒 Files selected for processing (6)

src/lib/inventory-commands.test.ts
src/lib/onboard-command.test.ts
src/lib/onboard-command.ts
src/lib/onboard.ts
src/lib/registry.ts
src/nemoclaw.ts

✅ Files skipped from review due to trivial changes (2)

src/nemoclaw.ts
src/lib/inventory-commands.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

src/lib/registry.ts
src/lib/onboard-command.test.ts

Temporary instrumentation to root-cause the post-success-message hang on sandbox-survival-e2e for PR #2454. Captures lsof output, ps tree, and every process holding bash's stdout/stderr pipe FDs into the upload artifact path. exit 99 forces the failure step so the artifact uploads. Revert before merge.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/test-sandbox-survival.sh`:
- Around line 718-762: The diagnostic block that writes to DIAG_FILE ends with a
forced "exit 99", which causes the script to fail even when assertions passed;
change the script so the success path exits with 0 (restore "exit 0" instead of
"exit 99") or gate the diagnostic+nonzero exit behind an explicit failure
condition, keeping the DIAG_FILE logging intact but ensuring the normal end of
the script uses exit 0 (look for the DIAG_FILE variable and the block that
prints "=== EXIT-DIAG" and remove or conditionalize the "exit 99").

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e26fe829-a1e6-4326-b66e-243460807484

📥 Commits

Reviewing files that changed from the base of the PR and between 20ef745 and d2a74d2.

📒 Files selected for processing (1)

test/e2e/test-sandbox-survival.sh

v1 saw fd 1 = the diag file (its own redirection) instead of the runner pipe. Save the original to fds 9/8 first, then walk /proc to find every process holding the same pipe inode.

Records when bash actually starts to exit (via EXIT trap, fires right before exit syscall). Comparing that to the runner's Process completed timestamp tells us whether the ~60s post-printf gap is in bash or in the runner's exit-detection path.

If the trap fires immediately after diag (within ms), bash exits cleanly on exit 0 and the 50s+ PR-specific extra tail is runner-side. If trap fires 50s+ later, bash is hanging in fflush() before exit.

The sandbox-survival-e2e job has a 900s internal wall clock and was flaking on PRs whose install hits the slow side of GitHub runner variance. Investigation traced the failure to runner-side log batching: on a clean exit 0 with no follow-up step (if: failure() skipped), the runner sat in its ~60s log-flush window before reporting the step's exit code. A guaranteed follow-up step (if: always()) makes the runner finalize immediately — tail dropped from 60s to 1.7s in a controlled test, reclaiming ~58s of the 900s budget. The artifact upload is harmless on success: the production script doesn't write /tmp/nemoclaw-e2e-install.log, and if-no-files-found: ignore no-ops the upload.

github-actions · 2026-04-28T22:44:51Z

Selective E2E Results — ❌ Some jobs failed

Run: 25080893392
Branch: fix/2174-dashboard-port-autoalloc-v2
Requested jobs: all (no filter)
Summary: 0 passed, 16 failed, 1 skipped

Job	Result
cloud-e2e	❌ failure
cloud-experimental-e2e	❌ failure
deployment-services-e2e	⚠️ cancelled
diagnostics-e2e	❌ failure
gpu-e2e	⏭️ skipped
hermes-e2e	❌ failure
inference-routing-e2e	❌ failure
messaging-providers-e2e	❌ failure
network-policy-e2e	❌ failure
overlayfs-autofix-e2e	⚠️ cancelled
rebuild-hermes-e2e	❌ failure
rebuild-openclaw-e2e	❌ failure
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	❌ failure
shields-config-e2e	❌ failure
skip-permissions-e2e	❌ failure
snapshot-commands-e2e	❌ failure
token-rotation-e2e	❌ failure
upgrade-stale-sandbox-e2e	❌ failure

Failed jobs: cloud-e2e, cloud-experimental-e2e, messaging-providers-e2e, token-rotation-e2e, sandbox-survival-e2e, hermes-e2e, skip-permissions-e2e, sandbox-operations-e2e, inference-routing-e2e, network-policy-e2e, diagnostics-e2e, snapshot-commands-e2e, shields-config-e2e, rebuild-openclaw-e2e, upgrade-stale-sandbox-e2e, rebuild-hermes-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T23:18:28Z

Selective E2E Results — ❌ Some jobs failed

Run: 25081353877
Branch: fix/2174-dashboard-port-autoalloc-v2
Requested jobs: all (no filter)
Summary: 13 passed, 3 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-experimental-e2e	❌ failure
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	❌ failure
sandbox-operations-e2e	⚠️ cancelled
sandbox-survival-e2e	⚠️ cancelled
shields-config-e2e	✅ success
skip-permissions-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	❌ failure

Failed jobs: cloud-experimental-e2e, rebuild-openclaw-e2e, upgrade-stale-sandbox-e2e. Check run artifacts for logs.

github-actions · 2026-04-28T23:46:57Z

Selective E2E Results — ❌ Some jobs failed

Run: 25082492604
Branch: fix/2174-dashboard-port-autoalloc-v2
Requested jobs: all (no filter)
Summary: 14 passed, 4 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-experimental-e2e	❌ failure
deployment-services-e2e	❌ failure
diagnostics-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	❌ failure
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	❌ failure
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skip-permissions-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: cloud-experimental-e2e, deployment-services-e2e, rebuild-openclaw-e2e, overlayfs-autofix-e2e. Check run artifacts for logs.

github-actions · 2026-04-29T01:44:58Z

Selective E2E Results — ❌ Some jobs failed

Run: 25086059313
Branch: fix/2174-dashboard-port-autoalloc-v2
Requested jobs: all (no filter)
Summary: 17 passed, 1 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-experimental-e2e	❌ failure
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	✅ success
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skip-permissions-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: cloud-experimental-e2e. Check run artifacts for logs.

github-actions · 2026-04-29T06:25:03Z

Selective E2E Results — ❌ Some jobs failed

Run: 25093314104
Branch: fix/2174-dashboard-port-autoalloc-v2
Requested jobs: all (no filter)
Summary: 16 passed, 2 failed, 1 skipped

Job	Result
cloud-e2e	✅ success
cloud-experimental-e2e	❌ failure
deployment-services-e2e	✅ success
diagnostics-e2e	✅ success
gpu-e2e	⏭️ skipped
hermes-e2e	✅ success
inference-routing-e2e	✅ success
messaging-providers-e2e	✅ success
network-policy-e2e	✅ success
overlayfs-autofix-e2e	✅ success
rebuild-hermes-e2e	✅ success
rebuild-openclaw-e2e	✅ success
sandbox-operations-e2e	❌ failure
sandbox-survival-e2e	✅ success
shields-config-e2e	✅ success
skip-permissions-e2e	✅ success
snapshot-commands-e2e	✅ success
token-rotation-e2e	✅ success
upgrade-stale-sandbox-e2e	✅ success

Failed jobs: cloud-experimental-e2e, sandbox-operations-e2e. Check run artifacts for logs.

cjagwani · 2026-04-29T15:59:02Z

Superseded.

The crash half of #2174 was addressed by #2411 (merged 2026-04-28). The "leaks ghost sandbox" half — which this PR also covered — landed separately as #2627 (merged 2026-04-29) on top of main.

Closing this branch in favour of the two-PR sequence that actually shipped:

feat(onboard): auto-allocate dashboard port for multi-sandbox #2411 — --control-ui-port flag + auto-allocation + persistence in nemoclaw list/status
fix(onboard): roll back orphaned sandbox when dashboard alloc fails (#2174 followup) #2627 — rollbackSandboxOnFailure for ensureDashboardForward so range exhaustion no longer leaves orphaned openshell sandboxes

The salvageable hardening from this branch is in #2627. The lsof-bind-probe differentiator turned out to have a reachability bug (CodeRabbit caught it) — Aaron's version in #2411 implemented the same idea correctly, so it's not in #2627. The reuseChatUiUrlFor re-onboard port-stickiness wasn't carried forward; #2411's stale-forward cleanup achieves a different but adequate outcome, and reopening that design discussion is out of scope.

Introduce src/lib/openshell-timeouts.ts with categorized timeout constants for all openshell child-process calls: - OPENSHELL_PROBE_TIMEOUT_MS (15s) — read-only queries - OPENSHELL_OPERATION_TIMEOUT_MS (30s) — mutating commands - OPENSHELL_HEAVY_TIMEOUT_MS (60s) — destructive/long-running ops - OPENSHELL_DOWNLOAD_TIMEOUT_MS (30s) — file transfers over SSH Uses the same OPENSHELL_PROBE_TIMEOUT_MS name introduced locally in PR NVIDIA#2454 so that PR can import from the shared module on rebase. This is Phase 1 of NVIDIA#2562 — pure constants, no behavioral change. Refs: NVIDIA#2562

…2683) ## Summary Introduces a shared module of named timeout constants for all openshell child-process calls. This is Phase 1 of #2562 — pure constants with no behavioral change, establishing the vocabulary that subsequent PRs will use to bound every unbounded call site. ## Related Issue Refs #2562 (partial — Phase 1 of 4) ## Changes - Add `src/lib/openshell-timeouts.ts` with four categorized constants: - `OPENSHELL_PROBE_TIMEOUT_MS` (15s) — read-only queries (list, status, info, ssh-config) - `OPENSHELL_OPERATION_TIMEOUT_MS` (30s) — mutating commands (provider CRUD, forward, gateway select) - `OPENSHELL_HEAVY_TIMEOUT_MS` (60s) — destructive/long-running (sandbox delete, gateway destroy) - `OPENSHELL_DOWNLOAD_TIMEOUT_MS` (30s) — file transfers over SSH - Add `src/lib/openshell-timeouts.test.ts` with unit tests for positivity, ordering, and PR #2454 alignment - Uses the same `OPENSHELL_PROBE_TIMEOUT_MS` name from PR #2454 so it can import from this shared module on rebase ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi) --- Signed-off-by: Jessica Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Chores** * Centralized timeout settings for shell interactions to standardize behavior and improve reliability across operations. * **Tests** * Added tests that validate timeout values are numeric, positive, and follow the expected ordering to prevent regressions. * **Stability** * Applied the refined probe-level timeout to quick checks to make probes and lightweight operations more consistent and predictable.

…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>  ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations.  Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…VIDIA#2683) ## Summary Introduces a shared module of named timeout constants for all openshell child-process calls. This is Phase 1 of NVIDIA#2562 — pure constants with no behavioral change, establishing the vocabulary that subsequent PRs will use to bound every unbounded call site. ## Related Issue Refs NVIDIA#2562 (partial — Phase 1 of 4) ## Changes - Add `src/lib/openshell-timeouts.ts` with four categorized constants: - `OPENSHELL_PROBE_TIMEOUT_MS` (15s) — read-only queries (list, status, info, ssh-config) - `OPENSHELL_OPERATION_TIMEOUT_MS` (30s) — mutating commands (provider CRUD, forward, gateway select) - `OPENSHELL_HEAVY_TIMEOUT_MS` (60s) — destructive/long-running (sandbox delete, gateway destroy) - `OPENSHELL_DOWNLOAD_TIMEOUT_MS` (30s) — file transfers over SSH - Add `src/lib/openshell-timeouts.test.ts` with unit tests for positivity, ordering, and PR NVIDIA#2454 alignment - Uses the same `OPENSHELL_PROBE_TIMEOUT_MS` name from PR NVIDIA#2454 so it can import from this shared module on rebase ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi) --- Signed-off-by: Jessica Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Chores** * Centralized timeout settings for shell interactions to standardize behavior and improve reliability across operations. * **Tests** * Added tests that validate timeout values are numeric, positive, and follow the expected ordering to prevent regressions. * **Stability** * Applied the refined probe-level timeout to quick checks to make probes and lightweight operations more consistent and predictable.

…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>  ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations.  Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cjagwani mentioned this pull request Apr 24, 2026

fix(onboard): auto-allocate dashboard port on multi-sandbox conflict #2444

Closed

6 tasks

cjagwani self-assigned this Apr 24, 2026

cjagwani mentioned this pull request Apr 24, 2026

test(e2e): cover dashboard port auto-alloc in nightly #2455

Closed

7 tasks

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/lib/onboard-command.ts Outdated

Comment thread src/lib/onboard.ts Outdated

Comment thread src/lib/onboard.ts

Merge branch 'main' into fix/2174-dashboard-port-autoalloc-v2

d56382e

fix: apply CodeRabbit auto-fixes

f13c776

Fixed 3 file(s) based on 3 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

cjagwani mentioned this pull request Apr 24, 2026

fix(test): satisfy strict typecheck on Slack token validation tests #2458

Merged

7 tasks

jyaunches mentioned this pull request Apr 25, 2026

feat(onboard): auto-allocate dashboard port for multi-sandbox #2411

Merged

6 tasks

wscurran added bug Something fails against expected or documented behavior NemoClaw CLI dependencies Pull requests that update a dependency file labels Apr 27, 2026

wscurran mentioned this pull request Apr 28, 2026

[All Platform][Onboard][Regression] Second onboard crashes with unhandled exception and leaks ghost sandbox when dashboard port 18789 is already forwarded #2174

Closed

Merge branch 'main' into fix/2174-dashboard-port-autoalloc-v2

20ef745

Resolves conflicts in src/lib/onboard-command.ts and onboard-command.test.ts: combine main's --resume | --fresh mutual-exclusion handling with this branch's --control-ui-port flag and tests.

coderabbitai Bot reviewed Apr 28, 2026

View reviewed changes

Comment thread test/e2e/test-sandbox-survival.sh Outdated

ericksoa added 7 commits April 28, 2026 10:08

diag v2: save orig stdout/stderr before redirect

d119338

v1 saw fd 1 = the diag file (its own redirection) instead of the runner pipe. Save the original to fds 9/8 first, then walk /proc to find every process holding the same pipe inode.

diag v4: exit 0 + EXIT trap + always-upload artifact

d99edca

If the trap fires immediately after diag (within ms), bash exits cleanly on exit 0 and the 50s+ PR-specific extra tail is runner-side. If trap fires 50s+ later, bash is hanging in fflush() before exit.

ci: restore sandbox-survival failure artifact upload

c7c6125

fix(cli): bound OpenShell probes in status and destroy

98c41ec

Merge branch 'main' into fix/2174-dashboard-port-autoalloc-v2

09f4643

Merge branch 'main' into fix/2174-dashboard-port-autoalloc-v2

46a37cf

fix(onboard): preserve active dashboard forwards on reuse

8bcfb66

Merge branch 'main' into fix/2174-dashboard-port-autoalloc-v2

2d89584

test(onboard): allow WSL dashboard forward target

7264a1d

cjagwani closed this Apr 29, 2026

jyaunches mentioned this pull request Apr 29, 2026

refactor(arch): add named timeout constants for openshell execution #2683

Merged

7 tasks

Conversation

cjagwani commented Apr 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Type of Change

Verification

Notes for reviewer

AI Disclosure

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (2 warnings)

Review ran into problems

Uh oh!

coderabbitai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixes Applied Successfully

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Apr 28, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented Apr 28, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented Apr 28, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented Apr 29, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

github-actions Bot commented Apr 29, 2026

Selective E2E Results — ❌ Some jobs failed

Uh oh!

cjagwani commented Apr 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cjagwani commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading

coderabbitai Bot left a comment •

edited

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading