fix(onboard): auto-allocate dashboard port on multi-sandbox conflict#2444
fix(onboard): auto-allocate dashboard port on multi-sandbox conflict#2444cjagwani wants to merge 6 commits into
Conversation
Root cause: install.sh sources scripts/install.sh (1374-line payload), which runs resolve_installer_version() → resolve_repo_root() → git describe at source time. In isolation the test takes ~0.9s. Under vitest's default parallelism on macOS (up to 14 workers), fork()/execve() serializes and each subprocess spawn balloons 50-200x, pushing total runtime to 5-7s and blowing past the default 5000ms testTimeout. The default timeout is correct for every other test in the file; this one has a legitimately different workload (real bash + real subprocess chain). Bumping its per-test timeout is the minimum-scope fix; caching the payload's init subshells in scripts/install.sh is the general improvement and is tracked for follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Second onboard previously crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. The guard from v0.0.21 fixed silent forward stealing but left no recovery path — users had to set CHAT_UI_URL manually and the assigned port was not discoverable post-hoc. Changes: - src/lib/ports.ts: add findFreeDashboardPort picker with 10-port window and injectable probe seam for tests. - src/lib/onboard.ts: ensureDashboardForward now auto-allocates on loopback conflict and returns the chosen port; throws only on user-pinned CHAT_UI_URL or window exhaustion. Rollback try/catch around the forward + registry block deletes the openshell sandbox on failure so nemoclaw list and openshell sandbox list no longer drift. - src/lib/registry.ts: persist dashboardPort on SandboxEntry so the assigned port survives across CLI invocations. - src/lib/inventory-commands.ts: surface dashboard URL in nemoclaw list and nemoclaw status. - src/nemoclaw.ts: surface dashboard URL in nemoclaw <name> status. - src/lib/onboard-command.ts: add --control-ui-port <n> flag with precedence flag > CHAT_UI_URL env > auto-alloc > default. Fixes #2174 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
📝 WalkthroughWalkthroughAdds CLI flag and runtime logic to select, persist, and surface a local dashboard port during sandbox onboarding; implements port-finding/probing, updates registry and display commands, and adds tests and e2e coverage for allocated dashboard URLs. Rollback/cleanup is added on onboarding failure. Changes
Sequence DiagramsequenceDiagram
participant User
participant CLI as "Onboard CLI"
participant Onboard as "Onboard Core"
participant Port as "findFreeDashboardPort"
participant OS as "OS / probe / lsof"
participant Shell as "OpenShell (forward)"
participant Registry as "Registry"
User->>CLI: nemoclaw onboard [--control-ui-port N]
CLI->>CLI: parse & validate controlUiPort
CLI->>Onboard: runOnboardCommand(controlUiPort)
Onboard->>Port: findFreeDashboardPort(preferred)
Port->>OS: list forwarded ports & probe availability
OS-->>Port: existing forwards / probe results
Port-->>Onboard: effectivePort (or null)
Onboard->>Shell: openshell forward start effectivePort <sandbox>
Shell-->>Onboard: forward active (allocated port)
Onboard->>Registry: updateSandbox(name, { dashboardPort: effectivePort })
Registry-->>Onboard: persisted
Onboard-->>CLI: onboard complete
CLI-->>User: prints dashboard: http://127.0.0.1:effectivePort
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
src/lib/registry.ts (1)
25-26: KeepdashboardPortround-trippable through registration.
SandboxEntrynow supportsdashboardPort, butregisterSandbox(...)currently drops it. Persisting it there keeps the type/storage contract consistent and avoids silent loss if any caller starts setting it during initial registration.Proposed patch
diff --git a/src/lib/registry.ts b/src/lib/registry.ts @@ data.sandboxes[entry.name] = { name: entry.name, @@ disabledChannels: Array.isArray(entry.disabledChannels) && entry.disabledChannels.length > 0 ? [...entry.disabledChannels] : undefined, + dashboardPort: + Number.isInteger(entry.dashboardPort) && + (entry.dashboardPort as number) >= 1024 && + (entry.dashboardPort as number) <= 65535 + ? entry.dashboardPort + : undefined, };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/registry.ts` around lines 25 - 26, registerSandbox currently drops the new SandboxEntry.dashboardPort field during registration; update the registerSandbox implementation to persist dashboardPort when creating or merging the stored SandboxEntry. Specifically, ensure the code paths in registerSandbox that construct the saved entry (and any merge/update logic) copy sandbox.dashboardPort into the stored object (preserving nullable/optional semantics), and update any object spread/assign calls that assemble the entry so they include dashboardPort alongside existing fields to keep the type/storage contract round-trippable.src/lib/ports.test.ts (1)
96-105: Add an upper-bound test for preferred port near 65535.A boundary regression test here would protect against accidental allocation of out-of-range ports when the preferred port is at/near the max.
Suggested test case
diff --git a/src/lib/ports.test.ts b/src/lib/ports.test.ts @@ it("returns null when the 10-port window is exhausted", () => { @@ expect(port).toBeNull(); }); + + it("does not allocate ports above 65535", () => { + const port = findFreeDashboardPort(65535, { + probe: { + listForwardPorts: () => [65535], + probePortFree: () => true, + }, + }); + expect(port).toBeNull(); + }); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/ports.test.ts` around lines 96 - 105, Add a boundary test for findFreeDashboardPort that uses a preferred port at/near 65535 to ensure the function does not return or consider ports >65535; create a test similar to the existing "10-port window exhausted" case but set preferred port to 65535 (or 65534) and mock probe.listForwardPorts and probe.probePortFree so all candidate ports in the search window are "held" or unavailable, then assert the result is null (or within valid port range) to verify no out-of-range port is returned by findFreeDashboardPort.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3457-3458: The code calls ensureDashboardForward(sandboxName,
chatUiUrl) with chatUiUrl built from the default port instead of using the
sandbox’s recorded dashboard port; change the logic so the requested URL is
seeded from registry.getSandbox(sandboxName)?.dashboardPort (use that port to
build chatUiUrl) unless an explicit user override is provided, then pass that
URL into ensureDashboardForward and continue to call
registry.updateSandbox(sandboxName, { dashboardPort: reusePort }); apply the
same fix for the other occurrences that call ensureDashboardForward (the blocks
around where registry.updateSandbox is used).
- Around line 3958-3984: The allocated dashboard port from
ensureDashboardForward (allocatedDashboardPort) is registered in
registry.registerSandbox but not propagated into the live onboarding flow that
builds the completion banner URL from CHAT_UI_URL/CONTROL_UI_PORT; update
createSandbox (or the surrounding onboarding function) to return the chosen port
or set the in-memory dashboard URL/CONTROL_UI_PORT to allocatedDashboardPort
before the final banner is rendered so the printed URL matches the port saved in
registry.registerSandbox; touch the code paths that call createSandbox (and the
completion banner formatter) to consume the returned port or read the updated
in-memory variable so the completion output uses allocatedDashboardPort
consistently.
- Around line 6104-6107: The code is passing a stub probe (probePortFree: () =>
true) to findFreeDashboardPort which ignores non-OpenShell listeners and can
select ports already used by other local processes; update the call so
findFreeDashboardPort performs a real availability check by either removing the
custom probe argument (so it uses its normal port-availability probe) or replace
probePortFree with an actual local-bind probe implementation that attempts to
bind the candidate port and returns true only if bind succeeds; apply the same
fix to the other occurrence around the block referencing
heldPorts/forwardEntries at the later lines.
In `@src/lib/ports.ts`:
- Around line 70-80: The loop in findFreeDashboardPort can generate ports above
65535 when preferred is near the upper bound; guard and clamp candidates to the
valid port range by first validating preferred (return null if > 65535 or < 0),
compute a safe max offset like min(PORT_WINDOW, 65535 - preferred) and iterate
only up to that offset, and ensure you use the same probe helpers
(probe.listForwardPorts and probe.probePortFree) when checking held/probed
ports; this prevents returning out‑of‑range ports while preserving existing
behavior.
---
Nitpick comments:
In `@src/lib/ports.test.ts`:
- Around line 96-105: Add a boundary test for findFreeDashboardPort that uses a
preferred port at/near 65535 to ensure the function does not return or consider
ports >65535; create a test similar to the existing "10-port window exhausted"
case but set preferred port to 65535 (or 65534) and mock probe.listForwardPorts
and probe.probePortFree so all candidate ports in the search window are "held"
or unavailable, then assert the result is null (or within valid port range) to
verify no out-of-range port is returned by findFreeDashboardPort.
In `@src/lib/registry.ts`:
- Around line 25-26: registerSandbox currently drops the new
SandboxEntry.dashboardPort field during registration; update the registerSandbox
implementation to persist dashboardPort when creating or merging the stored
SandboxEntry. Specifically, ensure the code paths in registerSandbox that
construct the saved entry (and any merge/update logic) copy
sandbox.dashboardPort into the stored object (preserving nullable/optional
semantics), and update any object spread/assign calls that assemble the entry so
they include dashboardPort alongside existing fields to keep the type/storage
contract round-trippable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6c7f0cd6-d20b-4e98-a1d9-55f03842de20
📒 Files selected for processing (10)
src/lib/inventory-commands.test.tssrc/lib/inventory-commands.tssrc/lib/onboard-command.test.tssrc/lib/onboard-command.tssrc/lib/onboard.tssrc/lib/ports.test.tssrc/lib/ports.tssrc/lib/registry.tssrc/nemoclaw.tstest/install-preflight.test.ts
This reverts commit 682320a. Investigation showed the flake only reproduces when the local machine is already CPU-saturated (a stuck OneDrive sync service was burning a full core on my Mac). Raising the test timeout papers over that rather than fixing anything. Keeping the flake fix out of #2174 so the PR is purely about dashboard port auto-allocation. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Five fixes from CodeRabbit's review, all within #2174 scope: - src/lib/ports.ts: clamp findFreeDashboardPort to valid range — if preferred is out of 1024-65535 return null, and break the loop when candidates would exceed 65535. + 2 boundary tests. - src/lib/registry.ts: registerSandbox now persists dashboardPort in the stored entry (body was silently dropping the field). - src/lib/onboard.ts: reuse paths seed chatUiUrl from the sandbox's stored dashboardPort via new reuseChatUiUrlFor helper, so re-onboarding an auto-allocated sandbox no longer ratchets its port upward on each run. - src/lib/onboard.ts: picker now uses a real lsof-based bind probe instead of () => true, so auto-alloc won't hand out a port held by a non-openshell process (Docker container, etc.). - src/lib/onboard.ts: getDashboardAccessInfo and getDashboardGuidanceLines read the stored dashboardPort as a precedence tier so the completion banner shows the port ensureDashboardForward actually allocated (precedence: caller option > CHAT_UI_URL env > stored > default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
3543-3544:⚠️ Potential issue | 🟠 MajorUse
reuseChatUiUrlFor()on this fallback reuse path too.This is the one remaining reuse branch that still seeds
ensureDashboardForward()fromchatUiUrl, so a sandbox already recorded on an auto-allocated port can still ratchet upward here when the backup path aborts.Suggested fix
- const reusePort = ensureDashboardForward(sandboxName, chatUiUrl); + const reuseUrl = reuseChatUiUrlFor(sandboxName, chatUiUrl); + const reusePort = ensureDashboardForward(sandboxName, reuseUrl);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 3543 - 3544, The fallback path currently calls ensureDashboardForward(sandboxName, chatUiUrl) and seeds it directly with chatUiUrl which can cause an auto-allocated port to be overwritten; instead, call reuseChatUiUrlFor(sandboxName, chatUiUrl) to compute the proper URL/port reuse value and pass that result into ensureDashboardForward, then update the registry (registry.updateSandbox) with the returned dashboardPort as before—i.e., replace the direct chatUiUrl argument with the value from reuseChatUiUrlFor for the ensureDashboardForward invocation.
6099-6134:⚠️ Potential issue | 🟠 MajorRequested-port probing still misses local listeners.
The availability check only runs when another sandbox already owns
requestedPort, and it relies onlsofalone. If--control-ui-portor a storeddashboardPortpoints at a non-OpenShell listener, or non-rootlsofreturns empty for a root-owned socket,forward startcan fail at Line 6168 while this function still returns and persists the dead port. Probe the requested port beforeforward start, and use a real bind fallback instead of treating emptylsofoutput as authoritative. Based on learnings, "Empty lsof string output is NOT treated as authoritative — non-root lsof silently drops root-owned sockets ... Empty lsof output must fall through to the net probe."Also applies to: 6156-6177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6099 - 6134, The port-probing logic currently only runs when another sandbox owns requestedPort and treats an empty lsof result as authoritative; update the flow so the requestedPort is always probed before calling `forward start` (i.e., call the same probe used by `findFreeDashboardPort` for `requestedPort` up-front) and change `isPortBoundLocally` (and the `probePortFree` you pass into `findFreeDashboardPort`) to return a tri-state/unknown (e.g., null/undefined) when `runCapture([... "lsof" ...])` yields an empty string so the caller falls back to a real bind/net-probe instead of assuming free; ensure the `probePortFree` implementation then attempts an actual bind probe when lsof is inconclusive so `chosen` never points at a dead/non-bindable port.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3543-3544: The fallback path currently calls
ensureDashboardForward(sandboxName, chatUiUrl) and seeds it directly with
chatUiUrl which can cause an auto-allocated port to be overwritten; instead,
call reuseChatUiUrlFor(sandboxName, chatUiUrl) to compute the proper URL/port
reuse value and pass that result into ensureDashboardForward, then update the
registry (registry.updateSandbox) with the returned dashboardPort as
before—i.e., replace the direct chatUiUrl argument with the value from
reuseChatUiUrlFor for the ensureDashboardForward invocation.
- Around line 6099-6134: The port-probing logic currently only runs when another
sandbox owns requestedPort and treats an empty lsof result as authoritative;
update the flow so the requestedPort is always probed before calling `forward
start` (i.e., call the same probe used by `findFreeDashboardPort` for
`requestedPort` up-front) and change `isPortBoundLocally` (and the
`probePortFree` you pass into `findFreeDashboardPort`) to return a
tri-state/unknown (e.g., null/undefined) when `runCapture([... "lsof" ...])`
yields an empty string so the caller falls back to a real bind/net-probe instead
of assuming free; ensure the `probePortFree` implementation then attempts an
actual bind probe when lsof is inconclusive so `chosen` never points at a
dead/non-bindable port.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f65396b5-5843-47e3-ab22-8a085ab5fe69
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/ports.test.tssrc/lib/ports.tssrc/lib/registry.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/registry.ts
Extends test/e2e/test-double-onboard.sh with three #2174 regression checks: - Second-sandbox onboard output must log "allocated port" (auto-alloc fired) - nemoclaw list must show two distinct dashboard ports for A and B The existing script already exercises the two-sandbox flow (Phase 4) but didn't assert anything port-specific — a silent regression of the auto-alloc path would have passed. Also wires the script into nightly-e2e.yaml as a new double-onboard-e2e job. Uses the script's fake OpenAI endpoint so no NVIDIA_API_KEY is required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-double-onboard.sh`:
- Around line 365-367: The test currently suppresses failures from the command
"run_nemoclaw list" by appending "|| true" which masks real regressions; change
the block around LIST_LOG/run_nemoclaw/list_output so that you run "run_nemoclaw
list >\"$LIST_LOG\" 2>&1" without "|| true", capture its exit code (e.g.,
rc=$?), read list_output="$(cat "$LIST_LOG")", and if rc is non-zero print or
dump "$LIST_LOG" for diagnostics and exit with that non-zero code to fail the
test; keep the temporary log capture (LIST_LOG) and the list_output parsing but
do not swallow the command's non-zero exit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9a1953a5-16df-44e0-96aa-05e8b0e36aae
📒 Files selected for processing (2)
.github/workflows/nightly-e2e.yamltest/e2e/test-double-onboard.sh
| LIST_LOG="$(mktemp)" | ||
| run_nemoclaw list >"$LIST_LOG" 2>&1 || true | ||
| list_output="$(cat "$LIST_LOG")" |
There was a problem hiding this comment.
Do not suppress nemoclaw list failures here.
Ignoring the exit code with || true can hide a real list regression while this block still passes on partial output. Keep parsing for diagnostics, but fail on non-zero exit.
Suggested fix
LIST_LOG="$(mktemp)"
-run_nemoclaw list >"$LIST_LOG" 2>&1 || true
+if run_nemoclaw list >"$LIST_LOG" 2>&1; then
+ list_exit=0
+else
+ list_exit=$?
+fi
list_output="$(cat "$LIST_LOG")"
rm -f "$LIST_LOG"
+
+if [ "$list_exit" -eq 0 ]; then
+ pass "nemoclaw list succeeded"
+else
+ fail "nemoclaw list exited $list_exit"
+fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/e2e/test-double-onboard.sh` around lines 365 - 367, The test currently
suppresses failures from the command "run_nemoclaw list" by appending "|| true"
which masks real regressions; change the block around
LIST_LOG/run_nemoclaw/list_output so that you run "run_nemoclaw list
>\"$LIST_LOG\" 2>&1" without "|| true", capture its exit code (e.g., rc=$?),
read list_output="$(cat "$LIST_LOG")", and if rc is non-zero print or dump
"$LIST_LOG" for diagnostics and exit with that non-zero code to fail the test;
keep the temporary log capture (LIST_LOG) and the list_output parsing but do not
swallow the command's non-zero exit.
…to nightly" This reverts commit d877ec2. Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Summary
Second
nemoclaw onboardpreviously crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. This PR auto-allocates the next free port on loopback conflict, persists it, surfaces it vialist/status, and adds a--control-ui-portflag for pinning. Also closes the "leaks ghost sandbox" path by rolling back the openshell sandbox if forward setup or registry write throws.Related Issue
Fixes #2174
Changes
src/lib/ports.ts—findFreeDashboardPortpicker with 10-port window and injectable probe seam. Clamped to 1024-65535 with boundary guard. 6 unit tests.src/lib/onboard.ts—ensureDashboardForwardauto-allocates on loopback conflict and returns the chosen port; throws only on user-pinnedCHAT_UI_URLor window exhaustion. Rollbacktry/catcharound forward + registry deletes the openshell sandbox on failure sonemoclaw listandopenshell sandbox listno longer drift. Picker uses a reallsofbind probe. Reuse paths seedchatUiUrlfrom the sandbox's storeddashboardPortviareuseChatUiUrlForso re-onboarding an auto-allocated sandbox doesn't ratchet its port upward.src/lib/registry.ts—dashboardPort?: numberonSandboxEntry, persisted in the stored entry byregisterSandbox.src/lib/inventory-commands.ts— dashboard URL innemoclaw listandnemoclaw status. 2 unit tests.src/nemoclaw.ts— dashboard URL innemoclaw <name> status.src/lib/onboard-command.ts—--control-ui-port <n>flag (precedence: flag >CHAT_UI_URLenv > stored port > auto-alloc > default). 2 unit tests.src/lib/onboard.ts(dashboard display) —getDashboardAccessInfoandgetDashboardGuidanceLinesread the storeddashboardPortas a precedence tier so the onboard completion banner shows the actually-allocated port, not the default.test/e2e/test-double-onboard.sh— three new assertions in the existing two-sandbox phase: second onboard must log "allocated port", andnemoclaw listmust show two distinct dashboard URLs..github/workflows/nightly-e2e.yaml— newdouble-onboard-e2ejob wires the script into nightly so regressions land within 24h.Verification
Unit suite: 10 new tests, 1829 total passing locally (picker × 6, list/status display × 2, CLI flag × 2).
E2E:
test/e2e/test-double-onboard.shnow asserts the #2174 auto-alloc path end-to-end. Runs nightly on an Ubuntu runner against a fake OpenAI endpoint — noNVIDIA_API_KEYneeded. Regressions in the two-sandbox dance will auto-file a GitHub issue via the existingnotify-on-failurejob.Manual repro: two back-to-back
nemoclaw onboard --non-interactive— first takes 18789, second logsDashboard port 18789 in use by 'alpha'; allocated port 18790 for 'beta',nemoclaw <name> statusprints each port, both dashboards reachable.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesNotes for reviewer
ensureDashboardForwardthrows rather thanprocess.exit(1)(unlike refactor(cli): extract dashboard delivery chain into contract/health/recover modules #2398's pattern) because the ghost-rollbacktry/catchincreateSandboxneeds the exception to know it should clean up.lsoffor the bind probe so auto-alloc avoids ports bound by non-openshell processes (Docker, etc.). Falls back gracefully if lsof is missing.AI Disclosure
Signed-off-by: Charan Jagwani cjagwani@nvidia.com