fix(onboard): auto-allocate dashboard port on multi-sandbox conflict#2454
fix(onboard): auto-allocate dashboard port on multi-sandbox conflict#2454cjagwani wants to merge 16 commits into
Conversation
Second nemoclaw onboard crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. The v0.0.21 guard stopped silent forward-stealing but left no recovery path — users had to set CHAT_UI_URL manually and the assigned port was not discoverable post-hoc. Changes: - src/lib/ports.ts: findFreeDashboardPort picker with 10-port window, injectable probe seam, and 1024-65535 range clamp. 6 unit tests. - src/lib/onboard.ts: ensureDashboardForward auto-allocates on loopback conflict (with a real lsof bind probe to avoid ports bound by other processes) and returns the chosen port. On user-pinned CHAT_UI_URL or window exhaustion it rolls back the openshell sandbox before exiting, closing the "leaks ghost sandbox" path from the issue title. Reuse paths seed chatUiUrl from the sandbox's stored dashboardPort via new reuseChatUiUrlFor helper so re-onboarding an auto-allocated sandbox doesn't ratchet its port upward. - src/lib/registry.ts: dashboardPort?: number on SandboxEntry, persisted by registerSandbox. - src/lib/inventory-commands.ts: dashboard URL in nemoclaw list and nemoclaw status. 2 unit tests. - src/nemoclaw.ts: dashboard URL in nemoclaw <name> status. - src/lib/onboard-command.ts: --control-ui-port <n> flag (precedence: flag > CHAT_UI_URL env > stored port > default). 2 unit tests. - src/lib/onboard.ts (banner): getDashboardAccessInfo and getDashboardGuidanceLines read stored dashboardPort as a precedence tier so the completion banner shows the port actually allocated. Fixes #2174 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds optional dashboard port allocation and propagation: a new control-ui-port CLI flag, automatic free-port discovery on conflicts, persistence of allocated ports to the sandbox registry, and emission of per-sandbox local dashboard URLs in list/status outputs. Tests for ports, onboarding parsing, and inventory output were added. Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as "nemoclaw onboard"
participant Onboard as "onboard (ensureDashboardForward)"
participant Probe as "PortProbe / openshell forward list"
participant Registry as "sandbox registry"
participant Run as "sandbox runtime (chat UI forward)"
CLI->>Onboard: parse args (--control-ui-port?)
Onboard->>Probe: check preferred port ownership & list forwarded ports
alt preferred free
Onboard->>Run: start forward at preferred
Run-->>Onboard: returns allocated port (preferred)
else conflict
Onboard->>Probe: findFreeDashboardPort(preferred)
alt found port
Onboard->>Run: start forward at chosen port
Run-->>Onboard: returns allocated port
else none found
Onboard-->>CLI: error / exit(1)
end
end
Onboard->>Registry: updateSandbox(..., { dashboardPort: allocatedPort })
Registry-->>Onboard: persisted
Onboard-->>CLI: finish (CHAT_UI_URL temporarily set/restored)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-command.ts`:
- Around line 143-149: The code sets process.env.CHAT_UI_URL when
options.controlUiPort is provided but never restores the previous value, causing
cross-invocation leakage; modify the block around deps.runOnboard(options) to
save the original process.env.CHAT_UI_URL (e.g., const _orig =
process.env.CHAT_UI_URL), set it when options.controlUiPort !== null, await
deps.runOnboard(options), and then restore the original value afterwards
(reassign or delete process.env.CHAT_UI_URL if it was undefined) so repeated
calls to runOnboardCommand / deps.runOnboard do not retain the pinned URL.
In `@src/lib/onboard.ts`:
- Around line 7093-7102: printDashboard() is reconstructing
chatUiUrl/chain/guidance from process.env.CHAT_UI_URL and CONTROL_UI_PORT
instead of reusing the stored-port precedence logic, so change printDashboard()
to derive its UI URL and guidance via the same helpers/values used earlier (use
registry.getSandbox(sandboxName)?.dashboardPort, the storedUrl/ storedPort
logic, and the computed chatUiUrl variable or the same precedence:
options.chatUiUrl || process.env.CHAT_UI_URL || storedUrl ||
`http://127.0.0.1:${CONTROL_UI_PORT}`) so the post-onboard banner reflects an
auto-allocated dashboardPort; update any inline uses at printDashboard and the
similar block around lines 7135-7142 to reference those symbols instead of
rebuilding from env/defaults.
- Around line 6793-6827: ensureDashboardForward currently unconditionally
deletes the sandbox on port-forward failures; change its signature to accept an
options flag (e.g., rollbackSandboxOnFailure: boolean = false) and only call
runOpenshell(["sandbox","delete", sandboxName], ...) when
rollbackSandboxOnFailure is true (both in the userPinned/isLoopback early-exit
branch and the chosen===null branch). Leave default behavior false so reuse
paths (the callers at the earlier reuse flows) keep existing sandboxes intact,
and update only the new-sandbox creation path to call
ensureDashboardForward(..., { rollbackSandboxOnFailure: true }) after the
sandbox is freshly created.
🪄 Autofix (Beta)
✅ Autofix completed
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2459036f-c0d1-49c6-ae11-67924d315e1c
📒 Files selected for processing (9)
src/lib/inventory-commands.test.tssrc/lib/inventory-commands.tssrc/lib/onboard-command.test.tssrc/lib/onboard-command.tssrc/lib/onboard.tssrc/lib/ports.test.tssrc/lib/ports.tssrc/lib/registry.tssrc/nemoclaw.ts
|
Note Autofix is a beta feature. Expect some limitations and changes as we gather feedback and continue to improve it. Fixes Applied SuccessfullyFixed 3 file(s) based on 3 unresolved review comments. Files modified:
Commit: The changes have been pushed to the Time taken: |
Fixed 3 file(s) based on 3 unresolved review comments. Co-authored-by: CodeRabbit <noreply@coderabbit.ai>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
6786-6873:⚠️ Potential issue | 🟠 MajorAuto-allocation still misses non-OpenShell port conflicts.
findFreeDashboardPort()only runs whenforward listalready reports another sandbox owner. If the requested loopback port is held by any other local listener,portOwnerstaysnull,forward startfails, and this code only warns. On the create path at Lines 4496-4520, that still persistsdashboardPortand prints a broken URL even though nothing is forwarded. Probe the requested port for local binds beforeforward start, and treat a rejected start as a hard failure for fresh creates.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6786 - 6873, When creating a sandbox, probe the requested loopback port for local listeners before attempting openshell forwarding and treat a start rejection as a hard failure for fresh creates: add the same local-bind probe used in this block (isPortBoundLocally using runCapture + lsof) earlier when portOwner === null so that if requestedPort is already bound you call findFreeDashboardPort (or allocate a new effectivePort and forwardTarget using buildChain) instead of proceeding to start; additionally, if runOpenshell(...) / runCaptureOpenshell(["forward","start",...]) (fwdResult) returns non-zero during the create path, perform the same rollback (runOpenshell(["sandbox","delete", sandboxName], { ignoreError: true })) and exit with an error message rather than merely warning — reference the symbols requestedPort, portOwner, isPortBoundLocally, findFreeDashboardPort, effectivePort, forwardTarget, buildChain, fwdResult, and sandboxName to locate the insertion points.
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
6805-6839:⚠️ Potential issue | 🔴 CriticalScope rollback to newly created sandboxes only.
ensureDashboardForward()is also used by the reuse paths at Lines 3971-3975, 4004-4008, 4052-4056, and 4072-4076. The unconditionalsandbox deletehere turns a port-forward setup error into destructive data loss for an already-existing sandbox. Gate rollback behind an explicit "fresh create" flag and leave reuse flows intact.🔧 Suggested scope guard
function ensureDashboardForward( sandboxName: string, chatUiUrl = `http://127.0.0.1:${CONTROL_UI_PORT}`, + options: { rollbackSandboxOnFailure?: boolean } = {}, ): number { + const rollback = () => { + if (options.rollbackSandboxOnFailure !== true) return; + runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true }); + }; const chain = buildChain({ chatUiUrl, isWsl: isWsl() }); ... if (userPinnedEnv || !isLoopback) { - runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true }); + rollback(); console.error(` Port ${requestedPort} is already forwarded for sandbox '${portOwner}'.`); ... } ... if (chosen === null) { - runOpenshell(["sandbox", "delete", sandboxName], { ignoreError: true }); + rollback(); console.error(` All ports ${requestedPort}-${requestedPort + 9} are forwarded.`); ... }Pass
{ rollbackSandboxOnFailure: true }only from the new-sandbox path after creation succeeds.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6805 - 6839, The current unconditional rollback in ensureDashboardForward (the runOpenshell(["sandbox","delete", sandboxName]) call) will delete already-existing sandboxes during reuse flows; change ensureDashboardForward to accept an options flag (e.g., rollbackSandboxOnFailure) and only perform runOpenshell(...sandbox delete...) when that flag is true, keep the existing error logging and process.exit behavior otherwise, and update the new-sandbox call site to pass { rollbackSandboxOnFailure: true } while leaving reuse-path callers untouched.
7189-7206:⚠️ Potential issue | 🟠 MajorReuse the stored-port helpers in the completion banner.
printDashboard()still rebuildschatUiUrl,chain, access URLs, and guidance fromprocess.env.CHAT_UI_URL || CONTROL_UI_PORTinstead of using the stored-port precedence added ingetDashboardAccessInfo()andgetDashboardGuidanceLines(). After auto-allocating18790, the onboarding banner will still tell the user to open18789. Route this block through the helpers, or derive the same stored-port-awarechatUiUrlhere.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 7189 - 7206, The code rebuilds chatUiUrl, chain, dashboardAccess and guidanceLines manually and can return stale port info; replace this block with the stored-port-aware helpers: call getDashboardAccessInfo(...) to obtain the dashboardAccess array (pass token and sandboxName or the same params those helpers expect) and call getDashboardGuidanceLines(...) to obtain guidanceLines; remove the manual buildChain/buildControlUiUrls logic and ensure token is passed through so any WSL/port-precedence logic remains consistent with getDashboardAccessInfo and getDashboardGuidanceLines (references: getDashboardAccessInfo, getDashboardGuidanceLines, buildChain, buildControlUiUrls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6786-6873: When creating a sandbox, probe the requested loopback
port for local listeners before attempting openshell forwarding and treat a
start rejection as a hard failure for fresh creates: add the same local-bind
probe used in this block (isPortBoundLocally using runCapture + lsof) earlier
when portOwner === null so that if requestedPort is already bound you call
findFreeDashboardPort (or allocate a new effectivePort and forwardTarget using
buildChain) instead of proceeding to start; additionally, if runOpenshell(...) /
runCaptureOpenshell(["forward","start",...]) (fwdResult) returns non-zero during
the create path, perform the same rollback (runOpenshell(["sandbox","delete",
sandboxName], { ignoreError: true })) and exit with an error message rather than
merely warning — reference the symbols requestedPort, portOwner,
isPortBoundLocally, findFreeDashboardPort, effectivePort, forwardTarget,
buildChain, fwdResult, and sandboxName to locate the insertion points.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 6805-6839: The current unconditional rollback in
ensureDashboardForward (the runOpenshell(["sandbox","delete", sandboxName])
call) will delete already-existing sandboxes during reuse flows; change
ensureDashboardForward to accept an options flag (e.g.,
rollbackSandboxOnFailure) and only perform runOpenshell(...sandbox delete...)
when that flag is true, keep the existing error logging and process.exit
behavior otherwise, and update the new-sandbox call site to pass {
rollbackSandboxOnFailure: true } while leaving reuse-path callers untouched.
- Around line 7189-7206: The code rebuilds chatUiUrl, chain, dashboardAccess and
guidanceLines manually and can return stale port info; replace this block with
the stored-port-aware helpers: call getDashboardAccessInfo(...) to obtain the
dashboardAccess array (pass token and sandboxName or the same params those
helpers expect) and call getDashboardGuidanceLines(...) to obtain guidanceLines;
remove the manual buildChain/buildControlUiUrls logic and ensure token is passed
through so any WSL/port-precedence logic remains consistent with
getDashboardAccessInfo and getDashboardGuidanceLines (references:
getDashboardAccessInfo, getDashboardGuidanceLines, buildChain,
buildControlUiUrls).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8185b0f9-9583-4ca7-844c-11ec32f2218c
📒 Files selected for processing (2)
src/lib/onboard.tssrc/lib/registry.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/registry.ts
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
6790-6881:⚠️ Potential issue | 🟠 MajorRun the port picker for local-bind conflicts too.
The fallback only triggers when
forward listalready ownsrequestedPort. If the requested port is held by any other local process,portOwnerisnull, we skipfindFreeDashboardPort(), and then persist/print a dead dashboard port afterforward startfails. This misses the exactlsof-backed conflict case the new allocator is meant to handle.🔧 Suggested change
const existingForwards = runCaptureOpenshell(["forward", "list"], { ignoreError: true }); const portOwner = findDashboardForwardOwner(existingForwards, String(requestedPort)); + const heldPorts = (existingForwards ?? "") + .split("\n") + .map((l) => l.trim()) + .filter(Boolean) + .map((l) => Number(l.split(/\s+/)[2])) + .filter((p) => Number.isFinite(p) && p > 0); + const isPortBoundLocally = (p: number): boolean => { + const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], { + ignoreError: true, + }); + return typeof out === "string" && out.trim().length > 0; + }; + const requestedPortUnavailable = + heldPorts.includes(requestedPort) || isPortBoundLocally(requestedPort); let effectivePort = requestedPort; let forwardTarget = chain.forwardTarget; - if (portOwner !== null && portOwner !== sandboxName) { + if (requestedPortUnavailable && portOwner !== sandboxName) { ... - const heldPorts = (existingForwards ?? "") - .split("\n") - .map((l) => l.trim()) - .filter(Boolean) - .map((l) => Number(l.split(/\s+/)[2])) - .filter((p) => Number.isFinite(p) && p > 0); - const isPortBoundLocally = (p: number): boolean => { - const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], { - ignoreError: true, - }); - return typeof out === "string" && out.trim().length > 0; - }; const chosen = findFreeDashboardPort(requestedPort, { probe: { listForwardPorts: () => heldPorts, probePortFree: (p: number) => !isPortBoundLocally(p), },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6790 - 6881, The current port-allocation fallback only runs when findDashboardForwardOwner(...) returns a different sandbox, but misses the case where portOwner === null yet the requestedPort is bound locally; move or hoist the local-probe logic so the code also runs findFreeDashboardPort when portOwner === null but isPortBoundLocally(requestedPort) is true. Concretely: compute isLoopback/userPinned as done now, then either (a) merge the condition to enter the allocator when (portOwner !== null && portOwner !== sandboxName) OR (portOwner === null && isPortBoundLocally(requestedPort)), or factor out the allocator block (the heldPorts mapping, isPortBoundLocally probe, and call to findFreeDashboardPort) into a helper used in both cases; ensure you still perform rollbackSandboxOnFailure, update forwardTarget via buildChain(...), and print the same user-facing messages when a free port is chosen or when no ports are available (preserve use of runCaptureOpenshell, runOpenshell, findFreeDashboardPort, and buildChain identifiers).
🧹 Nitpick comments (1)
src/lib/onboard-command.ts (1)
146-159: ScopeCHAT_UI_URLrestoration to the override path only.The
finallyblock always rewritesprocess.env.CHAT_UI_URL(Line 154-Line 158), even when--control-ui-portwas not provided. Guarding restoration withoptions.controlUiPort !== nullkeeps global env mutation strictly tied to this feature path.♻️ Suggested patch
const _origChatUiUrl = process.env.CHAT_UI_URL; try { if (options.controlUiPort !== null) { process.env.CHAT_UI_URL = `http://127.0.0.1:${options.controlUiPort}`; } await deps.runOnboard(options); } finally { - // Restore original value to prevent cross-invocation leakage - if (_origChatUiUrl === undefined) { - delete process.env.CHAT_UI_URL; - } else { - process.env.CHAT_UI_URL = _origChatUiUrl; + // Restore original value only if we overrode it in this invocation + if (options.controlUiPort !== null) { + if (_origChatUiUrl === undefined) { + delete process.env.CHAT_UI_URL; + } else { + process.env.CHAT_UI_URL = _origChatUiUrl; + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-command.ts` around lines 146 - 159, The finally block currently always restores process.env.CHAT_UI_URL even when we never overrode it; change the restore logic to only run when we previously set the env (i.e., when options.controlUiPort !== null). In the finally block around deps.runOnboard, wrap the existing restore/delete logic in a guard checking options.controlUiPort !== null so that _origChatUiUrl is only applied back if we actually changed process.env.CHAT_UI_URL.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard-command.ts`:
- Around line 82-89: The validation for --control-ui-port currently logs an
error and exits in two places (the /^\d+$/ check and the 1024–65535 range check)
but doesn't print the CLI usage like other argument validators; update both
failure branches that call error(...) and exit(1) to also invoke the parser's
usage printing function (the same helper used elsewhere in this file, e.g.,
printUsage() or usage()) before exiting so that when raw or parsed fail
validation (variables raw and parsed, and the branches that call error(...) and
exit(1)) the usage/help text is shown consistently.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6790-6881: The current port-allocation fallback only runs when
findDashboardForwardOwner(...) returns a different sandbox, but misses the case
where portOwner === null yet the requestedPort is bound locally; move or hoist
the local-probe logic so the code also runs findFreeDashboardPort when portOwner
=== null but isPortBoundLocally(requestedPort) is true. Concretely: compute
isLoopback/userPinned as done now, then either (a) merge the condition to enter
the allocator when (portOwner !== null && portOwner !== sandboxName) OR
(portOwner === null && isPortBoundLocally(requestedPort)), or factor out the
allocator block (the heldPorts mapping, isPortBoundLocally probe, and call to
findFreeDashboardPort) into a helper used in both cases; ensure you still
perform rollbackSandboxOnFailure, update forwardTarget via buildChain(...), and
print the same user-facing messages when a free port is chosen or when no ports
are available (preserve use of runCaptureOpenshell, runOpenshell,
findFreeDashboardPort, and buildChain identifiers).
---
Nitpick comments:
In `@src/lib/onboard-command.ts`:
- Around line 146-159: The finally block currently always restores
process.env.CHAT_UI_URL even when we never overrode it; change the restore logic
to only run when we previously set the env (i.e., when options.controlUiPort !==
null). In the finally block around deps.runOnboard, wrap the existing
restore/delete logic in a guard checking options.controlUiPort !== null so that
_origChatUiUrl is only applied back if we actually changed
process.env.CHAT_UI_URL.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7421a1da-48bb-42bf-8d4c-35bcef286488
📒 Files selected for processing (3)
nemoclaw-blueprint/scripts/ws-proxy-fix.jssrc/lib/onboard-command.tssrc/lib/onboard.ts
✅ Files skipped from review due to trivial changes (1)
- nemoclaw-blueprint/scripts/ws-proxy-fix.js
| if (!/^\d+$/.test(raw)) { | ||
| error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); | ||
| exit(1); | ||
| } | ||
| const parsed = Number(raw); | ||
| if (parsed < 1024 || parsed > 65535) { | ||
| error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); | ||
| exit(1); |
There was a problem hiding this comment.
Print usage text for all --control-ui-port validation failures.
Line 83 and Line 88 exit with an error but skip usage output, while other argument-validation failures in this parser print usage. Keeping this consistent improves CLI guidance.
💡 Suggested patch
if (!/^\d+$/.test(raw)) {
error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+ printOnboardUsage(error, noticeAcceptFlag);
exit(1);
}
const parsed = Number(raw);
if (parsed < 1024 || parsed > 65535) {
error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`);
+ printOnboardUsage(error, noticeAcceptFlag);
exit(1);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/lib/onboard-command.ts` around lines 82 - 89, The validation for
--control-ui-port currently logs an error and exits in two places (the /^\d+$/
check and the 1024–65535 range check) but doesn't print the CLI usage like other
argument validators; update both failure branches that call error(...) and
exit(1) to also invoke the parser's usage printing function (the same helper
used elsewhere in this file, e.g., printUsage() or usage()) before exiting so
that when raw or parsed fail validation (variables raw and parsed, and the
branches that call error(...) and exit(1)) the usage/help text is shown
consistently.
…2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by #2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by #2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. #2454). ## Related Follows from #2130 + #2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves conflicts in src/lib/onboard-command.ts and onboard-command.test.ts: combine main's --resume | --fresh mutual-exclusion handling with this branch's --control-ui-port flag and tests.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
6069-6160:⚠️ Potential issue | 🟠 MajorHandle non-OpenShell listeners before attempting the forward.
findFreeDashboardPort()only runs whenopenshell forward listalready shows another sandbox onrequestedPort. If the port is occupied by any other local process,portOwnerstaysnull, the picker is skipped, and Lines 6143-6159 only warn afterforward startfails. In the new-sandbox path that still leaves onboarding succeeding with an unreachable dashboard, which misses the auto-allocation/rollback goal.💡 Suggested fix
const existingForwards = runCaptureOpenshell(["forward", "list"], { ignoreError: true }); const portOwner = findDashboardForwardOwner(existingForwards, String(requestedPort)); + const heldPorts = (existingForwards ?? "") + .split("\n") + .map((l) => l.trim()) + .filter(Boolean) + .map((l) => Number(l.split(/\s+/)[2])) + .filter((p) => Number.isFinite(p) && p > 0); + const isPortBoundLocally = (p: number): boolean => { + const out = runCapture(["lsof", "-i", `:${p}`, "-sTCP:LISTEN", "-P", "-n"], { + ignoreError: true, + }); + return typeof out === "string" && out.trim().length > 0; + }; + const portBusyLocally = isPortBoundLocally(requestedPort); let effectivePort = requestedPort; let forwardTarget = chain.forwardTarget; - if (portOwner !== null && portOwner !== sandboxName) { + if ((portOwner !== null && portOwner !== sandboxName) || portBusyLocally) { const userPinnedEnv = !!process.env.CHAT_UI_URL; let isLoopback = true; try { const parsed = new URL( /^[a-z]+:\/\//i.test(chatUiUrl) ? chatUiUrl : `http://${chatUiUrl}`, @@ const chosen = findFreeDashboardPort(requestedPort, { probe: { listForwardPorts: () => heldPorts, probePortFree: (p: number) => !isPortBoundLocally(p), }, });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6069 - 6160, The code currently only probes for local listeners when an openshell forward exists; to fix, detect non-OpenShell listeners on the requested port before attempting forward and treat that like a conflicting owner: call the existing isPortBoundLocally logic (reuse the runCapture(["lsof", ...]) pattern) when portOwner === null and if it returns true then run the same allocation/rollback flow that uses findFreeDashboardPort, buildChain, and runOpenshell(["sandbox","delete",...]) as in the portOwner !== sandboxName branch; ensure you reference and reuse the same heldPorts/probe implementation and then set effectivePort/forwardTarget before calling runOpenshell(["forward","start",...]) so onboarding fails over to an allocated loopback port instead of silently producing an unreachable dashboard.
♻️ Duplicate comments (1)
src/lib/onboard-command.ts (1)
84-91:⚠️ Potential issue | 🟡 MinorShow usage for invalid
--control-ui-portvalues.Line 84 and Line 89 still exit without the usage banner, unlike the other parser errors in this function.
💡 Suggested patch
if (!/^\d+$/.test(raw)) { error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); + printOnboardUsage(error, noticeAcceptFlag); exit(1); } const parsed = Number(raw); if (parsed < 1024 || parsed > 65535) { error(` --control-ui-port '${raw}' must be an integer between 1024 and 65535`); + printOnboardUsage(error, noticeAcceptFlag); exit(1); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard-command.ts` around lines 84 - 91, The invalid --control-ui-port branches currently call error(...) and exit(1) without showing the usage banner; modify both validation failure branches (the /^\d+$/.test(raw) check and the range check for parsed) to invoke the same usage display used by other parser errors (e.g. call showUsage() or the existing usage/display function in this module) before calling exit(1) so the usage banner is printed when raw or parsed is invalid.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6069-6160: The code currently only probes for local listeners when
an openshell forward exists; to fix, detect non-OpenShell listeners on the
requested port before attempting forward and treat that like a conflicting
owner: call the existing isPortBoundLocally logic (reuse the runCapture(["lsof",
...]) pattern) when portOwner === null and if it returns true then run the same
allocation/rollback flow that uses findFreeDashboardPort, buildChain, and
runOpenshell(["sandbox","delete",...]) as in the portOwner !== sandboxName
branch; ensure you reference and reuse the same heldPorts/probe implementation
and then set effectivePort/forwardTarget before calling
runOpenshell(["forward","start",...]) so onboarding fails over to an allocated
loopback port instead of silently producing an unreachable dashboard.
---
Duplicate comments:
In `@src/lib/onboard-command.ts`:
- Around line 84-91: The invalid --control-ui-port branches currently call
error(...) and exit(1) without showing the usage banner; modify both validation
failure branches (the /^\d+$/.test(raw) check and the range check for parsed) to
invoke the same usage display used by other parser errors (e.g. call showUsage()
or the existing usage/display function in this module) before calling exit(1) so
the usage banner is printed when raw or parsed is invalid.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: df140fe6-2930-40ea-aba3-837bda2e3718
📒 Files selected for processing (6)
src/lib/inventory-commands.test.tssrc/lib/onboard-command.test.tssrc/lib/onboard-command.tssrc/lib/onboard.tssrc/lib/registry.tssrc/nemoclaw.ts
✅ Files skipped from review due to trivial changes (2)
- src/nemoclaw.ts
- src/lib/inventory-commands.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib/registry.ts
- src/lib/onboard-command.test.ts
Temporary instrumentation to root-cause the post-success-message hang on sandbox-survival-e2e for PR #2454. Captures lsof output, ps tree, and every process holding bash's stdout/stderr pipe FDs into the upload artifact path. exit 99 forces the failure step so the artifact uploads. Revert before merge.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/e2e/test-sandbox-survival.sh`:
- Around line 718-762: The diagnostic block that writes to DIAG_FILE ends with a
forced "exit 99", which causes the script to fail even when assertions passed;
change the script so the success path exits with 0 (restore "exit 0" instead of
"exit 99") or gate the diagnostic+nonzero exit behind an explicit failure
condition, keeping the DIAG_FILE logging intact but ensuring the normal end of
the script uses exit 0 (look for the DIAG_FILE variable and the block that
prints "=== EXIT-DIAG" and remove or conditionalize the "exit 99").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e26fe829-a1e6-4326-b66e-243460807484
📒 Files selected for processing (1)
test/e2e/test-sandbox-survival.sh
v1 saw fd 1 = the diag file (its own redirection) instead of the runner pipe. Save the original to fds 9/8 first, then walk /proc to find every process holding the same pipe inode.
Records when bash actually starts to exit (via EXIT trap, fires right before exit syscall). Comparing that to the runner's Process completed timestamp tells us whether the ~60s post-printf gap is in bash or in the runner's exit-detection path.
If the trap fires immediately after diag (within ms), bash exits cleanly on exit 0 and the 50s+ PR-specific extra tail is runner-side. If trap fires 50s+ later, bash is hanging in fflush() before exit.
The sandbox-survival-e2e job has a 900s internal wall clock and was flaking on PRs whose install hits the slow side of GitHub runner variance. Investigation traced the failure to runner-side log batching: on a clean exit 0 with no follow-up step (if: failure() skipped), the runner sat in its ~60s log-flush window before reporting the step's exit code. A guaranteed follow-up step (if: always()) makes the runner finalize immediately — tail dropped from 60s to 1.7s in a controlled test, reclaiming ~58s of the 900s budget. The artifact upload is harmless on success: the production script doesn't write /tmp/nemoclaw-e2e-install.log, and if-no-files-found: ignore no-ops the upload.
Selective E2E Results — ❌ Some jobs failedRun: 25080893392
|
Selective E2E Results — ❌ Some jobs failedRun: 25081353877
|
Selective E2E Results — ❌ Some jobs failedRun: 25082492604
|
Selective E2E Results — ❌ Some jobs failedRun: 25086059313
|
Selective E2E Results — ❌ Some jobs failedRun: 25093314104
|
|
Superseded. The crash half of #2174 was addressed by #2411 (merged 2026-04-28). The "leaks ghost sandbox" half — which this PR also covered — landed separately as #2627 (merged 2026-04-29) on top of main. Closing this branch in favour of the two-PR sequence that actually shipped:
The salvageable hardening from this branch is in #2627. The |
Introduce src/lib/openshell-timeouts.ts with categorized timeout constants for all openshell child-process calls: - OPENSHELL_PROBE_TIMEOUT_MS (15s) — read-only queries - OPENSHELL_OPERATION_TIMEOUT_MS (30s) — mutating commands - OPENSHELL_HEAVY_TIMEOUT_MS (60s) — destructive/long-running ops - OPENSHELL_DOWNLOAD_TIMEOUT_MS (30s) — file transfers over SSH Uses the same OPENSHELL_PROBE_TIMEOUT_MS name introduced locally in PR NVIDIA#2454 so that PR can import from the shared module on rebase. This is Phase 1 of NVIDIA#2562 — pure constants, no behavioral change. Refs: NVIDIA#2562
…2683) ## Summary Introduces a shared module of named timeout constants for all openshell child-process calls. This is Phase 1 of #2562 — pure constants with no behavioral change, establishing the vocabulary that subsequent PRs will use to bound every unbounded call site. ## Related Issue Refs #2562 (partial — Phase 1 of 4) ## Changes - Add `src/lib/openshell-timeouts.ts` with four categorized constants: - `OPENSHELL_PROBE_TIMEOUT_MS` (15s) — read-only queries (list, status, info, ssh-config) - `OPENSHELL_OPERATION_TIMEOUT_MS` (30s) — mutating commands (provider CRUD, forward, gateway select) - `OPENSHELL_HEAVY_TIMEOUT_MS` (60s) — destructive/long-running (sandbox delete, gateway destroy) - `OPENSHELL_DOWNLOAD_TIMEOUT_MS` (30s) — file transfers over SSH - Add `src/lib/openshell-timeouts.test.ts` with unit tests for positivity, ordering, and PR #2454 alignment - Uses the same `OPENSHELL_PROBE_TIMEOUT_MS` name from PR #2454 so it can import from this shared module on rebase ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi) --- Signed-off-by: Jessica Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Centralized timeout settings for shell interactions to standardize behavior and improve reliability across operations. * **Tests** * Added tests that validate timeout values are numeric, positive, and follow the expected ordering to prevent regressions. * **Stability** * Applied the refined probe-level timeout to quick checks to make probes and lightweight operations more consistent and predictable. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…VIDIA#2683) ## Summary Introduces a shared module of named timeout constants for all openshell child-process calls. This is Phase 1 of NVIDIA#2562 — pure constants with no behavioral change, establishing the vocabulary that subsequent PRs will use to bound every unbounded call site. ## Related Issue Refs NVIDIA#2562 (partial — Phase 1 of 4) ## Changes - Add `src/lib/openshell-timeouts.ts` with four categorized constants: - `OPENSHELL_PROBE_TIMEOUT_MS` (15s) — read-only queries (list, status, info, ssh-config) - `OPENSHELL_OPERATION_TIMEOUT_MS` (30s) — mutating commands (provider CRUD, forward, gateway select) - `OPENSHELL_HEAVY_TIMEOUT_MS` (60s) — destructive/long-running (sandbox delete, gateway destroy) - `OPENSHELL_DOWNLOAD_TIMEOUT_MS` (30s) — file transfers over SSH - Add `src/lib/openshell-timeouts.test.ts` with unit tests for positivity, ordering, and PR NVIDIA#2454 alignment - Uses the same `OPENSHELL_PROBE_TIMEOUT_MS` name from PR NVIDIA#2454 so it can import from this shared module on rebase ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [ ] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi) --- Signed-off-by: Jessica Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Centralized timeout settings for shell interactions to standardize behavior and improve reliability across operations. * **Tests** * Added tests that validate timeout values are numeric, positive, and follow the expected ordering to prevent regressions. * **Stability** * Applied the refined probe-level timeout to quick checks to make probes and lightweight operations more consistent and predictable. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Second
nemoclaw onboardpreviously crashed mid-flight when port 18789 was held by the first sandbox, blocking 12 P0/P1 QA tests that keep a baseline sandbox alive while onboarding a short-lived second one. This PR auto-allocates the next free port on loopback conflict, persists it, surfaces it vialist/status, and adds a--control-ui-portflag for pinning. Also closes the "leaks ghost sandbox" path from the issue title — unrecoverable forward-setup failures now roll back the openshell sandbox before exiting.Supersedes #2444 (clean history, same changes).
Related Issue
Fixes #2174
Changes
src/lib/ports.ts—findFreeDashboardPortpicker with 10-port window, injectable probe seam, and 1024-65535 range clamp. 6 unit tests.src/lib/onboard.ts—ensureDashboardForwardauto-allocates on loopback conflict (reallsofbind probe to avoid ports held by non-openshell processes) and returns the chosen port. On user-pinnedCHAT_UI_URLor exhaustion it rolls back the openshell sandbox before exiting. Reuse paths seedchatUiUrlfrom the sandbox's storeddashboardPortvia newreuseChatUiUrlForhelper so re-onboarding an auto-allocated sandbox doesn't ratchet its port upward on each run.src/lib/registry.ts—dashboardPort?: numberonSandboxEntry, persisted byregisterSandbox.src/lib/inventory-commands.ts— dashboard URL innemoclaw listandnemoclaw status. 2 unit tests.src/nemoclaw.ts— dashboard URL innemoclaw <name> status.src/lib/onboard-command.ts—--control-ui-port <n>flag (precedence: flag >CHAT_UI_URLenv > stored port > default). 2 unit tests.src/lib/onboard.ts(banner) —getDashboardAccessInfoandgetDashboardGuidanceLinesread storeddashboardPortas a precedence tier so the completion banner shows the port actually allocated.Type of Change
Verification
npx prek run --all-filespasses (pre-commit hooks green — prettier, ESLint, shellcheck, hadolint, gitleaks, markdownlint, Test (plugin), YAML)npm testpasses (10 new unit tests; 1829+ total passing locally)Manual repro for reviewer: two back-to-back
nemoclaw onboard --non-interactive— first takes 18789, second logsDashboard port 18789 in use by 'alpha'; allocated port 18790 for 'beta',nemoclaw <name> statusprints each port, both dashboards reachable.Notes for reviewer
ensureDashboardForwardexits (matching refactor(cli): extract dashboard delivery chain into contract/health/recover modules #2398's pattern post-[Jetson][Onboard] nemoclaw onboard second sandbox: port 18789 conflict shows raw JS stack trace instead of user-friendly error #2169) rather than throws. The ghost-rollback happens inside before the exit for the unrecoverable cases.lsoffor the bind probe so auto-alloc avoids ports bound by non-openshell processes (Docker, etc.). Falls back gracefully if lsof is missing.nightly-e2ejob are split into a separate follow-up PR to keep this change focused on [All Platform][Onboard][Regression] Second onboard crashes with unhandled exception and leaks ghost sandbox when dashboard port 18789 is already forwarded #2174's minimum close.AI Disclosure
Signed-off-by: Charan Jagwani cjagwani@nvidia.com
Summary by CodeRabbit
New Features
Tests
Chores