fix(onboard): strengthen dashboard port detection and roll back on forward-start failure#3313
Conversation
… forward-start failure Closes #3260. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a dedicated dashboard-port allocator with layered host probing, integrates it into onboard.ts, tightens rollback for TOCTOU and forward-start port-conflict failures, and expands tests for allocation, host-probe sequencing, and rollback behaviors. ChangesDashboard Port Allocation with Multi-Stage Conflict Detection
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
8980-8999:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winInclude host-bound ports in the “range exhausted” error.
If every candidate is skipped by
isPortBoundCheck, the failure text built below still only lists OpenShell forwards, so users can get “all dashboard ports are occupied” with an empty owner list. Please track host-bound candidates during the scan and include them in the thrown error so the remediation stays actionable.Suggested direction
function findAvailableDashboardPort( sandboxName: string, preferredPort: number, forwardListOutput: string | null, isPortBoundCheck: (port: number) => boolean = isPortBoundOnHost, ): number { const occupied = getOccupiedPorts(forwardListOutput); + const hostBoundPorts = new Set<number>(); const preferredStr = String(preferredPort); const owner = occupied.get(preferredStr) ?? null; // If this sandbox already owns the forward, keep it. if (owner === sandboxName) return preferredPort; // If no forward claims the port, also check the host so we don't collide // with non-OpenShell processes. - if (owner === null && !isPortBoundCheck(preferredPort)) return preferredPort; + if (owner === null) { + if (!isPortBoundCheck(preferredPort)) return preferredPort; + hostBoundPorts.add(preferredPort); + } for (let p = DASHBOARD_PORT_RANGE_START; p <= DASHBOARD_PORT_RANGE_END; p++) { const pStr = String(p); const pOwner = occupied.get(pStr) ?? null; if (pOwner === sandboxName) return p; - if (pOwner === null && !isPortBoundCheck(p)) return p; + if (pOwner === null) { + if (!isPortBoundCheck(p)) return p; + hostBoundPorts.add(p); + } } - const owners = [...occupied.entries()] + const owners = [ + ...[...occupied.entries()] + .filter( + ([p]) => Number(p) >= DASHBOARD_PORT_RANGE_START && Number(p) <= DASHBOARD_PORT_RANGE_END, + ) + .map(([p, s]) => ` ${p} → ${s}`), + ...[...hostBoundPorts].map((p) => ` ${p} → non-OpenShell host listener`), + ] - .filter( - ([p]) => Number(p) >= DASHBOARD_PORT_RANGE_START && Number(p) <= DASHBOARD_PORT_RANGE_END, - ) - .map(([p, s]) => ` ${p} → ${s}`) .join("\n");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 8980 - 8999, In findAvailableDashboardPort, currently only OpenShell-forward owners from occupied (map from getOccupiedPorts) are collected for the "range exhausted" error, so when candidates are skipped by isPortBoundCheck(host-bound) the owner list can be empty; update the scanning loop in findAvailableDashboardPort to also record host-bound ports (e.g., push p into a hostBoundPorts array when pOwner === null && isPortBoundCheck(p) returns true) as you iterate, and when throwing the final error include both the occupied owners (from occupied) and the collected hostBoundPorts so the message shows which ports are bound by the host and which are claimed by sandboxes (use the existing occupied/owner variables and isPortBoundCheck to determine and report both lists).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 9105-9129: The rollback branch unconditionally fabricates a
port-conflict Error losing the real `forward start` output; change it to
preserve and inspect the actual `fwdResult` output (from the result returned by
`runOpenshell`/the `forward start` call) and only create the EADDRINUSE-style
Error when the captured stderr/stdout contains an EADDRINUSE/EADDRINUSE-like
text; otherwise construct `err` using the original command output (include
stdout/stderr or result.message) so
`buildOrphanedSandboxRollbackMessage(sandboxName, err, ...)` shows the real
forward failure; keep existing uses of `rollbackSandboxOnFailure`, `actualPort`,
`cliName()`, `sandboxName`, and `runOpenshell()` and ensure you propagate the
preserved output into the delete/rollback logging path.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 8980-8999: In findAvailableDashboardPort, currently only
OpenShell-forward owners from occupied (map from getOccupiedPorts) are collected
for the "range exhausted" error, so when candidates are skipped by
isPortBoundCheck(host-bound) the owner list can be empty; update the scanning
loop in findAvailableDashboardPort to also record host-bound ports (e.g., push p
into a hostBoundPorts array when pOwner === null && isPortBoundCheck(p) returns
true) as you iterate, and when throwing the final error include both the
occupied owners (from occupied) and the collected hostBoundPorts so the message
shows which ports are bound by the host and which are claimed by sandboxes (use
the existing occupied/owner variables and isPortBoundCheck to determine and
report both lists).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0dd8bf16-a464-470a-ac2c-ffd730358a4f
📒 Files selected for processing (2)
src/lib/onboard.tstest/onboard.test.ts
…d ports Addresses CodeRabbit findings on #3260. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
cv
left a comment
There was a problem hiding this comment.
Can you extract these functions out of onboard.ts, please? This file is getting far too big.
…dashboard-port.ts Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ilure Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ecomes host-bound during build Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/onboard.test.ts (1)
9265-9341: ⚡ Quick winHarden source-shape assertions to reduce regex brittleness.
A few assertions are tightly coupled to formatting/tokenization (e.g., exact array literal text) or use broad multiline capture (
actualPort !== preferredPortblock), which can become flaky on harmless refactors.Proposed tightening (scope-first + semantic checks)
-const mismatchBranch = source.match( - /if \(actualPort !== preferredPort\) \{[\s\S]*?\n \}/, -); -assert.ok(mismatchBranch, "Expected actualPort !== preferredPort branch in ensureDashboardForward"); -const branchBody = mismatchBranch[0]; +const ensureFnStart = source.indexOf("function ensureDashboardForward"); +assert.ok(ensureFnStart !== -1, "ensureDashboardForward not found"); +const ensureFnBody = source.slice(ensureFnStart, source.indexOf("\nfunction ", ensureFnStart + 1) === -1 ? source.length : source.indexOf("\nfunction ", ensureFnStart + 1)); +const mismatchStart = ensureFnBody.indexOf("if (actualPort !== preferredPort)"); +assert.ok(mismatchStart !== -1, "Expected actualPort !== preferredPort branch in ensureDashboardForward"); +const branchBody = ensureFnBody.slice(mismatchStart, ensureFnBody.indexOf("return actualPort;", mismatchStart));-assert.match(source, /\["lsof", "-i", `:\$\{port\}`, "-sTCP:LISTEN", "-P", "-n"\]/); +assert.match(source, /lsof/); +assert.match(source, /-sTCP:LISTEN/); +assert.match(source, /probePortBoundSync/);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/onboard.test.ts` around lines 9265 - 9341, The tests use brittle, formatting-sensitive regexes; relax them by matching semantic tokens instead of exact array literal formatting and by narrowing the multiline capture for the actualPort mismatch branch. Update assertions in this file to (1) for isPortBoundOnHost/probePortBoundSync/EADDRINUSE, assert presence of "isPortBoundOnHost", "lsof" and each flag ("-i", "-sTCP:LISTEN", "-P", "-n") separately (and a separate assertion for the sudo variant) rather than the exact array text, (2) for ensureDashboardForward rollback checks, locate the actualPort !== preferredPort branch by matching the conditional anchor "if (actualPort !== preferredPort)" then assert the branchBody contains the key strings/behaviours ("if (rollbackSandboxOnFailure)", "became host-bound during sandbox build", the runOpenshell delete call pattern, "buildOrphanedSandboxRollbackMessage", "process.exit(1)") and separately assert the reuse-path warning text ("is taken. Using port .* instead") without depending on a wide single multiline regex capture. Use case-insensitive checks for EADDRINUSE text where appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/onboard.test.ts`:
- Around line 9265-9341: The tests use brittle, formatting-sensitive regexes;
relax them by matching semantic tokens instead of exact array literal formatting
and by narrowing the multiline capture for the actualPort mismatch branch.
Update assertions in this file to (1) for
isPortBoundOnHost/probePortBoundSync/EADDRINUSE, assert presence of
"isPortBoundOnHost", "lsof" and each flag ("-i", "-sTCP:LISTEN", "-P", "-n")
separately (and a separate assertion for the sudo variant) rather than the exact
array text, (2) for ensureDashboardForward rollback checks, locate the
actualPort !== preferredPort branch by matching the conditional anchor "if
(actualPort !== preferredPort)" then assert the branchBody contains the key
strings/behaviours ("if (rollbackSandboxOnFailure)", "became host-bound during
sandbox build", the runOpenshell delete call pattern,
"buildOrphanedSandboxRollbackMessage", "process.exit(1)") and separately assert
the reuse-path warning text ("is taken. Using port .* instead") without
depending on a wide single multiline regex capture. Use case-insensitive checks
for EADDRINUSE text where appropriate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1fad900b-74b4-434c-8a5c-2f1413e3de7c
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/dashboard-port.tstest/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard.ts
Selective E2E Results — ✅ All requested jobs passedRun: 25778615451
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25779926291
|
Selective E2E Results — ✅ All requested jobs passedRun: 25780108062
|
Selective E2E Results — ✅ All requested jobs passedRun: 25780952174
|
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Selective E2E Results — ✅ All requested jobs passedRun: 25781050208
|
Selective E2E Results — ✅ All requested jobs passedRun: 25781126807
|
## Summary Refreshes the NemoClaw documentation for the local `main` changes included in the 0.0.42 release. The update adds release notes, updates the affected user-facing setup and troubleshooting pages, bumps docs metadata to 0.0.42, and regenerates the matching user skills. ## Changes - #3537 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: Documented host-level status fields, cloudflared state-specific recovery hints, and Local Ollama auth proxy status diagnostics. - #3454 -> `docs/get-started/prerequisites.md`, `docs/get-started/quickstart.md`: Documented macOS Docker-driver onboarding and removed the expectation that standard macOS setup needs a VM driver helper. - #3514 -> `docs/inference/use-local-inference.md`: Documented compatible-endpoint retry behavior for reasoning-only smoke responses. - #3448 -> `docs/reference/commands.md`, `docs/manage-sandboxes/messaging-channels.md`: Documented canonical channel names and policy preset hints after `channels add`. - #3520 -> `docs/about/release-notes.md`: Captured clearer GPU recovery and uninstall wording in the 0.0.42 release notes. - #3313 -> `docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`: Documented stronger dashboard port detection and rollback when a forward cannot start. - #3502 -> `docs/about/release-notes.md`: Captured batched onboarding policy preset application in the 0.0.42 release notes. - #3505 -> `docs/reference/troubleshooting.md`: Documented the top-level Colima socket path. - #3421 -> `docs/about/release-notes.md`: Captured idempotent installer shim logging in the 0.0.42 release notes. - Updated `docs/project.json`, `docs/versions1.json`, and regenerated `.agents/skills/nemoclaw-user-*` outputs. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes - v0.0.42 * **Documentation** * Enhanced macOS onboarding guidance for Docker gateway setup * Improved dashboard port conflict handling with automatic rollback * Better local Ollama inference diagnostics and authentication proxy checks * Clarified status command output and recovery procedures * Refined messaging channel setup documentation * **Chores** * Version bump to 0.0.42 <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3540) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
On the createSandbox path,
nemoclaw onboardcan return a dashboard port that was free at preflight but bound by something else by the timeopenshell forward startruns (TOCTOU during the multi-minute build), or a port thatlsoffailed to flag on the first probe. The result is a freshly-built sandbox baked with an unreachableCHAT_UI_URL. This PR strengthens the host-side port detection, fails fast on forward-start error with a clean rollback, and tightens the test seam so the new behaviour is covered without spawning real probes.Related Issue
Fixes #3260.
Changes
isPortBoundOnHostwith a layered probe chain — directlsof,sudo -n lsof(catches root-owned listeners that unprivileged lsof misses on macOS), then a synchronous Nodenetbind viaspawnSync(mirrors whatopenshell forward startactually attempts). Any positive signal short-circuits.ensureDashboardForwardgets a non-zero status fromopenshell forward starton the createSandbox path (rollbackSandboxOnFailure: true), delete the just-built sandbox viabuildOrphanedSandboxRollbackMessageand exit non-zero with a clear retry message instead of returning a port that is already known to be unreachable. The reuse paths keep the existing soft-warn UX.findAvailableDashboardPortand add an injectableisPortBoundChecktest seam so the new unit tests don't have to spawn reallsof/ Node probes.findAvailableDashboardPort(port reservation semantics with stub probes) plus source-shape guards for the strengthened detection chain and rollback branch — extended intest/onboard.test.ts.Type of Change
Verification
Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes