fix(onboard): model OpenShell Docker bridge route in preflight#3459
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThis PR introduces a Docker-based probe system to verify sandbox containers can reach the OpenShell gateway within the Docker bridge network. The probe is invoked at three gateway startup/reuse decision points to ensure reachability before the gateway is considered ready. ChangesDocker Gateway Sandbox Reachability
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 275-276: Top-level probe imports (e.g., the const {
verifySandboxBridgeGatewayReachableOrExit } =
require("./onboard/gateway-sandbox-reachability") as typeof import(...) line)
inflate this entrypoint; move these requires out of module scope and into the
functions that actually call them (lazy-load inside the verification/onboard
flow, e.g., inside the function that triggers sandbox/gateway checks) so you can
delete the top-level declaration lines and neutralize the +5-line growth; apply
the same change to the other similar top-level probe requires referenced in this
file so the entrypoint file loses at least five lines total.
In `@src/lib/onboard/gateway-sandbox-reachability.test.ts`:
- Around line 62-79: The test only asserts presence of the --add-host token but
not its position; update the test (within the same it block using seen.args from
runImpl) to assert ordering by finding the index of the element that includes
"host.openshell.internal:10.0.0.1" (e.g., via seen.args.findIndex(x =>
x.includes("host.openshell.internal:10.0.0.1"))) and the index of the probe
payload (e.g., the element that includes "nc -zw7 host.openshell.internal
9090"), and add an assertion that the add-host index is less than the probe
index so --add-host appears before the image/command.
In `@src/lib/onboard/gateway-sandbox-reachability.ts`:
- Around line 160-168: The summarizeProbeResult function currently returns only
the first non-empty diagnostic (details[0]) which can hide important stderr
content used later for DNS classification; update summarizeProbeResult (working
with SandboxBridgeProbeRunResult and outputTail) to preserve and return the full
concatenated diagnostics string (e.g., join all entries in details with " | " or
newline) so that DNS-related messages in stderr aren’t discarded before
classification; ensure the returned value remains a string and still falls back
to the existing "docker run did not complete the probe" message when details is
empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b75c6c84-06e5-44ce-bbe0-41f56387f572
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/gateway-sandbox-reachability.test.tssrc/lib/onboard/gateway-sandbox-reachability.tstest/gateway-liveness-probe.test.ts
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Selective E2E Results — ❌ Some jobs failedRun: 25796444894
|
Selective E2E Results — ✅ All requested jobs passedRun: 25796612494
|
… Linux (#3472) > Re-attributed replacement for #3465. Same code, single squashed commit authored and signed-off by me. Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3465 will be closed in favour of this one. ## Summary Fixes #3340. On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. Not fixed by PR #3459 (probes port 8080, which has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side container check uses `--add-host host-gateway` on the default bridge, never testing the real sandbox path). **Root cause:** Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it. ## What this PR does Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a short-lived `busybox` container on the `openshell-docker` network (the exact network the Docker-driver gateway creates for sandboxes) and performs `nc -zw5 host.openshell.internal:11435`, mirroring the real sandbox route. The probe is called inside `setupInference()` **before** `upsertProvider` and `runOpenshell(["inference", "set", ...])`, so: - A `tcp_failed` result (nc exit 1 on Linux native) prints a targeted `sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation and exits 1. - Because no inference route is committed on failure, `isInferenceRouteReady()` stays false — both a fresh re-run and `--resume` re-enter `setupInference()` and re-probe after the user applies the UFW fix. - A `probe_unavailable` result (Docker Desktop, DNS failure, network not found, non-0/non-1 nc exit) continues silently — these environments either don't have UFW or aren't using the Docker-driver gateway. - Skipped entirely on WSL. ## Why PR #3459 doesn't fix this Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080 to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule. ## Why the previous probe (PR #3441) was reverted PR #3441's probe had no `--add-host`, so `host.openshell.internal` was unresolvable inside the probe container — nc always exited 1 regardless of whether UFW was enabled. This PR adds `--add-host host.openshell.internal:<gatewayIp>` and a `isNameResolutionFailure()` guard that reclassifies DNS errors as `probe_unavailable` (non-fatal) rather than `tcp_failed`. ## Scope Targets **Docker-driver gateway mode** (v0.0.40+) where sandboxes run on the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments (`openshell-cluster-nemoclaw`) don't have this network; the probe returns `probe_unavailable` and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies. ## Functional verification Verified end-to-end on a Brev VM: - **Without blocking:** probe container runs `nc -zw5 host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok` ✓ - **With `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j DROP`:** same probe → `status 1` → `tcp_failed` → UFW remediation message printed ✓ ## Test plan - [x] 25 unit tests covering all result variants, argument construction, host-gateway mode, DNS failure classification, env var override, and UFW message formatter — all pass - [x] No new test failures vs main baseline (verified after `npm run build:cli`) - [x] Biome lint/format, SPDX headers, shfmt all pass (`make check` — `hadolint` binary missing on Brev VM, pre-existing) - [x] No new TypeScript type errors (`npm run typecheck:cli`) - [x] Functional end-to-end: correct `status 0` / `status 1` behaviour with and without iptables blocking Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding for Ollama-local now probes sandbox→proxy reachability and will halt onboarding with a clear, actionable error if a TCP connectivity failure is detected. * **Tests** * Added extensive tests covering reachability probing, network parsing, DNS/connection failure cases, Docker run behavior, and user-facing error message formatting. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3472) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve the two output threads in #3456 left after the core dead-loop fix landed via #3459 + #3434: Sub-bug #3 — `src/lib/onboard.ts` printed `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu` with a literal `<name>` placeholder, and assumed at least one sandbox was registered. When the GPU-passthrough mismatch hit on the State B re-run path with an empty registry (the dead-loop case), the hint was not actionable. Replace with a registry-aware helper at `src/lib/onboard/gpu-recovery.ts` that renders the right shape: - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu` - one sandbox → suggest destroy --yes --cleanup-gateway for that name - multiple sandboxes → list each, only the last gets --cleanup-gateway Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed `Destroyed gateway 'nemoclaw' skipped` when the openshell destroy no-op'd (gateway already gone) — the "Destroyed … skipped" wording was self-contradictory. Extend `runOptional` with an `onSkip` option; route the gateway destroy to emit `Gateway 'nemoclaw' already removed or unreachable` on no-op. Tests: - `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal `<name>` placeholder anywhere in the output; cover empty / single / multi-sandbox cases; defensive filter on whitespace names so a `nemoclaw destroy` rendering can never happen. - `src/lib/actions/uninstall/run-plan.test.ts`: assert the new "already removed or unreachable" wording and the absence of the "Destroyed gateway 'nemoclaw' skipped" string. The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch) is already addressed by #3459 + #3434 + #3483; #3456 will close once this lands. See the #3456 status comment for the full mapping. Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464, which left the literal `<name>` placeholder in tests per CodeRabbit feedback that was never addressed. Signed-off-by: Charan Jagwani <charjags100@gmail.com>
## Summary - Bump the docs metadata and version switcher to `0.0.41`. - Add v0.0.41 release notes plus operator guidance for OpenShell pinning, Docker bridge reachability, Local Ollama proxy reachability, and Docker GPU onboarding diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated docs. ## Source summary - #3434 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Linux Docker-driver GPU onboarding behavior, diagnostics, cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting escape hatch. - #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall` removes all installer-managed OpenShell helper binaries unless `--keep-openshell` is passed. - #3446 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document blueprint-driven OpenShell install pin resolution and fallback behavior. - #3472 -> `docs/inference/use-local-inference.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document sandbox-side Local Ollama auth proxy reachability checks and firewall remediation. - #3459 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Docker-driver sandbox-to-gateway reachability checks and firewall remediation. ## Test plan - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user` - `make docs` - `git diff --check` - `npm run build:cli` - `npm run typecheck:cli` - pre-commit hooks during `git commit` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `nemoclaw inference get` command to check current inference settings * Improved gateway health validation with Linux firewall remediation guidance * **Bug Fixes** * Enhanced proxy readiness validation with sandbox network path probes * Improved local Ollama route onboarding with rerun-safe fixes * Better sandbox-to-gateway connectivity detection * **Documentation** * Expanded troubleshooting guidance for firewall and connectivity issues * Updated CLI reference with new command and environment variable documentation * Added gateway binding and Docker-driver GPU compatibility guidance <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3531) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…3520) > **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will mark ready once batch self-review + CI complete. ## Summary Closes the two remaining output threads in #3456 after the core dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full sub-bug mapping in the [#3456 status comment](#3456 (comment)). - **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint replaced with a registry-aware helper. - **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped` self-contradictory wording replaced with `Gateway 'nemoclaw' already removed or unreachable`. ## Acceptance criteria mapping | Sub-bug | Resolution | Evidence | |---|---|---| | #1 dead loop | Already fixed on main (#3459) | out of scope | | #2 firewall diagnostic | Already fixed on main (#3459) | out of scope | | **#3** literal `<name>` placeholder | **This PR** | `src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` | | **#4** misleading "skipped" wording | **This PR** | `src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` | | #5 uninstall residuals | Already fixed on main (#3483) | out of scope | ## Behavior matrix `gpuPassthroughRecoveryLines(names)`: | Input | Suggestion | |---|---| | `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` | | one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway && nemoclaw onboard --gpu` | | many sandboxes | each `destroy --yes`, only the last gets `--cleanup-gateway` | ## Test plan ``` npm run typecheck:cli npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts ``` 22 tests pass (6 new + 16 existing). ## Notes for reviewers - This is the work [#3464 attempted](#3464); that PR was closed without merging after CodeRabbit asked for the `<name>` placeholder to be forbidden in tests via negative assertion. This PR adopts that refinement. - `runOptional` extension is backwards-compatible — existing callers without `onSkip` get the original wording. Closes #3456 once merged. --------- Signed-off-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Charan Jagwani <charjags100@gmail.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
probe_unavailableinstead of host-firewall failuresWhy
#3441 tried to catch the real partner/Brev failure from #3439, but its helper container did not actually behave like an OpenShell Docker sandbox after OpenShell #1128. Real sandboxes get explicit
host.openshell.internalrouting: native Linux Docker maps it to theopenshell-dockerbridge gateway IP, while Docker Desktop/VM-backed Docker uses Docker'shost-gatewayroute.This replacement keeps the useful early diagnostic while avoiding the false DNS/host-gateway failures that forced the #3441 revert.
Validation
Local Docker note: this checkout did not have an
openshell-dockernetwork, which now maps toprobe_unavailable/ continue rather than a firewall diagnosis.Summary by CodeRabbit
Improvements
User-facing
Tests