fix(onboard): model OpenShell Docker bridge route in preflight #3459

Merged
cv merged 3 commits into main from fix/docker-gateway-bridge-route-preflight
May 13, 2026

Conversation

@ericksoa

@ericksoa ericksoa commented May 13, 2026

Contributor

Summary

  • supersede #3441 (fix(onboard): preflight sandbox-bridge → gateway reachability on Docker-driver) with a Docker-driver sandbox reachability probe that mirrors OpenShell's current Docker routing model
  • inspect the managed Docker network IPAM config, prefer the IPv4 bridge gateway, and inject the same host.openshell.internal mapping real OpenShell sandboxes receive
  • classify probe setup/DNS/network-inspect failures as non-blocking probe_unavailable instead of host-firewall failures
  • keep the UFW remediation only for native bridge-gateway TCP failures after the exact OpenShell route has been modeled
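
The IPAM inspection step in the list above could look roughly like the sketch below. It assumes the JSON shape printed by `docker network inspect` (an array of networks, each with `IPAM.Config` entries carrying `Subnet` and `Gateway`). The names `pickIpv4Bridge` and `BridgeInfo` are hypothetical and are not taken from the PR.

```typescript
// Subset of the `docker network inspect <name>` JSON that this sketch reads.
interface IpamConfigEntry {
  Subnet?: string;
  Gateway?: string;
}
interface DockerNetworkInspect {
  IPAM?: { Config?: IpamConfigEntry[] | null };
}

// Hypothetical result shape; the PR's DockerBridgeNetworkInfo may differ.
interface BridgeInfo {
  subnet: string;
  gateway: string;
}

// Loose dotted-quad check; it does not validate that each octet is <= 255.
const IPV4 = /^\d{1,3}(\.\d{1,3}){3}$/;

// Returns the first IPAM entry whose gateway looks like an IPv4 address, or
// null when the input is unparsable or has no usable IPv4 gateway.
function pickIpv4Bridge(inspectJson: string): BridgeInfo | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(inspectJson);
  } catch {
    return null;
  }
  const network = (Array.isArray(parsed) ? parsed[0] : parsed) as
    | DockerNetworkInspect
    | null
    | undefined;
  const config = network?.IPAM?.Config;
  const entries = Array.isArray(config) ? config : [];
  for (const entry of entries) {
    if (entry?.Gateway && entry?.Subnet && IPV4.test(entry.Gateway)) {
      return { subnet: entry.Subnet, gateway: entry.Gateway };
    }
  }
  return null;
}
```

In this sketch a `null` result would map to the non-blocking probe_unavailable outcome rather than to a firewall diagnosis.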

Why

#3441 tried to catch the real partner/Brev failure from #3439, but its helper container did not actually behave like an OpenShell Docker sandbox after OpenShell #1128. Real sandboxes get explicit host.openshell.internal routing: native Linux Docker maps it to the openshell-docker bridge gateway IP, while Docker Desktop/VM-backed Docker uses Docker's host-gateway route.
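
A minimal sketch of the two routing modes described above, assuming route kinds named `bridge_gateway` and `host_gateway` and a busybox probe image on the `openshell-docker` network. The helper names `addHostArgs` and `probeRunArgs` are hypothetical. `host-gateway` is Docker's special `--add-host` value, and whether the real probe attaches to the same network in both modes is an assumption.

```typescript
// Route kinds as named in the review walkthrough; the PR's exact type may differ.
type RouteKind = "bridge_gateway" | "host_gateway";

const SANDBOX_HOST_ALIAS = "host.openshell.internal";

// Builds the --add-host arguments for a probe container.
// bridge_gateway: native Linux Docker, alias points at the bridge gateway IP.
// host_gateway: Docker Desktop / VM-backed Docker, alias uses "host-gateway".
function addHostArgs(kind: RouteKind, bridgeGatewayIp?: string): string[] {
  if (kind === "bridge_gateway") {
    if (!bridgeGatewayIp) {
      throw new Error("bridge_gateway route requires a gateway IP");
    }
    return ["--add-host", `${SANDBOX_HOST_ALIAS}:${bridgeGatewayIp}`];
  }
  return ["--add-host", `${SANDBOX_HOST_ALIAS}:host-gateway`];
}

// Assembles `docker run` arguments with --add-host placed before the image,
// since options after the image name are passed to the container command.
function probeRunArgs(kind: RouteKind, port: number, bridgeGatewayIp?: string): string[] {
  return [
    "run",
    "--rm",
    "--network",
    "openshell-docker",
    ...addHostArgs(kind, bridgeGatewayIp),
    "busybox",
    "nc",
    "-z",
    "-w",
    "5",
    SANDBOX_HOST_ALIAS,
    String(port),
  ];
}
```

For example, `probeRunArgs("bridge_gateway", 8080, "172.19.0.1")` yields an argument list whose `--add-host` value is `host.openshell.internal:172.19.0.1`.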

This replacement keeps the useful early diagnostic while avoiding the false DNS/host-gateway failures that forced the #3441 revert.

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run src/lib/onboard/gateway-sandbox-reachability.test.ts test/gateway-liveness-probe.test.ts
  • npm run checks
  • git diff --check

Local Docker note: this checkout did not have an openshell-docker network, which now maps to probe_unavailable / continue rather than a firewall diagnosis.

Summary by CodeRabbit

  • Improvements

    • Docker gateway startup now performs sandbox-bridge reachability checks before reporting healthy, reducing startup surprises.
  • User-facing

    • Clearer diagnostics and guidance when sandbox-to-gateway connectivity fails (including conditional firewall hints for TCP failures).
  • Tests

    • Added tests covering gateway reachability checks and related messaging to prevent regressions.


@ericksoa ericksoa self-assigned this May 13, 2026
@coderabbitai

coderabbitai Bot commented May 13, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 558835b4-706a-47d6-9dc2-4841706341e5

📥 Commits

Reviewing files that changed from the base of the PR and between 769d81d and fc3dc36.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/gateway-sandbox-reachability.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts

📝 Walkthrough

Walkthrough

This PR introduces a Docker-based probe system to verify sandbox containers can reach the OpenShell gateway within the Docker bridge network. The probe is invoked at three gateway startup/reuse decision points to ensure reachability before the gateway is considered ready.

Changes

Docker Gateway Sandbox Reachability

| Layer / File(s) | Summary |
|---|---|
| Sandbox Bridge Reachability Probe Implementation (src/lib/onboard/gateway-sandbox-reachability.ts) | Exports SandboxBridgeReachabilityResult, SandboxBridgeRouteKind, and DockerBridgeNetworkInfo shapes. Implements Docker IPAM network inspection, host-gateway capability detection, route mode selection (bridge-gateway vs host-gateway with appropriate --add-host aliases), probe output normalization, and the main async isSandboxBridgeGatewayReachable() function that orchestrates network inspection, constructs docker run arguments with nc connectivity test, and maps results into structured reason codes (ok, tcp_failed, probe_unavailable). |
| Reachability Messaging & Verification Helpers (src/lib/onboard/gateway-sandbox-reachability.ts) | formatSandboxBridgeUnreachableMessage() returns diagnostic text for unreachable cases: empty on success, warnings on probe unavailable, UFW firewall guidance for TCP failures in bridge-gateway mode only. verifySandboxBridgeGatewayReachableOrExit() runs the reachability check, logs warnings/errors by reason, and optionally exits the process or throws on failure. Test helpers exposed via __test export. |
| Gateway Startup Integration (src/lib/onboard.ts) | Imports verifySandboxBridgeGatewayReachableOrExit and invokes it at three gateway-ready decision points after confirming health: during gateway reuse following drift checks, after adopting/attaching an existing gateway process, and in the startup health-poll loop before returning success, ensuring sandbox bridge connectivity is verified before the gateway is considered ready. |
| Unit Tests for Reachability Module (src/lib/onboard/gateway-sandbox-reachability.test.ts) | Validates route modeling (bridge vs host-gateway selection with expected host aliases), successful reachability scenarios returning ok: true with subnet/gateway/routeKind, correct --add-host argument ordering before probe image, error classification mapping missing networks and DNS failures to probe_unavailable and TCP failures to tcp_failed, and message formatting emitting UFW commands only for tcp_failed with bridge_gateway routing. |
| Gateway Liveness Probe Integration Test (test/gateway-liveness-probe.test.ts) | Verifies startDockerDriverGateway invokes verifySandboxBridgeGatewayReachableOrExit(exitOnFailure) at least three times and before key markers indicating gateway reuse and health state. |
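
The mapping from probe output to the three reason codes (ok, tcp_failed, probe_unavailable) can be sketched as below. This is illustrative only: the exit-status convention (0 is ok, 1 is a TCP failure, anything else is unavailable) is borrowed from the description of the related Ollama probe in #3472, and the stderr patterns and the names `classifyProbe` and `ProbeRun` are assumptions, not the module's actual code.

```typescript
type ProbeReason = "ok" | "tcp_failed" | "probe_unavailable";

interface ProbeRun {
  status: number | null; // exit status of `docker run`; null if it never ran
  stderr: string;
}

// Illustrative patterns only; the real module's detection rules are not shown
// in the PR text.
const DNS_FAILURE =
  /bad address|name or service not known|could not resolve|temporary failure in name resolution/i;
const NETWORK_MISSING = /network \S+ not found/i;

// Setup, DNS, and network-inspect problems are reported as probe_unavailable so
// that only a genuine TCP failure can lead to a firewall diagnosis.
function classifyProbe(run: ProbeRun): ProbeReason {
  if (run.status === 0) return "ok";
  if (run.status === null) return "probe_unavailable";
  if (DNS_FAILURE.test(run.stderr) || NETWORK_MISSING.test(run.stderr)) {
    return "probe_unavailable";
  }
  if (run.status === 1) return "tcp_failed";
  return "probe_unavailable";
}
```

Checking the stderr patterns before the exit status matters here, because a DNS failure inside the probe container can also exit with status 1.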

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3458: Directly conflicts—removes the same gateway-sandbox-reachability probe module and all call sites this PR adds.
  • NVIDIA/NemoClaw#3441: Adds the same Docker-driver sandbox-bridge reachability gate with identical module exports, helper functions, and gateway startup integration points.
  • NVIDIA/NemoClaw#3378: Updates startDockerDriverGateway readiness logic in the same region; this PR adds a sandbox-bridge reachability gate while that PR adds TCP listen and child-exit verification.

Suggested labels

NemoClaw CLI, fix, Docker, OpenShell, v0.0.40

Suggested reviewers

  • jyaunches
  • prekshivyas

Poem

🐰 I hopped through docker nets with glee,
I knocked on gateways, looked for the key.
A tiny nc, a ping so bright,
The bridge replied — all clear by night.
Now containers sing and gateways agree.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding a Docker bridge route reachability model to the onboarding preflight checks, which is the primary focus of the changes across multiple files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 275-276: Top-level probe imports (e.g., the const {
verifySandboxBridgeGatewayReachableOrExit } =
require("./onboard/gateway-sandbox-reachability") as typeof import(...) line)
inflate this entrypoint; move these requires out of module scope and into the
functions that actually call them (lazy-load inside the verification/onboard
flow, e.g., inside the function that triggers sandbox/gateway checks) so you can
delete the top-level declaration lines and neutralize the +5-line growth; apply
the same change to the other similar top-level probe requires referenced in this
file so the entrypoint file loses at least five lines total.

In `@src/lib/onboard/gateway-sandbox-reachability.test.ts`:
- Around line 62-79: The test only asserts presence of the --add-host token but
not its position; update the test (within the same it block using seen.args from
runImpl) to assert ordering by finding the index of the element that includes
"host.openshell.internal:10.0.0.1" (e.g., via seen.args.findIndex(x =>
x.includes("host.openshell.internal:10.0.0.1"))) and the index of the probe
payload (e.g., the element that includes "nc -zw7 host.openshell.internal
9090"), and add an assertion that the add-host index is less than the probe
index so --add-host appears before the image/command.

In `@src/lib/onboard/gateway-sandbox-reachability.ts`:
- Around line 160-168: The summarizeProbeResult function currently returns only
the first non-empty diagnostic (details[0]) which can hide important stderr
content used later for DNS classification; update summarizeProbeResult (working
with SandboxBridgeProbeRunResult and outputTail) to preserve and return the full
concatenated diagnostics string (e.g., join all entries in details with " | " or
newline) so that DNS-related messages in stderr aren’t discarded before
classification; ensure the returned value remains a string and still falls back
to the existing "docker run did not complete the probe" message when details is
empty.
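
The third finding above, which asks for every diagnostic to be kept rather than only the first, could be addressed along these lines. This is a sketch under stated assumptions: the `ProbeRunResult` fields and the `outputTail` behaviour shown here are hypothetical, since the PR's SandboxBridgeProbeRunResult and helper are not reproduced on this page. Only the fallback message is taken from the review text.

```typescript
// Hypothetical shape; the PR's SandboxBridgeProbeRunResult is not shown here.
interface ProbeRunResult {
  stdout?: string;
  stderr?: string;
  error?: string;
}

// Keeps at most the last `maxChars` characters of a trimmed stream.
function outputTail(text: string | undefined, maxChars = 400): string {
  const trimmed = (text ?? "").trim();
  return trimmed.length > maxChars ? trimmed.slice(-maxChars) : trimmed;
}

// Joins every non-empty diagnostic instead of returning only the first one, so
// DNS-related stderr text is still present when the result is classified.
function summarizeProbeResult(result: ProbeRunResult): string {
  const details = [result.error, outputTail(result.stderr), outputTail(result.stdout)].filter(
    (part): part is string => typeof part === "string" && part.trim().length > 0,
  );
  return details.length > 0 ? details.join(" | ") : "docker run did not complete the probe";
}
```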

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b75c6c84-06e5-44ce-bbe0-41f56387f572

📥 Commits

Reviewing files that changed from the base of the PR and between 09e1793 and 769d81d.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-sandbox-reachability.test.ts
  • src/lib/onboard/gateway-sandbox-reachability.ts
  • test/gateway-liveness-probe.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard/gateway-sandbox-reachability.test.ts
Comment thread src/lib/onboard/gateway-sandbox-reachability.ts Outdated
@github-actions

github-actions Bot commented May 13, 2026

Contributor

E2E Advisor Recommendation

Required E2E: gateway-health-honest-e2e, double-onboard-e2e, onboard-repair-e2e
Optional E2E: onboard-resume-e2e, network-policy-e2e, sandbox-operations-e2e, overlayfs-autofix-e2e, cloud-onboard-e2e

Dispatch hint: gateway-health-honest-e2e

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gateway-health-honest-e2e: This regression job already guards the exact log strings ('Docker-driver gateway is healthy', 'Reusing existing Docker-driver gateway') that the PR now gates behind verifySandboxBridgeGatewayReachableOrExit. A regression in startDockerDriverGateway sequencing (e.g. logging healthy before the probe runs, or swallowing a process.exit) would surface here. The PR also touches the comment block adjacent to the fix for #3111 ([Linux][Install] PR #3001 Docker-driver gateway requires Ubuntu 24.04+ (GLIBC 2.39): not documented, fails silently on Ubuntu 22.04 with false "healthy" status).
  • double-onboard-e2e: Two of the three new probe call sites are on the gateway-reuse paths (reuse via pid+metadata and reuse via port-listener adoption). A second nemoclaw onboard is the canonical way these branches are exercised. If the new probe wrongly fails-closed against an already-healthy reused gateway, this job will fail.
  • onboard-repair-e2e: onboard repair drives startGatewayForRecovery → startGatewayWithOptions → startDockerDriverGateway with exitOnFailure=false. The new probe's process.exit(1) path is gated on exitOnFailure, so this job exercises the non-exit branch and the throw path that callers must tolerate.
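
The exitOnFailure gating described in the list above can be sketched as follows. The exit and logging functions are injected so both branches can be exercised without terminating the process. The function name `verifyOrExit`, the `VerifyDeps` shape, and the message strings are hypothetical and are not the PR's actual code.

```typescript
type ProbeReason = "ok" | "tcp_failed" | "probe_unavailable";

interface VerifyDeps {
  exit: (code: number) => never; // process.exit in production
  warn: (message: string) => void;
  error: (message: string) => void;
}

// ok: return silently. probe_unavailable: warn and continue.
// tcp_failed: log an error, then exit(1) when exitOnFailure is set, otherwise
// throw so recovery callers can handle the failure themselves.
function verifyOrExit(reason: ProbeReason, exitOnFailure: boolean, deps: VerifyDeps): void {
  if (reason === "ok") return;
  if (reason === "probe_unavailable") {
    deps.warn("sandbox bridge reachability not verified; continuing");
    return;
  }
  deps.error("sandbox bridge cannot reach the gateway");
  if (exitOnFailure) deps.exit(1);
  throw new Error("sandbox bridge gateway unreachable");
}
```

A repair-style caller would pass `exitOnFailure = false` and wrap the call in `try`/`catch`, which is the branch the onboard-repair-e2e job is said to exercise.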

Optional E2E

  • onboard-resume-e2e: Resume goes through the same Docker-driver gateway startup paths and is a good orthogonal confidence check that the probe does not destabilise repeated/partial onboards.
  • network-policy-e2e: Exercises real sandbox container networking out of the OpenShell Docker bridge. Useful confirmation that the new probe (busybox image, --add-host aliases, --network openshell-docker) does not interact poorly with restricted policy tier.
  • sandbox-operations-e2e: End-to-end sandbox lifecycle is the consumer of host.openshell.internal:GATEWAY_PORT reachability that the probe asserts; passing here corroborates that the probe's success criterion matches real sandbox behaviour.
  • overlayfs-autofix-e2e: Another job that drives a fresh Docker-driver gateway start from install.sh; useful to verify the probe does not regress the fresh-start path on hosts with the overlayfs auto-fix in play.
  • cloud-onboard-e2e: Broad onboard smoke that touches the same Docker-driver gateway path; cheap signal that the new probe doesn't false-positive in a default cloud environment.

New E2E recommendations

  • sandbox-bridge-firewall-boundary (high): The whole point of verifySandboxBridgeGatewayReachableOrExit is to fail closed (process.exit(1)) when a host firewall blocks the OpenShell Docker bridge subnet from reaching GATEWAY_PORT. No existing E2E script in test/e2e injects such a block (the nearest, test-network-policy.sh, exercises egress policy from inside sandboxes, not host-side ingress from the bridge). Without coverage, a regression that turns the new exit-on-failure into a no-op or that mis-classifies a real block as 'probe_unavailable' would ship silently.
    • Suggested test: test/e2e/test-sandbox-bridge-firewall.sh — on a Linux runner: run install.sh + onboard once to bring the Docker-driver gateway up, then (a) inject a deny rule for the openshell-docker subnet to GATEWAY_PORT (e.g. iptables -I DOCKER-USER … -j DROP) and assert the next nemoclaw onboard exits non-zero with the UFW remediation hint and route_kind=bridge_gateway message; (b) restore the rule and assert onboard succeeds and prints the reuse log; (c) hide/rename the openshell-docker network and assert the warn-only 'Could not verify sandbox bridge reachability … continuing' path with no UFW command and a zero exit.
  • docker-driver-gateway-reuse-with-probe (medium): Two of the three new probe call sites are explicitly on reuse paths (pid+metadata reuse, port-listener adoption). test-double-onboard.sh covers the happy reuse path but not a scenario where reuse is healthy yet sandbox→gateway routing is broken (e.g. the openshell-docker network was recreated with a different gateway IP after a Docker daemon restart). This is exactly the class of bug tracked in #3258 ([macOS][Onboard] nemoclaw onboard step [4/8] fails with "Connection refused" while preflight reports gateway healthy; stale cached health, regression of #2020) that the PR seems aimed at, and it is currently uncovered.
    • Suggested test: test/e2e/test-gateway-reuse-bridge-route-drift.sh — onboard once, recreate the openshell-docker network with a different subnet/gateway while leaving the gateway process alive, then re-onboard and assert the probe rejects the now-unreachable bridge route and surfaces the route-aware diagnostic instead of silently logging '✓ Reusing existing Docker-driver gateway'.

Dispatch hint

  • Workflow: .github/workflows/regression-e2e.yaml
  • jobs input: gateway-health-honest-e2e

@ericksoa ericksoa added the v0.0.41, platform: ubuntu (Affects Ubuntu Linux environments), platform: brev (Affects Brev hosted development environments), and bug (Something fails against expected or documented behavior) labels May 13, 2026
@github-actions

Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 25796444894
Branch: fix/docker-gateway-bridge-route-preflight
Requested jobs: all (no filter)
Summary: 2 passed, 1 failed, 2 skipped

Job Result
brave-search-e2e ✅ success
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
credential-migration-e2e ⚠️ cancelled
credential-sanitization-e2e ⚠️ cancelled
deployment-services-e2e ⚠️ cancelled
device-auth-health-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ⚠️ cancelled
hermes-e2e ⚠️ cancelled
hermes-inference-switch-e2e ⚠️ cancelled
hermes-slack-e2e ❌ failure
inference-routing-e2e ⚠️ cancelled
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
kimi-inference-compat-e2e ⚠️ cancelled
launchable-smoke-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
openclaw-inference-switch-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ⚠️ cancelled
rebuild-hermes-stale-base-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
runtime-overrides-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
telegram-injection-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled

Failed jobs: hermes-slack-e2e. Check run artifacts for logs.

@github-actions

Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 25796612494
Branch: fix/docker-gateway-bridge-route-preflight
Requested jobs: all (no filter)
Summary: 39 passed, 0 failed, 2 skipped

Job Result
brave-search-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
deployment-services-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success

@ericksoa ericksoa requested a review from cv May 13, 2026 16:06
@cv cv merged commit 11837d8 into main May 13, 2026
25 checks passed
cv pushed a commit that referenced this pull request May 13, 2026
… Linux (#3472)

> Re-attributed replacement for #3465. Same code, single squashed commit
authored and signed-off by me. Force-push is blocked on the original
branch, so this is a fresh branch + new PR; #3465 will be closed in
favour of this one.

## Summary

Fixes #3340.

On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth
proxy on port 11435 is unreachable from sandbox containers, causing
inference calls to hang. Not fixed by PR #3459 (probes port 8080, which
has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side
container check uses `--add-host host-gateway` on the default bridge,
never testing the real sandbox path).

**Root cause:** Port 11435 has no Docker DNAT rule, so traffic from
sandbox containers reaches the host UFW INPUT chain — where a
default-deny policy silently drops it.

## What this PR does

Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a
short-lived `busybox` container on the `openshell-docker` network (the
exact network the Docker-driver gateway creates for sandboxes) and
performs `nc -zw5 host.openshell.internal:11435`, mirroring the real
sandbox route.

The probe is called inside `setupInference()` **before**
`upsertProvider` and `runOpenshell(["inference", "set", ...])`, so:

- A `tcp_failed` result (nc exit 1 on Linux native) prints a targeted
`sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation
and exits 1.
- Because no inference route is committed on failure,
`isInferenceRouteReady()` stays false — both a fresh re-run and
`--resume` re-enter `setupInference()` and re-probe after the user
applies the UFW fix.
- A `probe_unavailable` result (Docker Desktop, DNS failure, network not
found, non-0/non-1 nc exit) continues silently — these environments
either don't have UFW or aren't using the Docker-driver gateway.
- Skipped entirely on WSL.
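
A minimal sketch of the remediation text described in the list above, assuming the subnet comes from the `openshell-docker` network inspection. Only the `sudo ufw allow from <subnet> to any port 11435 proto tcp` template is taken from this description; the function names and the surrounding wording are hypothetical.

```typescript
type ProbeReason = "ok" | "tcp_failed" | "probe_unavailable";

// Renders the UFW rule quoted in the description for a given subnet and port.
function ufwAllowCommand(subnet: string, port: number): string {
  return `sudo ufw allow from ${subnet} to any port ${port} proto tcp`;
}

// Returns the lines to print for a probe outcome. Only tcp_failed produces
// output; ok and probe_unavailable continue silently.
function remediationLines(reason: ProbeReason, subnet: string, port: number): string[] {
  if (reason !== "tcp_failed") return [];
  return [
    `Sandbox containers cannot reach port ${port} on the host.`,
    "If UFW is enabled, allow the sandbox subnet and re-run onboarding:",
    `  ${ufwAllowCommand(subnet, port)}`,
  ];
}
```

With the subnet from the functional verification below, `ufwAllowCommand("172.19.0.0/16", 11435)` renders `sudo ufw allow from 172.19.0.0/16 to any port 11435 proto tcp`.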

## Why PR #3459 doesn't fix this

Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080
to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT,
so that probe passes even with UFW blocking everything. Port 11435 has
no such rule.

## Why the previous probe (PR #3441) was reverted

PR #3441's probe had no `--add-host`, so `host.openshell.internal` was
unresolvable inside the probe container — nc always exited 1 regardless
of whether UFW was enabled. This PR adds `--add-host
host.openshell.internal:<gatewayIp>` and a `isNameResolutionFailure()`
guard that reclassifies DNS errors as `probe_unavailable` (non-fatal)
rather than `tcp_failed`.

## Scope

Targets **Docker-driver gateway mode** (v0.0.40+) where sandboxes run on
the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments
(`openshell-cluster-nemoclaw`) don't have this network; the probe
returns `probe_unavailable` and continues silently. Users re-onboarding
to v0.0.40+ migrate to Docker-driver gateway mode where the probe
applies.

## Functional verification

Verified end-to-end on a Brev VM:

- **Without blocking:** probe container runs `nc -zw5
host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok`
✓
- **With `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j
DROP`:** same probe → `status 1` → `tcp_failed` → UFW remediation
message printed ✓

## Test plan

- [x] 25 unit tests covering all result variants, argument construction,
host-gateway mode, DNS failure classification, env var override, and UFW
message formatter — all pass
- [x] No new test failures vs main baseline (verified after `npm run
build:cli`)
- [x] Biome lint/format, SPDX headers, shfmt all pass (`make check` —
`hadolint` binary missing on Brev VM, pre-existing)
- [x] No new TypeScript type errors (`npm run typecheck:cli`)
- [x] Functional end-to-end: correct `status 0` / `status 1` behaviour
with and without iptables blocking

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

## Summary by CodeRabbit

* **New Features**
* Onboarding for Ollama-local now probes sandbox→proxy reachability and
will halt onboarding with a clear, actionable error if a TCP
connectivity failure is detected.

* **Tests**
* Added extensive tests covering reachability probing, network parsing,
DNS/connection failure cases, Docker run behavior, and user-facing error
message formatting.


---------

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
cjagwani pushed a commit that referenced this pull request May 14, 2026
Resolve the two output threads in #3456 left after the core dead-loop fix
landed via #3459 + #3434:

Sub-bug #3 — `src/lib/onboard.ts` printed
  `nemoclaw <name> destroy --yes && nemoclaw onboard --gpu`
with a literal `<name>` placeholder, and assumed at least one sandbox
was registered. When the GPU-passthrough mismatch hit on the State B
re-run path with an empty registry (the dead-loop case), the hint was
not actionable. Replace with a registry-aware helper at
`src/lib/onboard/gpu-recovery.ts` that renders the right shape:
  - empty registry → suggest `nemoclaw uninstall && nemoclaw onboard --gpu`
  - one sandbox → suggest destroy --yes --cleanup-gateway for that name
  - multiple sandboxes → list each, only the last gets --cleanup-gateway

Sub-bug #4 — `src/lib/actions/uninstall/run-plan.ts` printed
  `Destroyed gateway 'nemoclaw' skipped`
when the openshell destroy no-op'd (gateway already gone) — the
"Destroyed … skipped" wording was self-contradictory. Extend
`runOptional` with an `onSkip` option; route the gateway destroy to
emit `Gateway 'nemoclaw' already removed or unreachable` on no-op.
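
One possible shape for the `onSkip` extension is sketched below. The real `runOptional` signature is not shown in this message, so the parameters, the boolean "did work" convention, and the return value are assumptions. Only the two message strings come from the text above.

```typescript
interface RunOptionalOptions {
  // Hypothetical option: supplies the line to log when the step was a no-op.
  onSkip?: () => string;
}

// Runs a best-effort step. `step` returns true when it did work and false when
// it was a no-op; the returned string is the line that would be logged.
function runOptional(label: string, step: () => boolean, options: RunOptionalOptions = {}): string {
  const didWork = step();
  if (didWork) return label;
  // Callers without onSkip keep the original "<label> skipped" wording.
  return options.onSkip ? options.onSkip() : `${label} skipped`;
}
```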

Tests:
- `src/lib/onboard/gpu-recovery.test.ts` (6 tests): forbid literal
  `<name>` placeholder anywhere in the output; cover empty / single /
  multi-sandbox cases; defensive filter on whitespace names so a
  `nemoclaw  destroy` rendering can never happen.
- `src/lib/actions/uninstall/run-plan.test.ts`: assert the new
  "already removed or unreachable" wording and the absence of the
  "Destroyed gateway 'nemoclaw' skipped" string.

The core dead loop itself (sub-bugs #1, #2 and State B GPU mismatch)
is already addressed by #3459 + #3434 + #3483; #3456 will close once
this lands. See the #3456 status comment for the full mapping.

Refs #3456. Mirrors (and tightens) the approach in the closed PR #3464,
which left the literal `<name>` placeholder in tests per CodeRabbit
feedback that was never addressed.

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
cv pushed a commit that referenced this pull request May 14, 2026
## Summary
- Bump the docs metadata and version switcher to `0.0.41`.
- Add v0.0.41 release notes plus operator guidance for OpenShell
pinning, Docker bridge reachability, Local Ollama proxy reachability,
and Docker GPU onboarding diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3434 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Linux Docker-driver GPU onboarding behavior, diagnostics,
cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting
escape hatch.
- #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall`
removes all installer-managed OpenShell helper binaries unless
`--keep-openshell` is passed.
- #3446 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document blueprint-driven OpenShell install pin resolution and fallback
behavior.
- #3472 -> `docs/inference/use-local-inference.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document sandbox-side Local Ollama auth proxy reachability checks and
firewall remediation.
- #3459 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Docker-driver sandbox-to-gateway reachability checks and
firewall remediation.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`
- pre-commit hooks during `git commit`


## Summary by CodeRabbit

* **New Features**
* Added `nemoclaw inference get` command to check current inference
settings
* Improved gateway health validation with Linux firewall remediation
guidance

* **Bug Fixes**
  * Enhanced proxy readiness validation with sandbox network path probes
  * Improved local Ollama route onboarding with rerun-safe fixes
  * Better sandbox-to-gateway connectivity detection

* **Documentation**
* Expanded troubleshooting guidance for firewall and connectivity issues
* Updated CLI reference with new command and environment variable
documentation
  * Added gateway binding and Docker-driver GPU compatibility guidance

cv added a commit that referenced this pull request May 14, 2026
…3520)

> **Draft for visibility.** Issue-autopilot Stages 4-5 of #3456. Will
mark ready once batch self-review + CI complete.

## Summary

Closes the two remaining output threads in #3456 after the core
dead-loop fix already landed on `main` (via #3459, #3434, #3483). Full
sub-bug mapping in the [#3456 status
comment](#3456 (comment)).

- **Sub-bug #3** — `nemoclaw <name> destroy --yes` recovery hint
replaced with a registry-aware helper.
- **Sub-bug #4** — `Destroyed gateway 'nemoclaw' skipped`
self-contradictory wording replaced with `Gateway 'nemoclaw' already
removed or unreachable`.

## Acceptance criteria mapping

| Sub-bug | Resolution | Evidence |
|---|---|---|
| #1 dead loop | Already fixed on main (#3459) | out of scope |
| #2 firewall diagnostic | Already fixed on main (#3459) | out of scope
|
| **#3** literal `<name>` placeholder | **This PR** |
`src/lib/onboard/gpu-recovery.ts` + `onboard.ts:10387-10405` |
| **#4** misleading "skipped" wording | **This PR** |
`src/lib/actions/uninstall/run-plan.ts:210-228, 407-414` |
| #5 uninstall residuals | Already fixed on main (#3483) | out of scope
|

## Behavior matrix

`gpuPassthroughRecoveryLines(names)`:

| Input | Suggestion |
|---|---|
| `null` / `[]` | `nemoclaw uninstall && nemoclaw onboard --gpu` |
| one sandbox | `nemoclaw <name> destroy --yes --cleanup-gateway &&
nemoclaw onboard --gpu` |
| many sandboxes | each `destroy --yes`, only the last gets
`--cleanup-gateway` |
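
The matrix above can be read as the following sketch. It is not the code in `src/lib/onboard/gpu-recovery.ts`: the function is deliberately given a different, hypothetical name, and the trailing `nemoclaw onboard --gpu` line in the multi-sandbox case is an assumption, since the table only specifies the destroy commands.

```typescript
// Sketch of the behavior matrix; whitespace-only names are dropped so a
// "nemoclaw  destroy" rendering cannot occur.
function sketchGpuRecoveryLines(names: readonly string[] | null | undefined): string[] {
  const sandboxes = (names ?? []).map((name) => name.trim()).filter((name) => name.length > 0);
  if (sandboxes.length === 0) {
    return ["nemoclaw uninstall && nemoclaw onboard --gpu"];
  }
  if (sandboxes.length === 1) {
    return [`nemoclaw ${sandboxes[0]} destroy --yes --cleanup-gateway && nemoclaw onboard --gpu`];
  }
  const lines = sandboxes.map((name, index) => {
    const isLast = index === sandboxes.length - 1;
    return `nemoclaw ${name} destroy --yes${isLast ? " --cleanup-gateway" : ""}`;
  });
  // Assumption: the re-onboard command follows the destroy lines.
  lines.push("nemoclaw onboard --gpu");
  return lines;
}
```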

## Test plan

```
npm run typecheck:cli
npx vitest run src/lib/onboard/gpu-recovery.test.ts src/lib/actions/uninstall/run-plan.test.ts
```

22 tests pass (6 new + 16 existing).

## Notes for reviewers

- This is the work [#3464
attempted](#3464); that PR was
closed without merging after CodeRabbit asked for the `<name>`
placeholder to be forbidden in tests via negative assertion. This PR
adopts that refinement.
- `runOptional` extension is backwards-compatible — existing callers
without `onSkip` get the original wording.

Closes #3456 once merged.

---------

Signed-off-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Charan Jagwani <charjags100@gmail.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
@wscurran wscurran added area: cli (Command line interface, flags, terminal UX, or output), area: install (Install, setup, prerequisites, or uninstall flow), area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow), area: packaging (Packages, images, registries, installers, or distribution), area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery), bug-fix (PR fixes a bug or regression), platform: container (Affects Docker, containerd, Podman, or images) and removed area: packaging (Packages, images, registries, installers, or distribution), priority: high, bug (Something fails against expected or documented behavior) labels Jun 3, 2026

Labels

  • area: cli (Command line interface, flags, terminal UX, or output)
  • area: install (Install, setup, prerequisites, or uninstall flow)
  • area: onboarding (Onboarding FSM, provider setup, sandbox launch, or first-run flow)
  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • bug-fix (PR fixes a bug or regression)
  • platform: brev (Affects Brev hosted development environments)
  • platform: container (Affects Docker, containerd, Podman, or images)
  • platform: ubuntu (Affects Ubuntu Linux environments)


3 participants