fix(onboard): probe Ollama proxy reachability from sandbox network on Linux#3465
fix(onboard): probe Ollama proxy reachability from sandbox network on Linux#3465prekshivyas wants to merge 3 commits into
Conversation
… Linux Port 11435 (Ollama auth proxy) has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it. The existing host-side container check uses --add-host host-gateway on the default Docker bridge and misses this path. Add a new probe module (ollama-proxy-reachability) that launches a short-lived busybox container on the openshell-docker network (the same network the real sandbox uses) and performs nc -zw5 host.openshell.internal:11435. A tcp_failed result (exit 1) surfaces a targeted ufw remediation command and exits 1; non-fatal probe_unavailable results (Docker Desktop, DNS failure, network missing) log a warning and continue. Skipped entirely on WSL. Fixes #3340 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a Docker-network TCP probe and message formatter used during ChangesOllama Proxy Reachability Probing
Sequence Diagram(s)sequenceDiagram
participant Onboard as OnboardingWizard
participant Probe as probeOllamaProxySandboxReachability
participant DockerNet as DockerNetworkInspect
participant DockerRun as DockerRun(container)
participant Busybox as busybox nc
participant Host as HostAuthProxy
Onboard->>Probe: invoke probe after persisting proxy token
Probe->>DockerNet: docker network inspect (get IPAM/gateway)
alt network/gateway present
Probe->>DockerRun: docker run --network <name> --add-host host-gateway:<ip|host-gateway> busybox nc host-gateway:PORT
DockerRun->>Busybox: spawn nc inside container
Busybox->>Host: TCP connect to host:PORT
alt connection OK
Host-->>Busybox: accept (nc exit 0)
Busybox-->>DockerRun: exit 0
DockerRun-->>Probe: success => {reason: "ok"}
Probe-->>Onboard: returns ok
else connection refused/blocked
Host-->>Busybox: refuse/timeout (nc exit 1)
Busybox-->>DockerRun: exit 1
DockerRun-->>Probe: exit 1 => {reason: "tcp_failed", subnet}
Probe-->>Onboard: returns tcp_failed
end
else probe unavailable
DockerNet-->>Probe: missing/invalid => {reason: "probe_unavailable"}
Probe-->>Onboard: returns probe_unavailable
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested Labels
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 64-67: Collapse the inline branching by adding a single exported
helper in the ollama-proxy-reachability module (e.g., a function like
ensureOllamaProxyReachable) that encapsulates the orchestration currently done
with formatOllamaProxyUnreachableMessage and
probeOllamaProxySandboxReachability; have that helper return the same
result/throw the same errors so callers don't change logic. Then update
setupInference() to call only that new helper (replace direct uses of
formatOllamaProxyUnreachableMessage and probeOllamaProxySandboxReachability with
a single require/import of the new wrapper) so the probe logic is fully
contained in the reachability module.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fdf57210-79a2-4057-ad5d-3fc95b169eda
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/ollama-proxy-reachability.test.tssrc/lib/onboard/ollama-proxy-reachability.ts
| const { | ||
| formatOllamaProxyUnreachableMessage, | ||
| probeOllamaProxySandboxReachability, | ||
| }: typeof import("./onboard/ollama-proxy-reachability") = require("./onboard/ollama-proxy-reachability"); |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Collapse this probe orchestration behind a single helper.
This behavior looks right, but the extra inline branching is what is tripping onboard-entrypoint-budget right now (src/lib/onboard.ts is net +16). Since the reachability behavior already lives in ./onboard/ollama-proxy-reachability, expose one wrapper there and keep setupInference() to a single call.
♻️ Possible shape
const {
- formatOllamaProxyUnreachableMessage,
- probeOllamaProxySandboxReachability,
+ verifyOllamaProxySandboxReachability,
}: typeof import("./onboard/ollama-proxy-reachability") = require("./onboard/ollama-proxy-reachability");
@@
- if (!isWsl()) {
- const reach = await probeOllamaProxySandboxReachability();
- if (!reach.ok) {
- const msg = formatOllamaProxyUnreachableMessage(reach);
- if (reach.reason === "tcp_failed") {
- console.error(msg);
- process.exit(1);
- } else if (msg) {
- console.warn(msg);
- }
- }
- }
+ if (!isWsl()) {
+ await verifyOllamaProxySandboxReachability({
+ warn: console.warn,
+ error: console.error,
+ exit: process.exit,
+ });
+ }Also applies to: 8014-8025
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/onboard.ts` around lines 64 - 67, Collapse the inline branching by
adding a single exported helper in the ollama-proxy-reachability module (e.g., a
function like ensureOllamaProxyReachable) that encapsulates the orchestration
currently done with formatOllamaProxyUnreachableMessage and
probeOllamaProxySandboxReachability; have that helper return the same
result/throw the same errors so callers don't change logic. Then update
setupInference() to call only that new helper (replace direct uses of
formatOllamaProxyUnreachableMessage and probeOllamaProxySandboxReachability with
a single require/import of the new wrapper) so the probe logic is fully
contained in the reachability module.
…ume retry Previously the probe ran after `runOpenshell(["inference", "set", ...])`, so on failure isInferenceRouteReady() returned true and a --resume retry skipped setupInference() entirely, never re-running the probe. Moving the probe inside the existing if (!isWsl()) block, after the proxy token is persisted but before upsertProvider/inference set, ensures that a failing probe leaves no committed inference route: isInferenceRouteReady() stays false, so both a fresh re-run and --resume re-enter setupInference() and re-probe connectivity after the user has applied the UFW fix. Also removes the now-dead `else if (msg)` warn branch since formatOllamaProxyUnreachableMessage returns "" for probe_unavailable and that branch could never fire. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryPi Semantic E2E AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
|
Superseded by #3472 — same code in a single squashed commit, authored and Signed-off-by me. Force-push is blocked on this branch, so the resign had to land on a fresh branch + new PR. |
… Linux (#3472) > Re-attributed replacement for #3465. Same code, single squashed commit authored and signed-off by me. Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3465 will be closed in favour of this one. ## Summary Fixes #3340. On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. Not fixed by PR #3459 (probes port 8080, which has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side container check uses `--add-host host-gateway` on the default bridge, never testing the real sandbox path). **Root cause:** Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it. ## What this PR does Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a short-lived `busybox` container on the `openshell-docker` network (the exact network the Docker-driver gateway creates for sandboxes) and performs `nc -zw5 host.openshell.internal:11435`, mirroring the real sandbox route. The probe is called inside `setupInference()` **before** `upsertProvider` and `runOpenshell(["inference", "set", ...])`, so: - A `tcp_failed` result (nc exit 1 on Linux native) prints a targeted `sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation and exits 1. - Because no inference route is committed on failure, `isInferenceRouteReady()` stays false — both a fresh re-run and `--resume` re-enter `setupInference()` and re-probe after the user applies the UFW fix. - A `probe_unavailable` result (Docker Desktop, DNS failure, network not found, non-0/non-1 nc exit) continues silently — these environments either don't have UFW or aren't using the Docker-driver gateway. - Skipped entirely on WSL. ## Why PR #3459 doesn't fix this Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080 to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule. ## Why the previous probe (PR #3441) was reverted PR #3441's probe had no `--add-host`, so `host.openshell.internal` was unresolvable inside the probe container — nc always exited 1 regardless of whether UFW was enabled. This PR adds `--add-host host.openshell.internal:<gatewayIp>` and a `isNameResolutionFailure()` guard that reclassifies DNS errors as `probe_unavailable` (non-fatal) rather than `tcp_failed`. ## Scope Targets **Docker-driver gateway mode** (v0.0.40+) where sandboxes run on the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments (`openshell-cluster-nemoclaw`) don't have this network; the probe returns `probe_unavailable` and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies. ## Functional verification Verified end-to-end on a Brev VM: - **Without blocking:** probe container runs `nc -zw5 host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok` ✓ - **With `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j DROP`:** same probe → `status 1` → `tcp_failed` → UFW remediation message printed ✓ ## Test plan - [x] 25 unit tests covering all result variants, argument construction, host-gateway mode, DNS failure classification, env var override, and UFW message formatter — all pass - [x] No new test failures vs main baseline (verified after `npm run build:cli`) - [x] Biome lint/format, SPDX headers, shfmt all pass (`make check` — `hadolint` binary missing on Brev VM, pre-existing) - [x] No new TypeScript type errors (`npm run typecheck:cli`) - [x] Functional end-to-end: correct `status 0` / `status 1` behaviour with and without iptables blocking Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding for Ollama-local now probes sandbox→proxy reachability and will halt onboarding with a clear, actionable error if a TCP connectivity failure is detected. * **Tests** * Added extensive tests covering reachability probing, network parsing, DNS/connection failure cases, Docker run behavior, and user-facing error message formatting. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3472) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Fixes #3340.
On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. Not fixed by PR #3459 (probes port 8080, which has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side container check uses `--add-host host-gateway` on the default bridge, never testing the real sandbox path).
Root cause: Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it.
What this PR does
Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a short-lived `busybox` container on the `openshell-docker` network (the exact network the Docker-driver gateway creates for sandboxes) and performs `nc -zw5 host.openshell.internal:11435`, mirroring the real sandbox route.
The probe is called inside `setupInference()` before `upsertProvider` and `runOpenshell(["inference", "set", ...])`, so:
Why PR #3459 doesn't fix this
Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080 to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule.
Why the previous probe (PR #3441) was reverted
PR #3441's probe had no `--add-host`, so `host.openshell.internal` was unresolvable inside the probe container — nc always exited 1 regardless of whether UFW was enabled. This PR adds `--add-host host.openshell.internal:` and a `isNameResolutionFailure()` guard that reclassifies DNS errors as `probe_unavailable` (non-fatal) rather than `tcp_failed`.
Scope
Targets Docker-driver gateway mode (v0.0.40+) where sandboxes run on the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments (`openshell-cluster-nemoclaw`) don't have this network; the probe returns `probe_unavailable` and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies.
Functional verification
Verified end-to-end on a Brev VM:
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests