
fix(onboard): probe Ollama proxy reachability from sandbox network on Linux #3472

Merged
cv merged 2 commits into main from fix/3340-ollama-proxy-ufw-preflight-resigned on May 13, 2026

Conversation

@prekshivyas

@prekshivyas prekshivyas commented May 13, 2026

Contributor

Re-attributed replacement for #3465. Same code, single squashed commit authored and signed-off by me. Force-push is blocked on the original branch, so this is a fresh branch + new PR; #3465 will be closed in favour of this one.

Summary

Fixes #3340.

On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. This is not fixed by PR #3459 (which probes port 8080, a port with a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (whose host-side container check uses --add-host host-gateway on the default bridge and never tests the real sandbox path).

Root cause: Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain — where a default-deny policy silently drops it.

What this PR does

Adds src/lib/onboard/ollama-proxy-reachability.ts, which runs a short-lived busybox container on the openshell-docker network (the exact network the Docker-driver gateway creates for sandboxes) and performs nc -zw5 host.openshell.internal:11435, mirroring the real sandbox route.

The probe is called inside setupInference() before upsertProvider and runOpenshell(["inference", "set", ...]), so:

  • A tcp_failed result (nc exit 1 on Linux native) prints a targeted sudo ufw allow from <subnet> to any port 11435 proto tcp remediation and exits 1.
  • Because no inference route is committed on failure, isInferenceRouteReady() stays false — both a fresh re-run and --resume re-enter setupInference() and re-probe after the user applies the UFW fix.
  • A probe_unavailable result (Docker Desktop, DNS failure, network not found, non-0/non-1 nc exit) continues silently — these environments either don't have UFW or aren't using the Docker-driver gateway.
  • Skipped entirely on WSL.
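The result handling in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `classifyProbeExit` is a hypothetical name, and the stderr patterns inside `isNameResolutionFailure` are assumptions about what a DNS failure might look like in the probe container.

```typescript
// Hypothetical sketch of the exit-code classification described above.
// Only the three result names and isNameResolutionFailure come from the PR
// text; the function name and the stderr patterns are assumptions.
type ProbeReason = "ok" | "tcp_failed" | "probe_unavailable";

// Assumed patterns for a DNS failure reported by nc inside the container.
function isNameResolutionFailure(stderr: string): boolean {
  return /bad address|name or service not known|could not resolve/i.test(stderr);
}

function classifyProbeExit(status: number | null, stderr: string): ProbeReason {
  // Exit 0: the TCP connect to the proxy succeeded.
  if (status === 0) return "ok";
  // Exit 1 without a DNS error: the connect itself failed (for example, UFW dropped it).
  if (status === 1 && !isNameResolutionFailure(stderr)) return "tcp_failed";
  // Everything else (DNS failure, missing network, other exit codes) is non-fatal.
  return "probe_unavailable";
}
```

Keeping `tcp_failed` as the only fatal outcome means an environment where the probe cannot run at all never blocks onboarding.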

Why PR #3459 doesn't fix this

Port 8080 has a Docker DNAT rule (DNAT tcp dpt:8080 to:172.18.0.2:30051) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule.

Why the previous probe (PR #3441) was reverted

PR #3441's probe had no --add-host, so host.openshell.internal was unresolvable inside the probe container — nc always exited 1 regardless of whether UFW was enabled. This PR adds --add-host host.openshell.internal:<gatewayIp> and an isNameResolutionFailure() guard that reclassifies DNS errors as probe_unavailable (non-fatal) rather than tcp_failed.
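To make the role of --add-host concrete, here is one way the probe's docker arguments might be assembled. It is a sketch under assumptions: `buildProbeDockerArgs` and its option names are hypothetical, and passing the host and port to nc as separate arguments is an assumption about the invocation rather than something the PR text confirms.

```typescript
// Hypothetical sketch: builds the argument list for a short-lived probe container.
// Function and option names are illustrative, not taken from the PR diff.
interface ProbeArgsOptions {
  network: string;    // e.g. "openshell-docker"
  gatewayIp: string;  // IPv4 gateway of that network
  host: string;       // e.g. "host.openshell.internal"
  port: number;       // e.g. 11435
  timeoutSec: number; // nc connect timeout in seconds
  image: string;      // e.g. "busybox"
}

function buildProbeDockerArgs(opts: ProbeArgsOptions): string[] {
  return [
    "run",
    "--rm",
    "--network", opts.network,
    // Without this mapping the hostname cannot resolve inside the container.
    "--add-host", `${opts.host}:${opts.gatewayIp}`,
    opts.image,
    "nc", `-zw${opts.timeoutSec}`, opts.host, String(opts.port),
  ];
}

console.log(
  buildProbeDockerArgs({
    network: "openshell-docker",
    gatewayIp: "172.19.0.1",
    host: "host.openshell.internal",
    port: 11435,
    timeoutSec: 5,
    image: "busybox",
  }).join(" "),
);
```

The gateway address in the usage example is illustrative; in practice it would come from inspecting the network.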

Scope

Targets Docker-driver gateway mode (v0.0.40+) where sandboxes run on the openshell-docker bridge. Pre-v0.0.40 K3S deployments (openshell-cluster-nemoclaw) don't have this network; the probe returns probe_unavailable and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies.

Functional verification

Verified end-to-end on a Brev VM:

  • Without blocking: probe container runs nc -zw5 host.openshell.internal:11435 on openshell-docker → status 0 → ok
  • With iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j DROP: same probe → status 1 → tcp_failed → UFW remediation message printed ✓
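The remediation message mentioned in the second case could be produced along these lines. This is only a sketch: `formatUfwRemediation` is a hypothetical stand-in for the PR's formatter, and the surrounding wording of the message is assumed. Only the `sudo ufw allow from <subnet> to any port 11435 proto tcp` command shape comes from the PR description.

```typescript
// Hypothetical sketch of a remediation formatter; message wording is assumed.
function formatUfwRemediation(subnet: string | undefined, port = 11435): string {
  const header = `Sandbox containers cannot reach the Ollama auth proxy on port ${port}.`;
  if (subnet) {
    // Subnet known: print the exact allow rule to apply.
    return `${header}\nRun: sudo ufw allow from ${subnet} to any port ${port} proto tcp`;
  }
  // Subnet unknown: point the user at the network so they can look it up.
  return (
    `${header}\nFind the sandbox subnet with: docker network inspect openshell-docker\n` +
    `Then run: sudo ufw allow from <subnet> to any port ${port} proto tcp`
  );
}

console.log(formatUfwRemediation("172.19.0.0/16"));
```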

Test plan

  • 25 unit tests covering all result variants, argument construction, host-gateway mode, DNS failure classification, env var override, and UFW message formatter — all pass
  • No new test failures vs main baseline (verified after npm run build:cli)
  • Biome lint/format, SPDX headers, shfmt all pass (make check: hadolint binary missing on Brev VM, pre-existing)
  • No new TypeScript type errors (npm run typecheck:cli)
  • Functional end-to-end: correct status 0 / status 1 behaviour with and without iptables blocking

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Summary by CodeRabbit

  • New Features

    • Onboarding for Ollama-local now probes sandbox→proxy reachability and will halt onboarding with a clear, actionable error if a TCP connectivity failure is detected.
  • Tests

    • Added extensive tests covering reachability probing, network parsing, DNS/connection failure cases, Docker run behavior, and user-facing error message formatting.

Review Change Stack

… Linux

Port 11435 (Ollama auth proxy) has no Docker DNAT rule, so traffic from
sandbox containers reaches the host UFW INPUT chain — where a default-deny
policy silently drops it. The existing host-side container check uses
--add-host host-gateway on the default Docker bridge and misses this path.

Add a new probe module (ollama-proxy-reachability) that launches a short-lived
busybox container on the openshell-docker network (the same network the real
sandbox uses) and performs nc -zw5 host.openshell.internal:11435. A tcp_failed
result (exit 1) surfaces a targeted ufw remediation command and exits 1;
non-fatal probe_unavailable results (Docker Desktop, DNS failure, network
missing) log a warning and continue. Skipped entirely on WSL.

The probe runs inside the existing if (!isWsl()) block, after the proxy
token is persisted but before upsertProvider/inference set, so that a
failing probe leaves no committed inference route: isInferenceRouteReady()
stays false, and both a fresh re-run and --resume re-enter setupInference()
to re-probe connectivity after the user has applied the UFW fix.

Fixes #3340

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented May 13, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 902113da-ef73-4c05-a813-cbf8e4e9f708

📥 Commits

Reviewing files that changed from the base of the PR and between 723a0b8 and 3df9d5c.

📒 Files selected for processing (2)
  • src/lib/inference/ollama/proxy.ts
  • src/lib/onboard.ts

📝 Walkthrough


Adds a sandbox-side Docker-network reachability probe for the Ollama auth proxy, formats remediation guidance, and integrates a persistence+probe step into the ollama-local onboarding path that exits only on TCP connectivity failure.

Changes

Ollama Proxy Sandbox Reachability Check

| Layer / File(s) | Summary |
|---|---|
| Probe contracts and constants (src/lib/onboard/ollama-proxy-reachability.ts) | Defines probe constants, OllamaProxyReachabilityReason, option/result interfaces, and defaults. |
| IPAM parsing and tests (src/lib/onboard/ollama-proxy-reachability.ts, src/lib/onboard/ollama-proxy-reachability.test.ts) | Implements parseNetworkIpamConfig to extract IPv4 subnet/gateway (skip IPv6) with unit tests for valid/invalid inputs. |
| Probe runtime and classification (src/lib/onboard/ollama-proxy-reachability.ts, src/lib/onboard/ollama-proxy-reachability.test.ts) | Adds probeOllamaProxySandboxReachability plus default inspect/run/host-gateway helpers, stderr classification, and tests for missing networks, gateway behaviors, success/failure exit codes, DNS failures, and Docker Desktop edge cases. |
| Unreachable message formatting and tests (src/lib/onboard/ollama-proxy-reachability.ts, src/lib/onboard/ollama-proxy-reachability.test.ts) | Adds formatOllamaProxyUnreachableMessage producing UFW allow commands when subnet known or subnet-inspection guidance when unknown; exposes __test and includes tests. |
| Ollama proxy persistence + probe (src/lib/inference/ollama/proxy.ts) | Adds persistAndProbeOllamaProxy(token) that persists token, runs sandbox probe, prints formatted message, and exits process only when reason is tcp_failed; exports the function. |
| Onboarding wiring (src/lib/onboard.ts) | Replaces persistProxyToken(proxyToken) with await persistAndProbeOllamaProxy(proxyToken) in the ollama-local onboarding flow. |
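As an illustration of the IPAM parsing step summarized above, the following sketch extracts the first IPv4 subnet and gateway from the JSON that `docker network inspect` reports under `.IPAM.Config`. The function name `parseIpamConfigSketch` and its return shape are hypothetical; the PR's actual parseNetworkIpamConfig may differ in both signature and edge-case handling.

```typescript
// Hypothetical sketch: parse the IPAM config array and skip IPv6 entries.
// The Subnet/Gateway field names follow Docker's network inspect output.
interface IpamInfo {
  subnet: string;
  gateway: string | undefined;
}

function parseIpamConfigSketch(raw: string): IpamInfo | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (!Array.isArray(parsed)) return null;
  for (const entry of parsed) {
    if (typeof entry !== "object" || entry === null) continue;
    const { Subnet, Gateway } = entry as { Subnet?: unknown; Gateway?: unknown };
    // Skip entries without a subnet and IPv6 subnets (which contain ':').
    if (typeof Subnet !== "string" || Subnet.includes(":")) continue;
    return { subnet: Subnet, gateway: typeof Gateway === "string" ? Gateway : undefined };
  }
  return null;
}
```

Returning `null` for anything unparseable lets the caller fall back to the non-fatal path instead of guessing a subnet.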

Sequence Diagram(s)

sequenceDiagram
  participant Wizard as NemoClaw Wizard
  participant Probe as probeOllamaProxySandboxReachability
  participant Docker as Docker API
  participant Container as Busybox+nc container
  participant Proxy as Ollama auth proxy
  Wizard->>Probe: call probeOllamaProxySandboxReachability()
  Probe->>Docker: inspect network IPAM
  Docker-->>Probe: return subnet & gateway IP
  Probe->>Docker: run busybox container with --add-host mapping and nc -zw
  Docker->>Container: start container
  Container->>Proxy: TCP connect to host:11435
  Proxy-->>Container: accept or refuse connection
  Container-->>Docker: exit code + stderr
  Docker-->>Probe: probe result
  Probe->>Probe: classify as ok / tcp_failed / probe_unavailable
  Probe-->>Wizard: reachability result

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

fix, Provider: Ollama, Docker, Platform: Brev, Platform: Ubuntu, Sandbox, NemoClaw CLI, v0.0.40

Suggested reviewers

  • ericksoa
  • jyaunches

Poem

🐰 I sniff the Docker bridge at dawn and try a tiny probe,
Busybox hops in, netcat taps, across the network road.
If UFW guards the gate too tight and TCP gets denied,
I'll print the rule you need to run—no more the silent ride.
A small hop, a hopeful cheer, onboarding now checks inside.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 38.46% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding Ollama proxy reachability probing from the sandbox network on Linux during onboarding. |
| Linked Issues check | ✅ Passed | The PR fully addresses the linked issue #3340 by implementing sandbox→host reachability detection for the Ollama auth proxy, detecting UFW-based blocking, and providing actionable remediation guidance. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the core objective of detecting and reporting Ollama proxy reachability failures in the sandbox network during onboarding. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@github-actions

github-actions Bot commented May 13, 2026

Contributor

E2E Advisor Recommendation

Required E2E: gpu-e2e, gpu-double-onboard-e2e, onboard-resume-e2e
Optional E2E: test-e2e-ollama-proxy, ollama-proxy-e2e

Dispatch hint: gpu-e2e,gpu-double-onboard-e2e,onboard-resume-e2e

Workflow run

Full advisor summary

Pi Semantic E2E Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • gpu-e2e (high): Full onboard with NEMOCLAW_PROVIDER=ollama exercises the modified setupInference() ollama-local branch end-to-end. The new persistAndProbeOllamaProxy() call runs a real sandbox-network probe; a regression there would break local-Ollama onboard on every GPU host. Closest existing coverage of the user flow this PR touches.
  • gpu-double-onboard-e2e (high): Coverage guard for #2553, "ollama proxy token diverges from stored token after re-onboard, causing persistent HTTP 401 on inference" (Ollama proxy token consistency across re-onboard). The PR replaces persistProxyToken() with persistAndProbeOllamaProxy() at the exact call site this test was designed to protect; re-onboard now triggers the probe twice and must remain idempotent without leaving stale token/PID state.
  • onboard-resume-e2e (medium): PR docstring on persistAndProbeOllamaProxy explicitly states --resume re-enters setupInference and re-probes. Resume semantics depend on isInferenceRouteReady() staying false on probe failure so the next attempt re-runs setupInference; this test is the only existing coverage of resume re-entry.

Optional E2E

  • test-e2e-ollama-proxy (low): Already auto-runs on this PR via pr.yaml when code changes. Validates the host-side auth proxy with a mock Ollama backend; it does not exercise the new sandbox-network probe but is cheap insurance that the proxy lifecycle still works after the persistAndProbeOllamaProxy() refactor.
  • ollama-proxy-e2e (low): Workflow_dispatch host-side proxy validation: token persistence, recovery from kill, and container reachability check against the proxy. Useful confidence check that token persistence still happens before any probe-induced exit.

New E2E recommendations

  • sandbox-networking (high): No existing E2E exercises the tcp_failed → process.exit(1) + UFW remediation path that motivates this PR (issue #3340, "[Brev][Inference] Ollama inference hangs because Brev UFW blocks port 11435 from sandbox", Brev/UFW-default-deny). All current tests run with permissive firewalls, so a regression that turns the probe into a false-negative (silently passing) or a false-positive (aborting clean onboards) would not be caught. Suggest a regression-e2e job that installs UFW with default-deny on the openshell-docker bridge subnet, runs nemoclaw onboard --provider ollama-local, asserts non-zero exit and that the printed message contains the expected sudo ufw allow ... port 11435 proto tcp remediation; then opens the rule and asserts a clean retry succeeds.
    • Suggested test: regression-e2e: ollama-proxy-sandbox-reachability-ufw — Linux runner with UFW default-deny; assert tcp_failed exit 1 + remediation text, then allow rule + retry succeeds.
  • sandbox-networking (medium): The probe_unavailable classification on Docker Desktop / macOS / hosts without the openshell-docker network is critical to avoid blocking onboard on those platforms. macos-e2e.yaml exists; consider asserting that the probe returns probe_unavailable (non-fatal) on Docker Desktop and onboard continues past the ollama-local branch without exit 1.
    • Suggested test: macos-e2e: assert ollama-local onboard does not abort on Docker Desktop when the sandbox-network probe is unavailable.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: gpu-e2e,gpu-double-onboard-e2e,onboard-resume-e2e

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 7980-7991: Replace the current multi-line probe block with a
compact, behavior-preserving form: call probeOllamaProxySandboxReachability(),
check reach.ok and if false compute msg via
formatOllamaProxyUnreachableMessage(reach); if reach.reason === "tcp_failed"
print the msg to stderr (console.error) and exit with process.exit(1); otherwise
do nothing — keep the same variable names (reach, msg), functions
(probeOllamaProxySandboxReachability, formatOllamaProxyUnreachableMessage) and
the same branching logic but collapse into fewer lines to avoid increasing file
size.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0f625501-c6e0-40cd-8079-1dba576cfdc6

📥 Commits

Reviewing files that changed from the base of the PR and between 11837d8 and 723a0b8.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/ollama-proxy-reachability.test.ts
  • src/lib/onboard/ollama-proxy-reachability.ts

Comment thread src/lib/onboard.ts Outdated
…maProxy

The previous commit added 16 lines to src/lib/onboard.ts (12k-line god
file), failing the onboard-entrypoint-budget check that requires net-zero
or smaller changes to the top-level entrypoint.

Move the probe + format + exit logic into a new persistAndProbeOllamaProxy
helper in src/lib/inference/ollama/proxy.ts that composes the existing
persistProxyToken with probeOllamaProxySandboxReachability and
formatOllamaProxyUnreachableMessage from src/lib/onboard/ollama-proxy-
reachability.ts. The entrypoint now only swaps two existing names:
persistProxyToken -> persistAndProbeOllamaProxy in the import block and
at the call site. Cumulative diff for src/lib/onboard.ts is now +2/-2,
satisfying the budget. Behaviour is unchanged.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
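
The composition described in this commit message (persist the token, probe, and abort only on tcp_failed) might look roughly like the sketch below. It is not the helper from src/lib/inference/ollama/proxy.ts: the `persistAndProbeSketch` name, the `ProbeDeps` interface, and the injected `fail` callback are all hypothetical, introduced so the flow can be exercised without Docker or process.exit.

```typescript
// Hypothetical sketch of the persist-then-probe flow with injected dependencies.
type Unreachable = { ok: false; reason: "tcp_failed" | "probe_unavailable" };
type Reach = { ok: true } | Unreachable;

interface ProbeDeps {
  persistToken(token: string): void;   // stands in for persistProxyToken
  probe(): Promise<Reach>;             // stands in for the sandbox reachability probe
  format(reach: Unreachable): string;  // stands in for the message formatter
  fail(message: string): void;         // assumed: print to stderr and exit 1
}

async function persistAndProbeSketch(token: string, deps: ProbeDeps): Promise<void> {
  // The token is persisted first, so it survives a probe-induced exit.
  deps.persistToken(token);
  const reach = await deps.probe();
  if (reach.ok) return;
  // Only a genuine TCP failure is fatal; probe_unavailable continues.
  if (reach.reason === "tcp_failed") {
    deps.fail(deps.format(reach));
  }
}
```

Because the failure happens before any inference route is committed, a later re-run or --resume would go through the same check again.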
@prekshivyas prekshivyas self-assigned this May 13, 2026
@cv cv merged commit 1794f53 into main May 13, 2026
20 checks passed
@prekshivyas prekshivyas deleted the fix/3340-ollama-proxy-ufw-preflight-resigned branch May 13, 2026 23:16
cv pushed a commit that referenced this pull request May 14, 2026
## Summary
- Bump the docs metadata and version switcher to `0.0.41`.
- Add v0.0.41 release notes plus operator guidance for OpenShell
pinning, Docker bridge reachability, Local Ollama proxy reachability,
and Docker GPU onboarding diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary
- #3434 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Linux Docker-driver GPU onboarding behavior, diagnostics,
cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting
escape hatch.
- #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall`
removes all installer-managed OpenShell helper binaries unless
`--keep-openshell` is passed.
- #3446 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document blueprint-driven OpenShell install pin resolution and fallback
behavior.
- #3472 -> `docs/inference/use-local-inference.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document sandbox-side Local Ollama auth proxy reachability checks and
firewall remediation.
- #3459 -> `docs/reference/commands.md`,
`docs/reference/troubleshooting.md`, `docs/about/release-notes.md`:
Document Docker-driver sandbox-to-gateway reachability checks and
firewall remediation.

## Test plan
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user`
- `make docs`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`
- pre-commit hooks during `git commit`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Added `nemoclaw inference get` command to check current inference
settings
* Improved gateway health validation with Linux firewall remediation
guidance

* **Bug Fixes**
  * Enhanced proxy readiness validation with sandbox network path probes
  * Improved local Ollama route onboarding with rerun-safe fixes
  * Better sandbox-to-gateway connectivity detection

* **Documentation**
* Expanded troubleshooting guidance for firewall and connectivity issues
* Updated CLI reference with new command and environment variable
documentation
  * Added gateway binding and Docker-driver GPU compatibility guidance

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3531)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix PR fixes a bug or regression label Jun 8, 2026

Labels

bug-fix PR fixes a bug or regression

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Brev][Inference] Ollama inference hangs because Brev UFW blocks port 11435 from sandbox

3 participants