fix(onboard): probe Ollama proxy reachability from sandbox network on Linux #3472
… Linux

Port 11435 (Ollama auth proxy) has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain, where a default-deny policy silently drops it. The existing host-side container check uses --add-host host-gateway on the default Docker bridge and misses this path.

Add a new probe module (ollama-proxy-reachability) that launches a short-lived busybox container on the openshell-docker network (the same network the real sandbox uses) and performs nc -zw5 host.openshell.internal:11435. A tcp_failed result (exit 1) surfaces a targeted ufw remediation command and exits 1; non-fatal probe_unavailable results (Docker Desktop, DNS failure, network missing) log a warning and continue. Skipped entirely on WSL.

The probe runs inside the existing if (!isWsl()) block, after the proxy token is persisted but before upsertProvider/inference set, so that a failing probe leaves no committed inference route: isInferenceRouteReady() stays false, and both a fresh re-run and --resume re-enter setupInference() to re-probe connectivity after the user has applied the UFW fix.

Fixes #3340

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
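The three outcomes named in this commit message (ok, tcp_failed, probe_unavailable) can be sketched as a small classifier over the probe container's exit code and stderr. This is a minimal sketch, not the module's actual code: the `ProbeResult` shape, the `classifyProbe` name, and the stderr patterns are assumptions made for illustration. In particular, the patterns assume busybox reports an unresolved name with text such as "bad address".

```typescript
// Sketch only: type and function shapes are illustrative assumptions.
type ProbeResult =
  | { ok: true }
  | { ok: false; reason: "tcp_failed" | "probe_unavailable"; detail: string };

// Assumed heuristic for DNS errors; the real guard may match different text.
function isNameResolutionFailure(stderr: string): boolean {
  return /bad address|name or service not known|could not resolve/i.test(stderr);
}

function classifyProbe(exitCode: number | null, stderr: string): ProbeResult {
  const detail = stderr.trim();
  // Exit 0: the TCP connection to the proxy port succeeded.
  if (exitCode === 0) return { ok: true };
  // Exit 1 without a DNS error: the connection itself was dropped or refused.
  if (exitCode === 1 && !isNameResolutionFailure(stderr)) {
    return { ok: false, reason: "tcp_failed", detail };
  }
  // Anything else (DNS failure, missing network, other exit codes) is non-fatal.
  return { ok: false, reason: "probe_unavailable", detail };
}
```

Keeping the classifier pure makes the fatal and non-fatal paths easy to unit-test without starting a container.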
No actionable comments were generated in the recent review.
Walkthrough
Adds a sandbox-side Docker-network reachability probe for the Ollama auth proxy, formats remediation guidance, and integrates a persistence+probe step into the ollama-local onboarding path that exits only on TCP connectivity failure.
Changes
Ollama Proxy Sandbox Reachability Check
Sequence Diagram(s)
sequenceDiagram
participant Wizard as NemoClaw Wizard
participant Probe as probeOllamaProxySandboxReachability
participant Docker as Docker API
participant Container as Busybox+nc container
participant Proxy as Ollama auth proxy
Wizard->>Probe: call probeOllamaProxySandboxReachability()
Probe->>Docker: inspect network IPAM
Docker-->>Probe: return subnet & gateway IP
Probe->>Docker: run busybox container with --add-host mapping and nc -zw
Docker->>Container: start container
Container->>Proxy: TCP connect to host:11435
Proxy-->>Container: accept or refuse connection
Container-->>Docker: exit code + stderr
Docker-->>Probe: probe result
Probe->>Probe: classify as ok / tcp_failed / probe_unavailable
Probe-->>Wizard: reachability result
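The first two steps of the diagram (read the network's IPAM data, then start the busybox container with an `--add-host` mapping) can be sketched as two pure helpers. This is a rough sketch under stated assumptions: `parseIpam` and `buildProbeArgs` are hypothetical names, the input is assumed to be the JSON printed by `docker network inspect <network> --format '{{json .IPAM.Config}}'`, and the gateway address in the usage note is an example value. The PR text writes the target as `host.openshell.internal:11435`; the sketch passes host and port as separate arguments, which is the form busybox `nc` normally takes.

```typescript
// Sketch only: helper names and shapes are illustrative assumptions.
interface IpamConfig {
  Subnet?: string;
  Gateway?: string;
}

// Extracts the first subnet/gateway pair from the IPAM config JSON.
function parseIpam(json: string): { subnet: string; gatewayIp: string } | null {
  let configs: unknown;
  try {
    configs = JSON.parse(json);
  } catch {
    return null; // Unparseable output: let the caller treat it as probe_unavailable.
  }
  if (!Array.isArray(configs)) return null;
  for (const c of configs as IpamConfig[]) {
    if (c && c.Subnet && c.Gateway) return { subnet: c.Subnet, gatewayIp: c.Gateway };
  }
  return null;
}

// Builds the argument list for `docker`, mirroring the sandbox route.
function buildProbeArgs(network: string, gatewayIp: string, port = 11435): string[] {
  const host = "host.openshell.internal";
  return [
    "run", "--rm",
    "--network", network,
    "--add-host", `${host}:${gatewayIp}`,
    "busybox", "nc", "-zw5", host, String(port),
  ];
}
```

For example, `buildProbeArgs("openshell-docker", "172.19.0.1")` yields the arguments for a one-shot container on the sandbox network; the resulting exit code and stderr are what the classification step then inspects.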
Estimated code review effort: 3 (Moderate), ~25 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 7980-7991: Replace the current multi-line probe block with a
compact, behavior-preserving form: call probeOllamaProxySandboxReachability(),
check reach.ok and if false compute msg via
formatOllamaProxyUnreachableMessage(reach); if reach.reason === "tcp_failed"
print the msg to stderr (console.error) and exit with process.exit(1); otherwise
do nothing — keep the same variable names (reach, msg), functions
(probeOllamaProxySandboxReachability, formatOllamaProxyUnreachableMessage) and
the same branching logic but collapse into fewer lines to avoid increasing file
size.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0f625501-c6e0-40cd-8079-1dba576cfdc6
📒 Files selected for processing (3)
- src/lib/onboard.ts
- src/lib/onboard/ollama-proxy-reachability.test.ts
- src/lib/onboard/ollama-proxy-reachability.ts
…maProxy

The previous commit added 16 lines to src/lib/onboard.ts (12k-line god file), failing the onboard-entrypoint-budget check that requires net-zero or smaller changes to the top-level entrypoint.

Move the probe + format + exit logic into a new persistAndProbeOllamaProxy helper in src/lib/inference/ollama/proxy.ts that composes the existing persistProxyToken with probeOllamaProxySandboxReachability and formatOllamaProxyUnreachableMessage from src/lib/onboard/ollama-proxy-reachability.ts. The entrypoint now only swaps two existing names: persistProxyToken -> persistAndProbeOllamaProxy in the import block and at the call site. Cumulative diff for src/lib/onboard.ts is now +2/-2, satisfying the budget. Behaviour is unchanged.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
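The composition this commit describes (persist the token, probe, format, exit only on a TCP failure) might look roughly like the sketch below. The real signatures of `persistProxyToken`, `probeOllamaProxySandboxReachability`, and `formatOllamaProxyUnreachableMessage` are not shown in this conversation, so the parameter lists, the `Reach` type, and the injected `Deps` object are assumptions made to keep the example self-contained. The `fail` callback stands in for printing to stderr and calling `process.exit(1)`.

```typescript
// Sketch only: signatures and the Deps object are illustrative assumptions.
type Reach =
  | { ok: true }
  | { ok: false; reason: "tcp_failed" | "probe_unavailable" };

interface Deps {
  persistProxyToken: (token: string) => void;
  probeOllamaProxySandboxReachability: () => Reach;
  formatOllamaProxyUnreachableMessage: (reach: Reach) => string;
  // In the CLI this would print the message to stderr and exit with status 1.
  fail: (msg: string) => void;
}

function persistAndProbeOllamaProxy(token: string, deps: Deps): void {
  // Persist first, so a later re-run can reuse the token.
  deps.persistProxyToken(token);
  const reach = deps.probeOllamaProxySandboxReachability();
  if (reach.ok) return;
  const msg = deps.formatOllamaProxyUnreachableMessage(reach);
  // Only a confirmed TCP failure is fatal; probe_unavailable falls through.
  if (reach.reason === "tcp_failed") deps.fail(msg);
}
```

Injecting the dependencies is a choice made for this sketch so the branching can be exercised without Docker; the helper in the PR presumably imports these functions directly.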
## Summary

- Bump the docs metadata and version switcher to `0.0.41`.
- Add v0.0.41 release notes plus operator guidance for OpenShell pinning, Docker bridge reachability, Local Ollama proxy reachability, and Docker GPU onboarding diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated docs.

## Source summary

- #3434 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Linux Docker-driver GPU onboarding behavior, diagnostics, cleanup guidance, and the `NEMOCLAW_DOCKER_GPU_PATCH` troubleshooting escape hatch.
- #3483 -> `docs/about/release-notes.md`: Note that `nemoclaw uninstall` removes all installer-managed OpenShell helper binaries unless `--keep-openshell` is passed.
- #3446 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document blueprint-driven OpenShell install pin resolution and fallback behavior.
- #3472 -> `docs/inference/use-local-inference.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document sandbox-side Local Ollama auth proxy reachability checks and firewall remediation.
- #3459 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, `docs/about/release-notes.md`: Document Docker-driver sandbox-to-gateway reachability checks and firewall remediation.

## Test plan

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user`
- `make docs`
- `git diff --check`
- `npm run build:cli`
- `npm run typecheck:cli`
- pre-commit hooks during `git commit`

## Summary by CodeRabbit

* **New Features**
  * Added `nemoclaw inference get` command to check current inference settings
  * Improved gateway health validation with Linux firewall remediation guidance
* **Bug Fixes**
  * Enhanced proxy readiness validation with sandbox network path probes
  * Improved local Ollama route onboarding with rerun-safe fixes
  * Better sandbox-to-gateway connectivity detection
* **Documentation**
  * Expanded troubleshooting guidance for firewall and connectivity issues
  * Updated CLI reference with new command and environment variable documentation
  * Added gateway binding and Docker-driver GPU compatibility guidance
Summary
Fixes #3340.
On Brev VMs (and any Linux host with UFW default-deny), the Ollama auth proxy on port 11435 is unreachable from sandbox containers, causing inference calls to hang. Not fixed by PR #3459 (probes port 8080, which has a Docker DNAT rule that bypasses UFW INPUT) or by v0.0.40 (host-side container check uses `--add-host host-gateway` on the default bridge, never testing the real sandbox path).

Root cause: Port 11435 has no Docker DNAT rule, so traffic from sandbox containers reaches the host UFW INPUT chain, where a default-deny policy silently drops it.
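Given that root cause, the host-side fix amounts to a UFW rule that admits the sandbox subnet on port 11435. A minimal sketch of a remediation formatter follows; `formatUfwRemediation` is a hypothetical name and the surrounding message wording is invented for illustration. Only the `sudo ufw allow from <subnet> to any port 11435 proto tcp` command itself comes from this PR.

```typescript
// Sketch only: function name and message wording are illustrative assumptions.
function formatUfwRemediation(subnet: string, port = 11435): string {
  return [
    `The Ollama auth proxy on port ${port} is not reachable from the sandbox network (${subnet}).`,
    "If UFW is enabled with a default-deny policy, allow the sandbox subnet:",
    `  sudo ufw allow from ${subnet} to any port ${port} proto tcp`,
    "Then re-run onboarding.",
  ].join("\n");
}
```

Scoping the rule to the sandbox subnet, rather than opening the port to every source, keeps the proxy closed to other networks.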
What this PR does
Adds `src/lib/onboard/ollama-proxy-reachability.ts`, which runs a short-lived `busybox` container on the `openshell-docker` network (the exact network the Docker-driver gateway creates for sandboxes) and performs `nc -zw5 host.openshell.internal:11435`, mirroring the real sandbox route.

The probe is called inside `setupInference()` before `upsertProvider` and `runOpenshell(["inference", "set", ...])`, so:

- `tcp_failed` result (nc exit 1 on Linux native) prints a targeted `sudo ufw allow from <subnet> to any port 11435 proto tcp` remediation and exits 1.
- `isInferenceRouteReady()` stays false, so both a fresh re-run and `--resume` re-enter `setupInference()` and re-probe after the user applies the UFW fix.
- `probe_unavailable` result (Docker Desktop, DNS failure, network not found, non-0/non-1 nc exit) continues silently; these environments either don't have UFW or aren't using the Docker-driver gateway.

Why PR #3459 doesn't fix this
Port 8080 has a Docker DNAT rule (`DNAT tcp dpt:8080 to:172.18.0.2:30051`) that redirects traffic before it hits UFW INPUT, so that probe passes even with UFW blocking everything. Port 11435 has no such rule.

Why the previous probe (PR #3441) was reverted
PR #3441's probe had no `--add-host`, so `host.openshell.internal` was unresolvable inside the probe container; nc always exited 1 regardless of whether UFW was enabled. This PR adds `--add-host host.openshell.internal:<gatewayIp>` and an `isNameResolutionFailure()` guard that reclassifies DNS errors as `probe_unavailable` (non-fatal) rather than `tcp_failed`.

Scope
Targets Docker-driver gateway mode (v0.0.40+) where sandboxes run on the `openshell-docker` bridge. Pre-v0.0.40 K3S deployments (`openshell-cluster-nemoclaw`) don't have this network; the probe returns `probe_unavailable` and continues silently. Users re-onboarding to v0.0.40+ migrate to Docker-driver gateway mode where the probe applies.

Functional verification
Verified end-to-end on a Brev VM:

- `nc -zw5 host.openshell.internal:11435` on `openshell-docker` → `status 0` → `ok` ✓
- `iptables -I INPUT -s 172.19.0.0/16 -p tcp --dport 11435 -j DROP`: same probe → `status 1` → `tcp_failed` → UFW remediation message printed ✓

Test plan
- `npm run build:cli`
- `make check` (`hadolint` binary missing on Brev VM, pre-existing)
- `npm run typecheck:cli`
- `status 0` / `status 1` behaviour with and without iptables blocking

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Summary by CodeRabbit
New Features
Tests