fix(onboard): gate host-network GPU local inference reachability (#4509) #4609
…DIA#4509)

The Docker-driver GPU host-network path recreates the sandbox with --network host and wires OpenClaw to the direct 127.0.0.1 Ollama/vLLM URL, but onboarding declared success without proving the real container could reach that endpoint. A failed host-network recreate, an unexpected non-host network mode, or a host provider binding problem only surfaced later as an opaque ECONNREFUSED during an agent prompt.

Add a post-recreate verification gate (verifyDockerGpuHostNetworkLocalInference) that, only on the Docker-driver GPU host-network local inference path, resolves the recreated OpenShell-managed container, asserts HostConfig.NetworkMode is host, and runs a bounded docker exec curl probe against the direct loopback health endpoint (/api/tags for Ollama, /v1/models for vLLM). On failure it surfaces the endpoint, network mode, container id, and recovery hints, then fails onboarding early. Minimal/custom images lacking curl soft-skip with a warning instead of a false negative.

The orchestration lives in src/lib/onboard/docker-gpu-local-inference.ts (verifyGpuSandboxAfterReady) so onboard.ts stays net-neutral per the codebase-growth guardrail. Extends test/e2e/test-gpu-e2e.sh to assert the reachability proof when the direct sandbox URL is active.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
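The decision flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `docker-gpu-local-inference.ts`: the function name `verifyHostNetworkReachability`, the `Run` adapter, the `GateResult` shape, and the `port` parameter are all hypothetical. Only the `HostConfig.NetworkMode` check, the two health paths, and the curl soft-skip come from the description above.

```typescript
// Hypothetical names throughout; a sketch of the gate's decision order only.
type Provider = "ollama" | "vllm";

type GateResult =
  | { status: "ok" }
  | { status: "skipped"; reason: string }
  | { status: "failed"; reason: string };

// Runs a command and reports exit code and stdout. Injected so the decision
// logic can be exercised without a real Docker daemon.
type Run = (args: string[]) => { code: number; stdout: string };

// Health paths as named in the commit message.
function healthPath(provider: Provider): string {
  return provider === "ollama" ? "/api/tags" : "/v1/models";
}

function verifyHostNetworkReachability(
  containerId: string,
  provider: Provider,
  port: number, // assumption: the caller knows the provider's port
  run: Run,
): GateResult {
  // 1. The recreated container must actually be in host network mode.
  const inspect = run([
    "docker", "inspect", "--format", "{{.HostConfig.NetworkMode}}", containerId,
  ]);
  const mode = inspect.stdout.trim();
  if (inspect.code !== 0 || mode !== "host") {
    return { status: "failed", reason: `network mode is "${mode}", expected "host"` };
  }

  // 2. Images without curl soft-skip instead of producing a false negative.
  const hasCurl = run(["docker", "exec", containerId, "sh", "-c", "command -v curl"]);
  if (hasCurl.code !== 0) {
    return { status: "skipped", reason: "probe-tool-unavailable" };
  }

  // 3. Bounded probe against the direct loopback health endpoint.
  const url = `http://127.0.0.1:${port}${healthPath(provider)}`;
  const probe = run(["docker", "exec", containerId, "curl", "-sf", "--max-time", "5", url]);
  return probe.code === 0
    ? { status: "ok" }
    : { status: "failed", reason: `probe to ${url} failed` };
}
```

Injecting the command runner keeps the sketch testable with a fake adapter, in the same spirit as the mocked Docker/OpenShell adapters the PR mentions for its unit tests.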
Walkthrough
The PR adds a post-ready reachability verification gate for GPU sandboxes using Docker host-network patching. Onboarding now runs the GPU proof and, when applicable, verifies the recreated container can reach the provider's local inference endpoint via curl probing with retries, emitting diagnostics or exiting on failure.
Changes
Docker GPU host-network reachability verification
Estimated code review effort: 4 (Complex), ~45 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/docker-gpu-local-inference.ts`:
- Around line 293-297: The skip-path currently only calls options.log?.(...) so
the curl-missing warning is dropped when no logger is provided; update the
branch that checks containerHasCurl(containerId, dockerRunFn) to always emit the
warning (e.g., call console.warn or console.info) in addition to calling
options.log?.(...) so the operator always sees why the reachability probe was
skipped; keep the return { status: "skipped", reason: "probe-tool-unavailable" }
unchanged and make sure you reference the existing containerHasCurl,
dockerRunFn, and options.log? symbols.
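A minimal sketch of the logging behaviour this comment asks for, assuming a hypothetical `skipForMissingCurl` helper and `ProbeOptions` shape (neither is taken from the PR's source). It uses the fallback form, where `console.warn` is used only when no logger is supplied; emitting to both sinks would be an equally valid reading of the comment.

```typescript
// Hypothetical helper; only the returned object literal mirrors the review text.
type SkipResult = { status: "skipped"; reason: "probe-tool-unavailable" };

interface ProbeOptions {
  // Optional caller-supplied logger.
  log?: (message: string) => void;
}

function skipForMissingCurl(containerId: string, options: ProbeOptions = {}): SkipResult {
  // Fall back to console.warn so the warning is never silently dropped.
  const warn = options.log ?? ((message: string) => console.warn(message));
  warn(`curl not found in container ${containerId}; skipping reachability probe`);
  return { status: "skipped", reason: "probe-tool-unavailable" };
}
```

With this shape, `skipForMissingCurl("abc123")` still prints the warning even though no logger was wired, while callers that pass `log` keep full control over where the message goes.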
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5d0b971f-a5b6-499d-a4e9-c3792235bd2f
📒 Files selected for processing (4)
src/lib/onboard.ts
src/lib/onboard/docker-gpu-local-inference.test.ts
src/lib/onboard/docker-gpu-local-inference.ts
test/e2e/test-gpu-e2e.sh
…DIA#4509) Address CodeRabbit review on PR NVIDIA#4609: the curl-missing soft-skip used options.log?.() which silently dropped the warning when no logger was wired, leaving the operator with no explanation for why the reachability proof was skipped. Fall back to console.warn so the skip is always visible. Also add concise docstrings to the new helper functions to clear the docstring-coverage warning. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
## Summary - Adds the v0.0.56 release notes section with links to the deeper docs pages for installer, status, inference, messaging, policy, and lifecycle changes. - Updates source docs for the remaining release-prep gaps around `uv` in the PyPI preset, compact WhatsApp pairing guidance, and `nemoclaw inference set` command boundaries. - Refreshes generated `nemoclaw-user-*` skills and removes skipped experimental command terms from generated skill surfaces. ## Source summary - #4613 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents that public installs and `nemoclaw update` follow the maintained `lkg` tag by default. - #4419 -> `docs/about/release-notes.mdx`: Notes that non-interactive Linux installs can reactivate Docker group membership and continue in one installer run when `sg docker` is available. - #4550 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures live sandbox agent-version probing for status, connect, and upgrade checks. - #4609 -> `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Captures the GPU Docker-driver host-network local-inference reachability gate. - #4607 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents compact WhatsApp QR pairing guidance and gateway/session diagnostics. - #4582 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Reflects Slack credential validation before enabling the channel. - #4554 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Keeps Telegram allowlist alias guidance in the generated user skills and release notes. - #4563 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Includes the new `nemoclaw <name> skill remove <skill>` command in command docs and release notes. 
- #4566 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents the `nemoclaw inference set` redirect boundary when `--provider` or `--model` is missing. - #4323 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures per-sandbox status JSON support. - #4506 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures debug command sandbox-name validation and safer tarball writing. - #4569 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Documents that the `pypi` preset allows `/usr/local/bin/uv`. - #4579 -> `docs/network-policy/integration-policy-examples.mdx`, `docs/about/release-notes.mdx`: Captures observable Jira preset validation guidance. - #4229 -> `docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Documents user-data preservation defaults for uninstall. - #4399 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures CPU-only sandbox intent preservation across rebuilds. - #4058 -> `docs/reference/commands.mdx`, `docs/about/release-notes.mdx`: Captures safer snapshot restore behavior around existing destinations. - #4155 and #4460 -> skipped by `docs/.docs-skip`: Removed skipped experimental command terms from source docs and generated skill evals instead of documenting those features. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports the pre-existing light-mode accent contrast warning) - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills` (no matches) - `npm run build:cli` (run to refresh local CLI artifacts for the pre-push TypeScript hook) - Commit hooks passed, including `NEMOCLAW_* env-var documentation gate`, `Verify docs-to-skills output`, `markdownlint-cli2`, `gitleaks`, and `Test (skills YAML)`. 
<!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Expanded Model Router setup with YAML examples, flow diagrams, and credential handling; strengthened agent-config immutability and integrity guidance; messaging channels updated (Telegram aliases, WhatsApp pairing/diagnostics); CLI docs revised (GPU detection, inference set behavior, uninstall/rebuild preservation); overview rebranded to NemoClaw and added v0.0.56 release notes. * **New Features** * Added `nemoclaw <name> channels status` (messaging diagnostics, JSON); added `nemoclaw <name> skill remove`; Hermes no longer marked experimental; DGX Spark quickstart sandbox-name note. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary - Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes. - Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs. - Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files. ## Source summary - #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking. - #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings. - #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence. - #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization. - #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery. - #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation. 
- #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"` - `npm run docs` - `npm run build:cli` - Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides. * Enhanced security, credential storage, and deployment guidance with clearer setup flows. * Added Hermes plugin installation and ecosystem documentation. * Improved workspace, messaging, and policy management references with variant-specific command examples. * Refined troubleshooting and CLI reference sections for clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
NVIDIA#4509) PR NVIDIA#4609 verified host-network GPU local inference with `docker exec` against the recreated `--network host` container, whose main network namespace IS the host's — so the probe passed while the OpenClaw agent, which runs in OpenShell's isolated sandbox network namespace, still got ECONNREFUSED on the direct 127.0.0.1 provider URL. The sandbox namespace cannot reach the host loopback even under `--network host` (see detectSandboxFallbackDns), so the direct-loopback wiring was unreachable. - Never pin OpenClaw to a direct container-loopback inference URL; for local providers, downgrade an opted-in host-network GPU patch to the OpenShell bridge so inference routes through the reachable inference.local path (host networking is not needed for GPU access). - Re-run the sandbox bridge reachability probe (with UFW auto-fix) after the downgrade, since gateway startup skipped it under host mode. - Replace the docker-exec gate with a runtime-context probe via `openshell sandbox exec` that hits inference.local exactly as the agent does, requiring 2xx; 000/4xx/5xx fail with actionable recovery. Soft-skip only when the sandbox image genuinely lacks curl. - Update the GPU E2E to prove inference through `openshell sandbox exec` (the real runtime), removing the docker-exec shortcut that masked the bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
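The 2xx-only rule for the runtime-context probe could be expressed as a small classifier like the one below. This is a sketch under assumptions: `classifyProbeCode`, `ProbeVerdict`, and the hint strings are hypothetical, and the input is assumed to be the status string that `curl -w '%{http_code}'` prints (which is `000` when no HTTP response was received).

```typescript
// Hypothetical classifier; only the 2xx / 000 / 4xx / 5xx split comes from the commit message.
type ProbeVerdict =
  | { ok: true }
  | {
      ok: false;
      kind: "unreachable" | "client-error" | "server-error" | "unexpected";
      hint: string;
    };

function classifyProbeCode(raw: string): ProbeVerdict {
  const code = Number.parseInt(raw.trim(), 10);
  // Only a 2xx response counts as proof of reachability.
  if (code >= 200 && code < 300) return { ok: true };
  // "000" parses to 0: curl never received an HTTP response.
  if (Number.isNaN(code) || code === 0) {
    return { ok: false, kind: "unreachable", hint: "no response; check the inference route" };
  }
  if (code >= 400 && code < 500) {
    return { ok: false, kind: "client-error", hint: "route or auth misconfiguration" };
  }
  if (code >= 500 && code < 600) {
    return { ok: false, kind: "server-error", hint: "inference backend appears to be down" };
  }
  // Anything else (1xx, 3xx, out-of-range) is treated as a failure too.
  return { ok: false, kind: "unexpected", hint: `unexpected status ${code}` };
}
```

Treating every non-2xx code as a failure keeps the gate strict, so an error page from some intermediate hop cannot be mistaken for a healthy backend.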
#4509) (#5024) ## Summary Reopened #4509: on an Ubuntu 24.04 GPU host-network setup, onboard printed "local inference reachable" yet the agent then failed with `ECONNREFUSED` / "LLM request failed: network connection error". PR #4609 proved reachability with `docker exec` against the recreated `--network host` container — whose *main* network namespace is the host's — but OpenClaw runs in OpenShell's **isolated sandbox network namespace**, which cannot reach the host loopback even under `--network host`. So the direct `127.0.0.1` provider URL was unreachable for the agent while the probe falsely passed. This fixes the URL/network mapping and verifies it from the real runtime context. ## Related Issue Fixes #4509 ## Changes - **No direct container-loopback inference URL.** For local providers, an opted-in host-network GPU patch (`NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host`) is downgraded to the OpenShell bridge so inference routes through the reachable `inference.local` path. Host networking is unnecessary for GPU device access (that comes from the GPU mode flags). Non-local (cloud/routed/custom) GPU sandboxes are untouched. - **Bridge reachability re-checked after the downgrade** (with UFW auto-fix), since gateway startup skipped that probe while host networking was still requested. - **Runtime-context reachability gate.** The post-ready gate now probes `https://inference.local/v1/models` via `openshell sandbox exec` — the exact network namespace and route OpenClaw uses — instead of `docker exec`. Success requires a `2xx`; `000` (ECONNREFUSED), `4xx` (route/auth misconfig), and `5xx` (backend down) fail with actionable recovery. A genuinely missing `curl` soft-skips (OpenClaw's HTTP client does not need it); a broken sandbox exec path fails rather than masquerading as missing-curl. 
- **GPU E2E** (`test/e2e/test-gpu-e2e.sh`) now proves inference through `openshell sandbox exec` (the real runtime) and asserts the new gate, removing the `docker exec` shortcut that masked the bug. - `src/lib/onboard.ts` stays net-neutral (orchestration lives in `src/lib/onboard/`). ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx prek run --files` on the changed files (TS/biome/spdx/shellcheck clean; the only failures were unrelated env-flakes — missing plugin `node_modules` and 5s CLI-spawn timeouts under a loaded host — which pass with deps installed and a normal timeout: 152/152) - [x] `npm run build:cli`, `npm run typecheck:cli` - [x] `npx vitest run` for the gate (21), `test/onboard.test.ts` (66), `docker-gpu-patch` (50), `inference/local` (65), `provider-inference` (13), `docker-gpu-sandbox-create` (5) - [x] Tests added/updated for new and changed behavior (runtime-context probe, 2xx-only, local-only downgrade + bridge re-check, exec-failure vs missing-curl) - [x] No secrets, API keys, or credentials committed ### Reporter-workflow E2E evidence Full reporter reproduction requires Ubuntu 24.04 + NVIDIA GPU + native Docker (host-network GPU patch), which is not available on this CI-less dev host. The exact workflow is covered by the **GPU pipeline E2E** (`test/e2e/test-gpu-e2e.sh`, Brev GPU runner), which this PR extends to verify local inference **through `openshell sandbox exec`** (the agent runtime netns) and to assert the runtime-context gate — so a future regression cannot pass via the container-main-namespace shortcut. 
The root-cause *mechanism* was reproduced locally and hermetically (no GPU needed), modeling the OpenShell Docker-driver topology: a `--network host` container plus an inner `unshare -n` namespace (how OpenShell runs the sandbox agent):

```
[A] container MAIN netns (== host loopback under --network host; what docker exec / PR #4609 hit):
    http_code=200
    RESULT: OK-MAIN (reaches host Ollama)
[B] INNER netns via unshare -n (== OpenShell sandbox agent runtime / openshell sandbox exec):
    http_code=000
    RESULT: FAIL-INNER (ECONNREFUSED — matches the reporter)
```

This confirms why the `docker exec` probe passed while the agent got `ECONNREFUSED`, and why routing through the OpenShell-managed `inference.local` path (on the bridge) is the reachable fix.

---

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Verify GPU local inference from inside the sandbox runtime (not via host-network probes), reducing false positives and handling curl/unreachability scenarios more robustly.
* **Refactor**
  * Default Docker GPU patching for local providers now uses the OpenShell-managed bridge instead of host networking to improve inference accessibility and consistency.
* **Tests**
  * End-to-end and unit tests updated to exercise the sandbox-side inference path and cover success, skip, retry, and failure cases.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
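The local-only downgrade described in this commit might look something like the sketch below. Everything here is illustrative: `resolveGpuPatchNetwork`, the `LOCAL_PROVIDERS` set, and the treatment of an unset value as `bridge` are assumptions, not the project's code. The only grounded parts are that a `host` request for a local provider is downgraded to the bridge and that non-local providers are left untouched.

```typescript
// Hypothetical sketch of the downgrade decision.
type NetworkMode = "host" | "bridge";

// Assumption: Ollama and vLLM stand in for "local providers" here.
const LOCAL_PROVIDERS = new Set(["ollama", "vllm"]);

interface NetworkDecision {
  mode: NetworkMode;
  downgraded: boolean; // true when a host request was overridden
}

function resolveGpuPatchNetwork(
  provider: string,
  requested: string | undefined, // e.g. the value of NEMOCLAW_DOCKER_GPU_PATCH_NETWORK
): NetworkDecision {
  // Assumption: anything other than an explicit "host" means bridge.
  const wanted: NetworkMode = requested === "host" ? "host" : "bridge";
  if (wanted === "host" && LOCAL_PROVIDERS.has(provider)) {
    // The sandbox namespace cannot reach host loopback, so inference must
    // go through the bridge-routed inference.local path instead.
    return { mode: "bridge", downgraded: true };
  }
  return { mode: wanted, downgraded: false };
}
```

Returning the `downgraded` flag alongside the mode gives the caller a natural trigger for re-running the bridge reachability probe that gateway startup skipped under host mode.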
Summary
The Docker-driver GPU host-network path recreates the sandbox with `--network host` and wires OpenClaw to the direct `127.0.0.1` Ollama/vLLM URL, but onboarding declared success without proving the recreated container could actually reach that endpoint. A failed host-network recreate, an unexpected non-host network mode, or a host provider binding/state problem only surfaced later as an opaque `ECONNREFUSED` during the first agent prompt. This adds a post-recreate reachability gate so onboarding fails early with actionable output.
Related Issue
Fixes #4509
Changes
- `verifyDockerGpuHostNetworkLocalInference` in `src/lib/onboard/docker-gpu-local-inference.ts`: on the Docker-driver GPU host-network local-inference path it resolves the recreated OpenShell-managed container, asserts `HostConfig.NetworkMode` is `host`, and runs a bounded `docker exec curl` probe against the direct loopback health endpoint (`/api/tags` for Ollama, `/v1/models` for vLLM).
- […`NEMOCLAW_DOCKER_GPU_PATCH=0`), when the network mode is not host, or, to avoid false negatives, when a minimal/custom image lacks `curl` (soft-skip with a warning).
- Orchestration lives in `verifyGpuSandboxAfterReady` so `src/lib/onboard.ts` stays net-neutral per the codebase-growth guardrail (logic lives under `src/lib/onboard/`).
- Extends `test/e2e/test-gpu-e2e.sh` to assert the reachability proof when the direct sandbox URL is active, instead of only discovering failure during the agent prompt.
Type of Change
Verification
- `npm test` passes (unrelated `e2e-scenario` framework tests flaked on the shared host's default 5s timeout under concurrent load; they pass green at a 30s timeout in isolation)
- `npm run build:cli` and `npm run typecheck:cli` clean; `biome check` clean
Hardware-gated E2E gap
The full host-network proof requires an Ubuntu 24.04 + NVIDIA GPU + native Docker environment, which the triage host does not have. The container-inspection and probe decision logic is covered by unit tests with mocked Docker/OpenShell adapters; the live GPU host-network proof is exercised by `test/e2e/test-gpu-e2e.sh` on GPU hardware.
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary by CodeRabbit
New Features
Tests