fix(onboard): prove GPU sandbox local inference from the agent runtime (#4509)#5024
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)
📝 WalkthroughWalkthroughThe PR replaces Docker GPU host-network local inference verification with a preserve-network bridge mode and sandbox-runtime reachability probes. For local inference providers, host-network patching is downgraded to the OpenShell-managed bridge and reachability is verified by executing a bounded curl probe inside the sandbox via ChangesGPU Sandbox Preserve-Network Inference Verification
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
1118a90 to
564a0ae
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard/docker-gpu-local-inference.ts (1)
182-187: ⚡ Quick winConsider shell-escaping the endpoint parameter to prevent future injection risk.
The
endpointis directly interpolated into the shell script without escaping. Currently safe becausegetSandboxRuntimeInferenceEndpointalways returns the constantSANDBOX_RUNTIME_INFERENCE_ENDPOINT, but the function signature accepts any string. If the endpoint ever becomes dynamic, this would be a shell injection vulnerability.🛡️ Recommended fix: shell-escape the endpoint
One approach is to use a shell-safe quoting helper or validate the endpoint format:
function probeSandboxRuntimeInference( sandboxName: string, endpoint: string, deps: { execInSandbox: NonNullable<DockerGpuSandboxInferenceVerifyDeps["execInSandbox"]>; sleep: NonNullable<DockerGpuSandboxInferenceVerifyDeps["sleep"]>; }, ): RuntimeProbeOutcome { + // Validate endpoint format to prevent shell injection + if (!/^https?:\/\/[a-z0-9._\/-]+$/i.test(endpoint)) { + return { + kind: "exec-failed", + detail: `invalid endpoint format: ${endpoint}`, + }; + } const script = `if ! command -v curl >/dev/null 2>&1; then echo NO_CURL; exit 0; fi; ` + `code=$(curl -so /dev/null -w '%{http_code}' ` +Alternatively, use shell single-quoting (note: even single quotes need escaping if the endpoint contains single quotes):
- `--max-time ${DOCKER_GPU_INFERENCE_PROBE_MAX_TIME_SECS} ${endpoint} 2>/dev/null || echo 000); ` + + `--max-time ${DOCKER_GPU_INFERENCE_PROBE_MAX_TIME_SECS} '${endpoint.replace(/'/g, "'\\''")}' 2>/dev/null || echo 000); ` +🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-gpu-local-inference.ts` around lines 182 - 187, The interpolated endpoint in the constructed shell `script` string is not shell-escaped and could permit injection if `getSandboxRuntimeInferenceEndpoint` ever returns a dynamic value; update the code that builds `script` (the template string assigned to `script` in docker-gpu-local-inference.ts) to safely quote/escape the `endpoint` variable before inserting it (e.g., single-quote the value and replace any embedded single quotes with the POSIX-safe '\'' sequence, or use a shell-quoting helper such as printf %q), so the curl invocation uses the escaped endpoint and all tests/usage of DOCKER_GPU_INFERENCE_PROBE_* constants remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/test-gpu-e2e.sh`:
- Around line 569-576: run_sandbox_inference_probe currently clears
sandbox_probe_failure and discards the exit status of the probe by using "||
true", which hides TIMEOUT_CMD / openshell sandbox exec failures; change the
function so it captures the command's exit code and sets sandbox_probe_failure
on non-zero exit (while still capturing stdout/stderr into sandbox_response),
e.g. run the probe via TIMEOUT_CMD openshell sandbox exec ... -- sh -lc
"$sandbox_curl_cmd", save its output into sandbox_response and its exit code
into a local variable, and if the exit code is non-zero set
sandbox_probe_failure to a descriptive value (or return non-zero) instead of
swallowing the failure. Ensure references: run_sandbox_inference_probe,
sandbox_probe_failure, sandbox_response, TIMEOUT_CMD, openshell sandbox exec,
and $sandbox_curl_cmd.
---
Nitpick comments:
In `@src/lib/onboard/docker-gpu-local-inference.ts`:
- Around line 182-187: The interpolated endpoint in the constructed shell
`script` string is not shell-escaped and could permit injection if
`getSandboxRuntimeInferenceEndpoint` ever returns a dynamic value; update the
code that builds `script` (the template string assigned to `script` in
docker-gpu-local-inference.ts) to safely quote/escape the `endpoint` variable
before inserting it (e.g., single-quote the value and replace any embedded
single quotes with the POSIX-safe '\'' sequence, or use a shell-quoting helper
such as printf %q), so the curl invocation uses the escaped endpoint and all
tests/usage of DOCKER_GPU_INFERENCE_PROBE_* constants remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0fd7bd0b-e945-4416-827e-5eef9c60f2ec
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/docker-gpu-local-inference.test.tssrc/lib/onboard/docker-gpu-local-inference.tstest/e2e/test-gpu-e2e.sh
NVIDIA#4509) PR NVIDIA#4609 verified host-network GPU local inference with `docker exec` against the recreated `--network host` container, whose main network namespace IS the host's — so the probe passed while the OpenClaw agent, which runs in OpenShell's isolated sandbox network namespace, still got ECONNREFUSED on the direct 127.0.0.1 provider URL. The sandbox namespace cannot reach the host loopback even under `--network host` (see detectSandboxFallbackDns), so the direct-loopback wiring was unreachable. - Never pin OpenClaw to a direct container-loopback inference URL; for local providers, downgrade an opted-in host-network GPU patch to the OpenShell bridge so inference routes through the reachable inference.local path (host networking is not needed for GPU access). - Re-run the sandbox bridge reachability probe (with UFW auto-fix) after the downgrade, since gateway startup skipped it under host mode. - Replace the docker-exec gate with a runtime-context probe via `openshell sandbox exec` that hits inference.local exactly as the agent does, requiring 2xx; 000/4xx/5xx fail with actionable recovery. Soft-skip only when the sandbox image genuinely lacks curl. - Update the GPU E2E to prove inference through `openshell sandbox exec` (the real runtime), removing the docker-exec shortcut that masked the bug. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
564a0ae to
b005d86
Compare
|
Addressed both CodeRabbit findings in
|
## Summary - Add v0.0.62 release notes from Discussion #5100 and link release highlights to the relevant docs pages. - Document the release's GPU sandbox recreation, sandbox-side local inference verification, and Hermes dashboard port guard in the command and inference references. - Refresh generated NemoClaw user skills for the release-prep docs set. ## Source Summary - #4956 -> `docs/reference/commands.mdx`: Document CDI-first Docker GPU recreation behavior for Linux Docker-driver sandboxes. - #5024 -> `docs/inference/use-local-inference.mdx`: Document sandbox-runtime verification of the `inference.local` local inference route. - #5018 -> `docs/reference/commands.mdx`: Document Jetson/Tegra device-node group propagation for sandbox CUDA initialization. - #5012, #4763, #4706, #5030, #5015 -> `docs/about/release-notes.mdx`: Summarize onboarding and recovery reliability fixes, including the reserved Hermes API port guard. - #5017 and #5043 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Summarize mutable OpenClaw config recovery and host-side `agents list` coverage. - #5010 and #5016 -> `docs/about/release-notes.mdx`: Summarize Hermes upstream metadata visibility and WhatsApp QR rendering reliability. - #5045 and prior source docs in the v0.0.62 range -> `.agents/skills/`: Refresh generated user-skill references from the current docs source. ## Skipped - #5019 -> skipped for new prose because it touched `openclaw-sandbox-permissive.yaml`, which matches `docs/.docs-skip`. Existing source docs remain the source for generated skill synchronization. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` (passes; Fern reports 0 errors and 1 hidden warning) - Pre-commit hooks passed during commit, including docs-to-skills verification, markdown lint, gitleaks, and skills YAML tests. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added `nemoclaw <name> agents list` command. * v0.0.62 release notes added summarizing onboarding and recovery improvements. * **Bug Fixes** * Improved GPU sandbox onboarding reliability (NVIDIA CDI path, Jetson/Tegra device handling). * Better local inference verification and recovery for Linux Docker-driver GPU sandboxes. * Quieter/earlier handling of onboarding drift and port collisions. * **Documentation** * Expanded GPU passthrough, inference verification, writable paths (`/dev/pts`), port 8642 restriction, and command examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Prekshi Vyas <34834085+prekshivyas@users.noreply.github.com>
Summary
Reopened #4509: on an Ubuntu 24.04 GPU host-network setup, onboard printed "local inference reachable" yet the agent then failed with
ECONNREFUSED/ "LLM request failed: network connection error". PR #4609 proved reachability withdocker execagainst the recreated--network hostcontainer — whose main network namespace is the host's — but OpenClaw runs in OpenShell's isolated sandbox network namespace, which cannot reach the host loopback even under--network host. So the direct127.0.0.1provider URL was unreachable for the agent while the probe falsely passed. This fixes the URL/network mapping and verifies it from the real runtime context.Related Issue
Fixes #4509
Changes
NEMOCLAW_DOCKER_GPU_PATCH_NETWORK=host) is downgraded to the OpenShell bridge so inference routes through the reachableinference.localpath. Host networking is unnecessary for GPU device access (that comes from the GPU mode flags). Non-local (cloud/routed/custom) GPU sandboxes are untouched.https://inference.local/v1/modelsviaopenshell sandbox exec— the exact network namespace and route OpenClaw uses — instead ofdocker exec. Success requires a2xx;000(ECONNREFUSED),4xx(route/auth misconfig), and5xx(backend down) fail with actionable recovery. A genuinely missingcurlsoft-skips (OpenClaw's HTTP client does not need it); a broken sandbox exec path fails rather than masquerading as missing-curl.test/e2e/test-gpu-e2e.sh) now proves inference throughopenshell sandbox exec(the real runtime) and asserts the new gate, removing thedocker execshortcut that masked the bug.src/lib/onboard.tsstays net-neutral (orchestration lives insrc/lib/onboard/).Type of Change
Verification
npx prek run --fileson the changed files (TS/biome/spdx/shellcheck clean; the only failures were unrelated env-flakes — missing pluginnode_modulesand 5s CLI-spawn timeouts under a loaded host — which pass with deps installed and a normal timeout: 152/152)npm run build:cli,npm run typecheck:clinpx vitest runfor the gate (21),test/onboard.test.ts(66),docker-gpu-patch(50),inference/local(65),provider-inference(13),docker-gpu-sandbox-create(5)Reporter-workflow E2E evidence
Full reporter reproduction requires Ubuntu 24.04 + NVIDIA GPU + native Docker (host-network GPU patch), which is not available on this CI-less dev host. The exact workflow is covered by the GPU pipeline E2E (
test/e2e/test-gpu-e2e.sh, Brev GPU runner), which this PR extends to verify local inference throughopenshell sandbox exec(the agent runtime netns) and to assert the runtime-context gate — so a future regression cannot pass via the container-main-namespace shortcut.The root-cause mechanism was reproduced locally and hermetically (no GPU needed), modeling the OpenShell Docker-driver topology — a
--network hostcontainer plus an innerunshare -nnamespace (how OpenShell runs the sandbox agent):This confirms why the
docker execprobe passed while the agent gotECONNREFUSED, and why routing through the OpenShell-managedinference.localpath (on the bridge) is the reachable fix.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
Bug Fixes
Refactor
Tests