fix(onboard): reuse containerized gateway and repair routed provider reachability (#4520, #4564) #4601
…reachability (NVIDIA#4520, NVIDIA#4564)

Two Docker-driver onboarding fixes sharing the runtime-identity / host-service reachability axis.

NVIDIA#4520: The glibc-compat gateway runs as a host-side `docker run ... openshell-gateway` parent, so /proc/<pid>/exe is /usr/bin/docker. buildDockerDriverGatewayRuntimeIdentity() encodes this with driftGatewayBin=null to skip the executable check, but both callers used `runtimeIdentity?.driftGatewayBin ?? gatewayBin`, coalescing the deliberate null back to the host binary and falsely marking a healthy compat gateway stale, triggering a recreate and port 8080 collision on the second onboard. Add resolveDriftGatewayBin(), which preserves the null, and use it in refreshDockerDriverGatewayReuseState and startDockerDriverGateway.

NVIDIA#4564: On Linux Docker-driver hosts, Model Router on localhost:4000 is the sandbox loopback, not the host router. Resume only reconciled the router and left a stale localhost provider base URL, and there was no sandbox-network reachability probe like the Ollama auth proxy has. Extract a generic host-service reachability probe (reused by the Ollama wrapper), probe host.openshell.internal:<routerPort> from the OpenShell Docker network inside reconcileModelRouter() with a concrete ufw remediation on tcp_failed, and re-upsert the routed provider with the normalized host alias (and the routed profile's credential env) on resume so stale localhost entries are repaired.

Non-Docker-driver behavior is unchanged. New logic lives under src/lib/onboard/** to keep onboard.ts net-neutral.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
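The null-coalescing pitfall described for NVIDIA#4520 can be illustrated with a minimal sketch. The `RuntimeIdentitySketch` shape and both function names below are hypothetical stand-ins, not the repository's actual code; the only behavior taken from the commit message is that `??` turns a deliberate `null` back into the host binary, while the fix preserves it.

```typescript
// Hypothetical shape: null means "deliberately skip the executable drift check".
interface RuntimeIdentitySketch {
  driftGatewayBin?: string | null;
}

// Old behavior: `??` treats null the same as undefined, so the deliberate
// null is replaced by the host gateway binary.
function coalescedDriftBin(
  identity: RuntimeIdentitySketch | undefined,
  gatewayBin: string,
): string | null {
  return identity?.driftGatewayBin ?? gatewayBin;
}

// Sketch of a null-preserving resolver: fall back to the host binary only
// when no value was provided at all (undefined), never when it is null.
function resolveDriftGatewayBinSketch(
  identity: RuntimeIdentitySketch | undefined,
  gatewayBin: string,
): string | null {
  const drift = identity?.driftGatewayBin;
  return drift === undefined ? gatewayBin : drift;
}
```

With a containerized identity of `{ driftGatewayBin: null }`, the first function returns the host binary path (which would re-enable the executable check), while the second returns `null`.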
No actionable comments were generated in the recent review.
Walkthrough
This PR extends the OpenShell onboarding flow with Model Router (routed inference) provider support and improves host service reachability diagnostics. It introduces a generic Docker sandbox-to-host TCP probe, refactors Ollama proxy reachability to reuse it, normalizes routed provider endpoint URLs from localhost to sandbox-facing host aliases, resolves credential environments following profile/router precedence, verifies Model Router reachability after reconciliation, improves drift gateway binary selection semantics in containerized environments, and integrates these pieces into the main onboarding flow and provider-inference state machine.
Changes
Routed Inference & Host Reachability
Estimated code review effort: 4 (Complex) | ~45 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard/routed-inference.ts`:
- Around line 62-70: The URL rewrite in normalizeRoutedEndpointUrl currently
appends a colon even when parsed.port is empty and drops query/hash; update the
try block handling for localhost/127.0.0.1 so that you only include
`:${parsed.port}` when parsed.port is non-empty, and append parsed.pathname +
parsed.search + parsed.hash to preserve path, query and fragment; keep the
existing HOST_GATEWAY_URL prefix and the try/catch behavior, and return the
original url in the catch or when not localhost.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e76c798c-f364-403f-9564-867b7d35fc2e
📒 Files selected for processing (12)
src/lib/onboard.ts
src/lib/onboard/docker-driver-gateway-launch.test.ts
src/lib/onboard/docker-driver-gateway-launch.ts
src/lib/onboard/host-service-reachability.test.ts
src/lib/onboard/host-service-reachability.ts
src/lib/onboard/machine/handlers/provider-inference.test.ts
src/lib/onboard/machine/handlers/provider-inference.ts
src/lib/onboard/model-router.ts
src/lib/onboard/ollama-proxy-reachability.ts
src/lib/onboard/routed-inference.test.ts
src/lib/onboard/routed-inference.ts
test/onboard-gateway-runtime.test.ts
normalizeRoutedEndpointUrl emitted a dangling colon (host.openshell.internal:/v1) for a portless localhost endpoint and dropped any query/hash. Only append the port when present and carry through search and hash.

Addresses CodeRabbit review on NVIDIA#4601.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
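The corrected rewrite can be sketched roughly as follows. This is an illustration built from the review comment, not the code in `src/lib/onboard/routed-inference.ts`: the function name carries a `Sketch` suffix, and the value of `HOST_GATEWAY_URL` (scheme included) is an assumption.

```typescript
// Assumption: the real HOST_GATEWAY_URL constant may use a different value.
const HOST_GATEWAY_URL = "http://host.openshell.internal";

function normalizeRoutedEndpointUrlSketch(url: string): string {
  try {
    const parsed = new URL(url);
    // Only loopback endpoints are rewritten; everything else passes through.
    if (parsed.hostname !== "localhost" && parsed.hostname !== "127.0.0.1") {
      return url;
    }
    // Append ":port" only when a port is present, avoiding a dangling colon.
    const port = parsed.port ? `:${parsed.port}` : "";
    // Carry path, query, and fragment through unchanged.
    return `${HOST_GATEWAY_URL}${port}${parsed.pathname}${parsed.search}${parsed.hash}`;
  } catch {
    // Unparseable input is returned as-is.
    return url;
  }
}
```

Under these assumptions, `http://localhost:4000/v1` becomes `http://host.openshell.internal:4000/v1`, and a portless `http://localhost/v1?x=1#f` becomes `http://host.openshell.internal/v1?x=1#f` with no stray colon.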
## Summary

- Add the missing `v0.0.57` release-notes section with links to the detailed docs pages for command, inference, onboarding, messaging, status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation split, including Hermes-specific reference files.

## Source summary

- #4615 and #4653 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover host-side `sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON` secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`: Release notes now cover managed vLLM progress/readiness, DGX Spark model default changes, local Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover UFW auto-remediation, local-inference reachability gates, gateway reuse/binding, cancel rollback, and policy selection persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release notes now cover status failure layers, paused-container hints, Docker-driver doctor behavior, and non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Release notes now cover installer tag pinning, PyPI `uv` policy access, and observable Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current per-agent docs source, including newly generated Hermes reference files.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks, skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Restructured documentation to clearly distinguish OpenClaw and Hermes agent variants throughout user guides.
  * Enhanced security, credential storage, and deployment guidance with clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
  * Improved workspace, messaging, and policy management references with variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Two Docker-driver onboarding fixes that share the runtime-identity / host-service reachability axis: a healthy containerized compatibility gateway is now reused instead of falsely recreated (#4520), and Model Router provider setup/resume now binds the sandbox-facing host alias and verifies sandbox→router reachability (#4564).
Related Issue
Fixes #4520
Fixes #4564
Changes
#4520: false-stale containerized gateway
- The glibc-compat gateway runs as a host-side `docker run … openshell-gateway` parent, so `/proc/<pid>/exe` is `/usr/bin/docker`. `buildDockerDriverGatewayRuntimeIdentity()` encodes this with `driftGatewayBin: null` to skip the executable check, but both callers used `runtimeIdentity?.driftGatewayBin ?? gatewayBin`, coalescing the deliberate `null` back to the host binary and marking a healthy gateway stale → recreate → port 8080 collision on the second onboard.
- Add `resolveDriftGatewayBin()`, which preserves the `null`, and use it in `refreshDockerDriverGatewayReuseState()` and `startDockerDriverGateway()`.

#4564: Model Router unreachable on Linux Docker-driver
- Extract a generic `host-service-reachability` probe (short-lived container on the OpenShell Docker network → `host.openshell.internal:<port>`); the Ollama auth-proxy probe is now a thin wrapper over it.
- `reconcileModelRouter()` probes `host.openshell.internal:<routerPort>` after the router is healthy and prints a concrete `sudo ufw allow … port <routerPort>` remediation on `tcp_failed` (non-fatal when the sandbox network does not yet exist).
- Resume re-upserts the routed provider with the normalized `host.openshell.internal` base URL and the routed profile's credential env, repairing a stale `localhost:4000` entry left by an earlier run.
- New logic lives under `src/lib/onboard/**`; `src/lib/onboard.ts` is net-neutral.

Non-Docker-driver behavior is unchanged.
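How a caller might turn the probe result into user-facing output can be sketched as below. Only the `tcp_failed` status comes from the PR description; the other status names, the `HostProbeResultSketch` shape, and the hint wording are hypothetical, and the sketch does not reproduce the exact `ufw` command the real code prints.

```typescript
// Hypothetical statuses: only "tcp_failed" is named in the PR description.
type ProbeStatusSketch = "ok" | "tcp_failed" | "network_missing";

interface HostProbeResultSketch {
  status: ProbeStatusSketch;
  host: string;
  port: number;
}

// Returns a hint to print, or null when there is nothing to report.
function routerProbeHint(result: HostProbeResultSketch): string | null {
  switch (result.status) {
    case "ok":
      // The sandbox network can reach the host service.
      return null;
    case "network_missing":
      // The sandbox network does not exist yet, so the probe is skipped
      // rather than treated as a failure.
      return null;
    case "tcp_failed":
      return (
        `Sandbox containers cannot reach ${result.host}:${result.port}. ` +
        `A host firewall such as ufw may be blocking TCP port ${result.port}.`
      );
  }
}
```

Keeping the interpretation in a pure function like this makes it easy to unit-test without Docker; the short-lived probe container itself would live behind a separate function that produces the result object.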
Type of Change
Verification
- `npm test` passes (the only failures on this host are pre-existing/environmental: an unbuilt `nemoclaw/` subproject and an ownership-dependent chmod test that fails on a clean tree; all pass in isolation once `nemoclaw/` is built)
- `npm run typecheck:cli` and `codex review --uncommitted` are clean

Reproduced both bugs against compiled `dist` before fixing, then confirmed the fixes through the same reproductions and new unit tests.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary by CodeRabbit
New Features
Bug Fixes
Tests