Skip to content

fix(onboard): reuse containerized gateway and repair routed provider reachability (#4520, #4564)#4601

Merged
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/4520-docker-driver-onboard-host-services
Jun 1, 2026
Merged

fix(onboard): reuse containerized gateway and repair routed provider reachability (#4520, #4564)#4601
cv merged 2 commits into
NVIDIA:mainfrom
yimoj:fix/4520-docker-driver-onboard-host-services

Conversation

@yimoj

@yimoj yimoj commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Two Docker-driver onboarding fixes that share the runtime-identity / host-service reachability axis: a healthy containerized compatibility gateway is now reused instead of falsely recreated (#4520), and Model Router provider setup/resume now binds the sandbox-facing host alias and verifies sandbox→router reachability (#4564).

Related Issue

Fixes #4520
Fixes #4564

Changes

#4520 — false-stale containerized gateway

  • The glibc-compat gateway runs as a host-side docker run … openshell-gateway parent, so /proc/<pid>/exe is /usr/bin/docker. buildDockerDriverGatewayRuntimeIdentity() encodes this with driftGatewayBin: null to skip the executable check, but both callers used runtimeIdentity?.driftGatewayBin ?? gatewayBin, coalescing the deliberate null back to the host binary and marking a healthy gateway stale → recreate → port 8080 collision on the second onboard.
  • Add resolveDriftGatewayBin() which preserves the null, and use it in refreshDockerDriverGatewayReuseState() and startDockerDriverGateway().

#4564 — Model Router unreachable on Linux Docker-driver

  • Extract a generic host-service-reachability probe (short-lived container on the OpenShell Docker network → host.openshell.internal:<port>); the Ollama auth-proxy probe is now a thin wrapper over it.
  • reconcileModelRouter() probes host.openshell.internal:<routerPort> after the router is healthy and prints a concrete sudo ufw allow … port <routerPort> remediation on tcp_failed (non-fatal when the sandbox network does not yet exist).
  • On resume, re-upsert the routed provider with the normalized host.openshell.internal base URL and the routed profile's credential env, repairing a stale localhost:4000 entry left by an earlier run.
  • New logic lives under src/lib/onboard/**; src/lib/onboard.ts is net-neutral.

Non-Docker-driver behavior is unchanged.

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test passes (the only failures on this host are pre-existing/environmental: an unbuilt nemoclaw/ subproject and an ownership-dependent chmod test that fails on a clean tree; all pass in isolation once nemoclaw/ is built)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • CLI typecheck (npm run typecheck:cli) and codex review --uncommitted clean

Reproduced both bugs against compiled dist before fixing, then confirmed the fixes through the same reproductions and new unit tests.


Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Onboarding now reliably wires routed inference providers and re-applies provider endpoints/credentials during resume.
    • Added sandbox reachability probing with user-facing remediation guidance for host-exposed services; reused for the Ollama proxy.
  • Bug Fixes

    • Prevents stale/localhost endpoints from breaking routed providers by ensuring refreshed endpoint resolution and error handling.
  • Tests

    • Expanded test coverage for gateway binary resolution, routed provider behavior, and sandbox reachability scenarios.

…reachability (NVIDIA#4520, NVIDIA#4564)

Two Docker-driver onboarding fixes sharing the runtime-identity / host-service
reachability axis.

NVIDIA#4520: The glibc-compat gateway runs as a host-side `docker run ...
openshell-gateway` parent, so /proc/<pid>/exe is /usr/bin/docker.
buildDockerDriverGatewayRuntimeIdentity() encodes this with
driftGatewayBin=null to skip the executable check, but both callers used
`runtimeIdentity?.driftGatewayBin ?? gatewayBin`, coalescing the deliberate
null back to the host binary and falsely marking a healthy compat gateway
stale, triggering a recreate and port 8080 collision on the second onboard.
Add resolveDriftGatewayBin() which preserves the null, and use it in
refreshDockerDriverGatewayReuseState and startDockerDriverGateway.

NVIDIA#4564: On Linux Docker-driver hosts, Model Router on localhost:4000 is the
sandbox loopback, not the host router. Resume only reconciled the router and
left a stale localhost provider base URL, and there was no sandbox-network
reachability probe like the Ollama auth proxy has. Extract a generic
host-service reachability probe (reused by the Ollama wrapper), probe
host.openshell.internal:<routerPort> from the OpenShell Docker network inside
reconcileModelRouter() with a concrete ufw remediation on tcp_failed, and
re-upsert the routed provider with the normalized host alias (and the routed
profile's credential env) on resume so stale localhost entries are repaired.

Non-Docker-driver behavior is unchanged. New logic lives under
src/lib/onboard/** to keep onboard.ts net-neutral.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3f3160bb-3d9b-447d-9010-9641e9dfff22

📥 Commits

Reviewing files that changed from the base of the PR and between 570629c and 3e7b849.

📒 Files selected for processing (2)
  • src/lib/onboard/routed-inference.test.ts
  • src/lib/onboard/routed-inference.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/routed-inference.ts
  • src/lib/onboard/routed-inference.test.ts

📝 Walkthrough

Walkthrough

This PR extends the OpenShell onboarding flow with Model Router (routed inference) provider support and improves host service reachability diagnostics. It introduces a generic Docker sandbox-to-host TCP probe, refactors Ollama proxy reachability to reuse it, normalizes routed provider endpoint URLs from localhost to sandbox-facing host aliases, resolves credential environments following profile/router precedence, verifies Model Router reachability after reconciliation, improves drift gateway binary selection semantics in containerized environments, and integrates these pieces into the main onboarding flow and provider-inference state machine.

Changes

Routed Inference & Host Reachability

Layer / File(s) Summary
Generic host service sandbox reachability probe
src/lib/onboard/host-service-reachability.ts, src/lib/onboard/host-service-reachability.test.ts
Implements reusable TCP connectivity verification from Docker sandbox to host services. Inspects Docker network IPAM to resolve gateway IP, runs an nc-based probe container, classifies results as ok/tcp_failed/probe_unavailable, and generates user-facing UFW remediation messages with subnet/gateway fallback logic.
Routed inference provider utilities
src/lib/onboard/routed-inference.ts, src/lib/onboard/routed-inference.test.ts
Provides normalizeRoutedEndpointUrl (rewriting localhost/127.0.0.1 to sandbox host alias), resolveRoutedCredentialEnv (profile/router/default precedence), and upsertRoutedProvider (orchestrates normalization, resolution, and injected upsert with structured result return).
Drift gateway binary selection helper
src/lib/onboard/docker-driver-gateway-launch.ts, src/lib/onboard/docker-driver-gateway-launch.test.ts
Adds resolveDriftGatewayBin() to properly select gateway binary for drift comparison, preserving null semantics to skip executable checks in Docker compat mode and falling back to provided gateway path when identity is unavailable.
Model Router sandbox reachability verification
src/lib/onboard/model-router.ts
Adds verifyModelRouterSandboxReachability() helper invoked after successful router reconciliation to probe sandbox connectivity and throw with formatted remediation messages on tcp_failed; treats probe_unavailable as non-fatal.
Ollama proxy reachability refactoring
src/lib/onboard/ollama-proxy-reachability.ts
Refactors from self-contained probe to thin wrapper delegating to generic host-service-reachability, removing duplicate Docker inspection/probe logic while preserving public API via type aliases and delegation.
Provider inference state routed re-upsert handling
src/lib/onboard/machine/handlers/provider-inference.ts, src/lib/onboard/machine/handlers/provider-inference.test.ts
Adds reupsertRoutedProvider dependency hook enabling re-upsert of routed providers during resume with corrected sandbox-facing endpoint URLs, preventing stale localhost endpoints from earlier runs.
Main onboard.ts integration
src/lib/onboard.ts
Wires routed inference: imports routedInference module, uses resolveDriftGatewayBin for gateway drift comparison in reuse/startup flows, delegates routed provider setup to upsertRoutedProvider with error handling, and wires reupsertRoutedProvider callback for provider-inference state.
Runtime drift detection regression test
test/onboard-gateway-runtime.test.ts
Verifies containerized-compat gateway drift detection handles null gatewayBin correctly (skipped check) and detects staleness when gatewayBin is mistakenly coalesced to host path.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through nets and gateway mist,
Probed the host where ports might twist,
Routers warmed and endpoints aligned,
Credentials found and upserts signed,
Onboarding stitched — a joyful list!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.77% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically summarizes the main changes: containerized gateway reuse and routed provider reachability repair, referencing two related issue numbers.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard/routed-inference.ts`:
- Around line 62-70: The URL rewrite in normalizeRoutedEndpointUrl currently
appends a colon even when parsed.port is empty and drops query/hash; update the
try block handling for localhost/127.0.0.1 so that you only include
`:${parsed.port}` when parsed.port is non-empty, and append parsed.pathname +
parsed.search + parsed.hash to preserve path, query and fragment; keep the
existing HOST_GATEWAY_URL prefix and the try/catch behavior, and return the
original url in the catch or when not localhost.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e76c798c-f364-403f-9564-867b7d35fc2e

📥 Commits

Reviewing files that changed from the base of the PR and between df7d054 and 570629c.

📒 Files selected for processing (12)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-launch.test.ts
  • src/lib/onboard/docker-driver-gateway-launch.ts
  • src/lib/onboard/host-service-reachability.test.ts
  • src/lib/onboard/host-service-reachability.ts
  • src/lib/onboard/machine/handlers/provider-inference.test.ts
  • src/lib/onboard/machine/handlers/provider-inference.ts
  • src/lib/onboard/model-router.ts
  • src/lib/onboard/ollama-proxy-reachability.ts
  • src/lib/onboard/routed-inference.test.ts
  • src/lib/onboard/routed-inference.ts
  • test/onboard-gateway-runtime.test.ts

Comment thread src/lib/onboard/routed-inference.ts
normalizeRoutedEndpointUrl emitted a dangling colon (host.openshell.internal:/v1)
for a portless localhost endpoint and dropped any query/hash. Only append the
port when present and carry through search and hash. Addresses CodeRabbit review
on NVIDIA#4601.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj added the v0.0.56 Release target label Jun 1, 2026
@wscurran wscurran added Docker platform: ubuntu Affects Ubuntu Linux environments labels Jun 1, 2026
1 similar comment
@cv cv merged commit 44b23d0 into NVIDIA:main Jun 1, 2026
30 checks passed
@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output area: packaging Packages, images, registries, installers, or distribution bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images and removed area: packaging Packages, images, registries, installers, or distribution NemoClaw CLI labels Jun 3, 2026
cv pushed a commit that referenced this pull request Jun 3, 2026
## Summary
- Add the missing `v0.0.57` release-notes section with links to the
detailed docs pages for command, inference, onboarding, messaging,
status, installer, and policy changes.
- Remove public references to docs-skip terms from source docs and
regenerate the NemoClaw user skills from the current Fern MDX docs.
- Carry forward generated references for the per-agent documentation
split, including Hermes-specific reference files.

## Source summary
- #4615 and #4653 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover host-side
`sessions` and `agents` commands plus `NEMOCLAW_EXTRA_AGENTS_JSON`
secondary-agent baking.
- #4163, #4204, #4611, #4619, and #4676 ->
`docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`: Release notes now cover
managed vLLM progress/readiness, DGX Spark model default changes, local
Ollama streaming usage, and inference route divergence warnings.
- #4267, #4601, #4609, #4642, #4645, and #4661 ->
`docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Release
notes now cover UFW auto-remediation, local-inference reachability
gates, gateway reuse/binding, cancel rollback, and policy selection
persistence.
- #4577, #4582, #4607, and #4660 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/messaging-channels.mdx`: Release notes now cover
Slack validation, atomic `channels add`, WhatsApp QR diagnostics, and
Slack placeholder normalization.
- #4388, #4600, #4646, and #4647 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Release notes now cover status failure
layers, paused-container hints, Docker-driver doctor behavior, and
non-destructive stale-registry recovery.
- #4569, #4579, and #4678 -> `docs/about/release-notes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Release notes now
cover installer tag pinning, PyPI `uv` policy access, and observable
Jira validation.
- #4632 -> `.agents/skills/`: Regenerated user skills from the current
per-agent docs source, including newly generated Hermes reference files.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs --glob "*.mdx"`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" .agents/skills --glob "*.md"`
- `npm run docs`
- `npm run build:cli`
- Commit hooks: markdownlint, docs-to-skills verification, gitleaks,
skills YAML, commitlint

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Restructured documentation to clearly distinguish OpenClaw and Hermes
agent variants throughout user guides.
* Enhanced security, credential storage, and deployment guidance with
clearer setup flows.
  * Added Hermes plugin installation and ecosystem documentation.
* Improved workspace, messaging, and policy management references with
variant-specific command examples.
  * Refined troubleshooting and CLI reference sections for clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output bug-fix PR fixes a bug or regression platform: container Affects Docker, containerd, Podman, or images platform: ubuntu Affects Ubuntu Linux environments v0.0.57 Release target

Projects

None yet

3 participants