fix(onboard): catch host DNS block before provider validation (#4784)#5041
Conversation
…#4784) A host OUTPUT-chain firewall that drops tcp/udp:53 (corporate firewall, VPN kill-switch, or a literal `iptables -I OUTPUT -p udp --dport 53 -j DROP`) leaves the NemoClaw CLI process unable to resolve the provider endpoint. The existing container DNS probe still passes in that state — docker's embedded resolver and bridge NAT take a different egress path — so preflight printed "Container DNS resolution works" and onboarding only failed much later at NVIDIA Endpoints validation with the cryptic `curl: (6) Could not resolve host: integrate.api.nvidia.com`. Add a host-side DNS preflight that resolves the provider endpoint from the CLI process (via `dns.lookup`/getaddrinfo, so it follows the same `/etc/hosts`/nsswitch path the later curl validation uses) and prints a `Host DNS resolution` line distinct from the container one. On a fatal failure it fails early with host-DNS/VPN/firewall/iptables remediation, before provider validation. The probe runs only when NVIDIA Endpoints is the effective provider: non-interactive `NEMOCLAW_PROVIDER`, else the recorded session/registry provider, else the non-interactive default. Interactive runs (provider not yet chosen), explicit local/non-NVIDIA providers, proxied provider hosts (HTTPS_PROXY honoring NO_PROXY), and `NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT=1` all skip it so valid local/air-gapped/proxy onboarding is never blocked. src/lib/onboard.ts stays net-neutral (logic lives in onboard/ modules). Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
📝 WalkthroughWalkthroughThis PR adds a host-side DNS preflight check to detect NVIDIA Endpoints provider hostname resolution failures before container DNS tests run. The feature includes a probe implementation, assertion logic with remediation output, integration into the onboarding flow, and comprehensive test coverage. ChangesHost-side DNS preflight checks
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
6189-6199: Run the recommended onboarding E2E slices for this wiring change.Since this change threads provider context through preflight in
src/lib/onboard.ts, run the targeted nightly E2E subset to validate resume/non-interactive provider-gating behavior end to end.As per coding guidelines, "
src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 6189 - 6199, Change threads provider context through preflight; run targeted nightly E2E slices to validate resume/non-interactive provider-gating behavior by exercising the wiring around runPreflight (preflight), assertDockerBridgeAndContainerDnsHealthy, rejectUnsupportedContainerRuntime, and assertCdiNvidiaGpuSpecPresent; specifically run the subset that covers resume paths, non-interactive flows, provider-recorded session reads (readRecordedProvider), and Docker/Podman runtime gating to ensure the new session?.provider propagation behaves correctly end-to-end.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 6189-6199: Change threads provider context through preflight; run
targeted nightly E2E slices to validate resume/non-interactive provider-gating
behavior by exercising the wiring around runPreflight (preflight),
assertDockerBridgeAndContainerDnsHealthy, rejectUnsupportedContainerRuntime, and
assertCdiNvidiaGpuSpecPresent; specifically run the subset that covers resume
paths, non-interactive flows, provider-recorded session reads
(readRecordedProvider), and Docker/Podman runtime gating to ensure the new
session?.provider propagation behaves correctly end-to-end.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 15a037c5-ceb5-401a-b25a-b7dec35677d4
📒 Files selected for processing (4)
src/lib/onboard.tssrc/lib/onboard/bridge-dns-preflight.tssrc/lib/onboard/host-dns-preflight.test.tssrc/lib/onboard/preflight.ts
|
Re: CodeRabbit nitpick on the
|
prekshivyas
left a comment
There was a problem hiding this comment.
LGTM. Correctly addresses #4784 by catching a blocked host DNS in preflight instead of late during provider validation — and it's appropriately fail-safe for a blocking check:
- Only a fatal probe result aborts; inconclusive/transient failures warn-and-continue (
!isFatalHostDnsProbeFailure → console.warn). - Skips when an HTTPS proxy is configured (provider reached via proxy, host DNS irrelevant) and via the
NEMOCLAW_SKIP_HOST_DNS_PREFLIGHTescape hatch. - Gated to NVIDIA-endpoint providers, so local/Ollama onboards aren't subjected to it.
The error path names the skip env. CI green, test present. Given the size (~833 LOC) the probe internals (probeHostDns/isFatalHostDnsProbeFailure) are worth a maintainer eyeball, but the decision-boundary behavior is sound and won't over-block.
Biome was collapsing wrapped console.log/warn calls; apply it so static-checks' formatter hook leaves the files unchanged. Formatting only. Co-authored-by: yimoj <yimoj@nvidia.com> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…IA#4784) The biome-format pass expanded the threaded provider context into multi-line calls, pushing src/lib/onboard.ts net +19 and failing the codebase-growth-guardrails entrypoint budget. Keep onboard.ts net-neutral: pass only isNonInteractive() to the host DNS gate (one short line) and drop the recorded-provider threading through preflight() / the resume backstop. The host DNS check now gates purely on non-interactive NEMOCLAW_PROVIDER (unset → NVIDIA Endpoints default), which still catches the reporter's `nemoclaw onboard --non-interactive` repro and skips explicit local/non-NVIDIA providers and interactive runs. Simplify the gate accordingly (no providerKey param / recorded-name classification), since nothing wires it now. The fresh, no-recorded-state repro is unaffected (recorded provider was null there anyway); non-interactive local reruns set NEMOCLAW_PROVIDER, and NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT remains the escape hatch. Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
## Summary - Add the v0.0.63 release-note section using the published development note as source context. - Update source docs for sandbox recovery, OpenClaw config restore safety, managed vLLM selection, Slack Socket Mode conflict handling, and host diagnostics. - Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX docs. - Update the release-doc refresh skill so post-release docs for version `n` look up the matching announcement discussion and use the `n+1` patch release label. - Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention inside the `upgrade-sandboxes` command section. ## Source summary - #5034 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document safer stale-sandbox recovery through `rebuild --yes` before recreating from scratch. - #5091 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Docker-driver post-reboot recovery from OpenShell container labels. - #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`, `docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json` preservation, merge behavior, and fail-safe restore handling. - #5102 -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`, `docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`: Document `upgrade-sandboxes` image-fingerprint drift detection. - #4201 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document the installer diagnostic for unexpected Docker daemon access outside the `docker` group. - #5038 -> `docs/inference/inference-options.mdx`, `docs/inference/use-local-inference.mdx`, `docs/about/release-notes.mdx`: Document the interactive managed-vLLM model picker and non-interactive override behavior. - #5040, #5041 -> `docs/reference/troubleshooting.mdx`, `docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and host DNS preflight diagnostics. - #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`, `docs/about/release-notes.mdx`: Document Slack validation and duplicate Slack Socket Mode sandbox handling. - #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway secret-guard and wrapped-argv startup hardening in the release surface. - Follow-up -> `.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the post-release docs workflow, discussion-announcement lookup, and next-patch release label rule. - Follow-up -> `docs/reference/commands.mdx`, `docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile sandbox text so CLI parity does not treat `--from` as an `upgrade-sandboxes` flag. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `npm run docs` - `npm run build:cli` - `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli` - Skip-term scan for `docs/.docs-skip` blocked terms across generated user skills <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced local inference setup with interactive model selection prompts and environment variable overrides * Improved sandbox upgrade detection using build fingerprints and version checks * Clarified configuration restore behavior preserving user settings during rebuild/restore * Added gateway authentication as fifth security layer * Expanded Slack messaging validation with live credential checking * Enhanced troubleshooting guidance for Docker access, DNS issues, and sandbox recovery * Updated release notes for v0.0.63 featuring sandbox recovery and inference improvements <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
A host firewall that drops outbound DNS (tcp/udp:53) leaves the NemoClaw CLI unable to resolve the provider endpoint, but the existing container DNS probe still passes — so preflight printed
✓ Container DNS resolution worksand onboarding only failed much later at NVIDIA Endpoints validation with the crypticcurl: (6) Could not resolve host: integrate.api.nvidia.com. This adds a host-side DNS preflight that catches the blocked condition up front with targeted remediation.Related Issue
Fixes #4784
Changes
src/lib/onboard/preflight.ts— newprobeHostDns()/isFatalHostDnsProbeFailure(). The probe resolves the provider host from the CLI process usingdns.lookup(getaddrinfo), so it follows the same/etc/hosts/nsswitch path the later curl-based provider validation uses (won't false-fail hosts that reach the provider via/etc/hosts). Runs as a boundednode -echild so it stays synchronous and fully injectable for tests.src/lib/onboard/bridge-dns-preflight.ts—assertHostDnsHealthy()+printHostDnsRemediation(), invoked fromassertDockerBridgeAndContainerDnsHealthybefore the container DNS probe. Prints aHost DNS resolutionline distinct from the container one; on a fatal failure it fails early with host-DNS / VPN / firewall /iptables OUTPUTremediation.NEMOCLAW_PROVIDER, else the recorded session/registry provider (handles reruns and--resume, incl. the legacynvidia-nimalias), else the non-interactive default. Interactive runs (provider not yet chosen) and explicit local/non-NVIDIA providers skip it.HTTPS_PROXY/ALL_PROXY, honoringNO_PROXY).NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT=1.src/lib/onboard.tsstays net-neutral (4 single-line modifications, +4/−4); all logic lives undersrc/lib/onboard/.src/lib/onboard/host-dns-preflight.test.ts— new focused test file (25 tests) covering probe classification, remediation output, the fatal/warn/skip branches, provider gating, recorded-provider/nvidia-nimhandling, and proxy/NO_PROXYbehavior. (Kept out ofpreflight.test.ts, which is frozen at its CI line-size budget.)Type of Change
Verification
npm test(targeted:host-dns-preflight,preflight,bridge-dns-preflight,test-file-size-budget, onboard machine/handlers +onboard.test.ts— all pass; CLI type-check clean; Biome lint clean)sudo iptables -I OUTPUT -p {udp,tcp} --dport 53 -j DROP): default non-interactive onboard now aborts at the host DNS gate with remediation;NEMOCLAW_PROVIDER=ollamacorrectly skips it; an/etc/hosts-resolvable name passes even under the block. Firewall rules were trap-cleaned after each run.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Tests