fix(onboard): catch host DNS block before provider validation (#4784) #5041

Merged
cv merged 4 commits into NVIDIA:main from yimoj:fix/4784-host-dns-preflight
Jun 10, 2026

Conversation

@yimoj yimoj commented Jun 9, 2026

Contributor

Summary

A host firewall that drops outbound DNS (tcp/udp:53) leaves the NemoClaw CLI unable to resolve the provider endpoint, but the existing container DNS probe still passes — so preflight printed ✓ Container DNS resolution works and onboarding only failed much later at NVIDIA Endpoints validation with the cryptic curl: (6) Could not resolve host: integrate.api.nvidia.com. This adds a host-side DNS preflight that catches the blocked condition up front with targeted remediation.

Related Issue

Fixes #4784

Changes

  • src/lib/onboard/preflight.ts — new probeHostDns() / isFatalHostDnsProbeFailure(). The probe resolves the provider host from the CLI process using dns.lookup (getaddrinfo), so it follows the same /etc/hosts and nsswitch path that the later curl-based provider validation uses (it won't false-fail hosts that reach the provider via /etc/hosts). Runs as a bounded node -e child so it stays synchronous and fully injectable for tests.
  • src/lib/onboard/bridge-dns-preflight.ts: assertHostDnsHealthy() + printHostDnsRemediation(), invoked from assertDockerBridgeAndContainerDnsHealthy before the container DNS probe. Prints a Host DNS resolution line distinct from the container one; on a fatal failure it fails early with host-DNS / VPN / firewall / iptables OUTPUT remediation.
  • Scoped to the NVIDIA Endpoints path so it never blocks valid local/air-gapped/proxy onboarding:
    • Runs only when NVIDIA Endpoints is the effective provider: non-interactive NEMOCLAW_PROVIDER, else the recorded session/registry provider (handles reruns and --resume, incl. the legacy nvidia-nim alias), else the non-interactive default. Interactive runs (provider not yet chosen) and explicit local/non-NVIDIA providers skip it.
    • Skips when the provider host is reached via a configured proxy (HTTPS_PROXY/ALL_PROXY, honoring NO_PROXY).
    • Escape hatch: NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT=1.
  • src/lib/onboard.ts stays net-neutral (4 single-line modifications, +4/−4); all logic lives under src/lib/onboard/.
  • src/lib/onboard/host-dns-preflight.test.ts — new focused test file (25 tests) covering probe classification, remediation output, the fatal/warn/skip branches, provider gating, recorded-provider/nvidia-nim handling, and proxy/NO_PROXY behavior. (Kept out of preflight.test.ts, which is frozen at its CI line-size budget.)
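
The provider scoping described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `GateInput`, `effectiveProvider`, and `shouldRunHostDnsPreflight` are hypothetical names, the provider keys are the ones listed later in this conversation, and using `build` as the non-interactive default key is an assumption.

```typescript
// Hypothetical sketch: only the environment variable names and provider keys
// come from the PR text; every function and type name here is illustrative.
type GateEnv = Record<string, string | undefined>;

// Provider keys the PR discussion lists as backed by NVIDIA Endpoints.
const NVIDIA_ENDPOINT_PROVIDERS = new Set([
  "build", "cloud", "routed", "nvidia-prod", "nvidia-nim", "nvidia-router",
]);

// Assumption: "build" stands in for the non-interactive NVIDIA Endpoints default.
const ASSUMED_DEFAULT_PROVIDER = "build";

interface GateInput {
  env: GateEnv;
  isNonInteractive: boolean;
  recordedProvider?: string | null; // from a prior session/registry, if any
}

// Resolve the provider the run will effectively use, or null when it is not
// known yet (an interactive run where the user has not chosen one).
function effectiveProvider(input: GateInput): string | null {
  const explicit = input.env.NEMOCLAW_PROVIDER?.trim().toLowerCase();
  if (input.isNonInteractive && explicit) return explicit;
  if (input.recordedProvider) return input.recordedProvider.toLowerCase();
  return input.isNonInteractive ? ASSUMED_DEFAULT_PROVIDER : null;
}

// The host DNS preflight runs only for NVIDIA Endpoints providers and never
// when the escape hatch is set.
function shouldRunHostDnsPreflight(input: GateInput): boolean {
  if (input.env.NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT === "1") return false;
  const provider = effectiveProvider(input);
  return provider !== null && NVIDIA_ENDPOINT_PROVIDERS.has(provider);
}
```

Keeping the decision in a pure function of the environment and a few flags is what makes a gating matrix cheap to unit test; the proxy check is left out of this sketch.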

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm test (targeted: host-dns-preflight, preflight, bridge-dns-preflight, test-file-size-budget, onboard machine/handlers + onboard.test.ts — all pass; CLI type-check clean; Biome lint clean)
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Reproduced and re-verified end-to-end against a real host DNS block (sudo iptables -I OUTPUT -p {udp,tcp} --dport 53 -j DROP): default non-interactive onboard now aborts at the host DNS gate with remediation; NEMOCLAW_PROVIDER=ollama correctly skips it; an /etc/hosts-resolvable name passes even under the block. Firewall rules were trap-cleaned after each run.

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Added host-side DNS health checking during onboarding to detect DNS blocking issues earlier.
    • Provides targeted diagnostics and remediation guidance when host DNS fails.
    • Host DNS check can be skipped via environment variable if needed.
  • Tests

    • Added comprehensive test coverage for host DNS preflight validation.

…#4784)

A host OUTPUT-chain firewall that drops tcp/udp:53 (corporate firewall,
VPN kill-switch, or a literal `iptables -I OUTPUT -p udp --dport 53 -j
DROP`) leaves the NemoClaw CLI process unable to resolve the provider
endpoint. The existing container DNS probe still passes in that state —
docker's embedded resolver and bridge NAT take a different egress path —
so preflight printed "Container DNS resolution works" and onboarding only
failed much later at NVIDIA Endpoints validation with the cryptic
`curl: (6) Could not resolve host: integrate.api.nvidia.com`.

Add a host-side DNS preflight that resolves the provider endpoint from
the CLI process (via `dns.lookup`/getaddrinfo, so it follows the same
`/etc/hosts`/nsswitch path the later curl validation uses) and prints a
`Host DNS resolution` line distinct from the container one. On a fatal
failure it fails early with host-DNS/VPN/firewall/iptables remediation,
before provider validation.

The probe runs only when NVIDIA Endpoints is the effective provider:
non-interactive `NEMOCLAW_PROVIDER`, else the recorded session/registry
provider, else the non-interactive default. Interactive runs (provider
not yet chosen), explicit local/non-NVIDIA providers, proxied provider
hosts (HTTPS_PROXY honoring NO_PROXY), and `NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT=1`
all skip it so valid local/air-gapped/proxy onboarding is never blocked.

src/lib/onboard.ts stays net-neutral (logic lives in onboard/ modules).

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Walkthrough

This PR adds a host-side DNS preflight check to detect NVIDIA Endpoints provider hostname resolution failures before container DNS tests run. The feature includes a probe implementation, assertion logic with remediation output, integration into the onboarding flow, and comprehensive test coverage.

Changes

Host-side DNS preflight checks

| Layer / File(s) | Summary |
|---|---|
| Host DNS probe implementation and types (src/lib/onboard/preflight.ts) | Introduces probeHostDns to spawn a Node child process that resolves the NVIDIA Endpoints hostname, classifying outcomes into fatal (servers_unreachable, resolution_failed, timeout, killed) or inconclusive (error) states. Adds HostDnsProbeResult, ProbeHostDnsOpts, and isFatalHostDnsProbeFailure types/helpers. |
| Host DNS assertion, gating, and remediation (src/lib/onboard/bridge-dns-preflight.ts) | Extends assertDockerBridgeAndContainerDnsHealthy to accept provider context and gating parameters. Implements assertHostDnsHealthy to run the probe only when the effective provider is NVIDIA Endpoints (unless skipped via environment or proxy), exit early with remediation on fatal failures, and continue on inconclusive results. Adds printHostDnsRemediation for platform-specific remediation output. |
| Onboarding wiring for provider context (src/lib/onboard.ts) | Extends PreflightOptions with recordedProviderForHostDns and threads it through onboarding by passing it to assertDockerBridgeAndContainerDnsHealthy alongside isNonInteractive. The runPreflight dependency injection resolves the recorded provider from session or sandbox state. |
| Host DNS preflight test suite (src/lib/onboard/host-dns-preflight.test.ts) | Comprehensive tests for probeHostDns (classification, hostname validation), printHostDnsRemediation (platform-specific output), and assertHostDnsHealthy (provider selection, mode gating, environment control, proxy handling). |
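
A probe of the kind summarized in the table might look roughly like the sketch below. It reuses the names probeHostDns, HostDnsProbeResult, and isFatalHostDnsProbeFailure for readability, but the bodies are guesses, not the PR's code. In particular, the mapping from getaddrinfo error codes (`EAI_AGAIN`, `ENOTFOUND`) to the failure states, the hostname check, the 5000 ms default, and the `ChildResult`/`SpawnFn` types are all assumptions.

```typescript
import { spawnSync } from "node:child_process";

type HostDnsFailure =
  | "servers_unreachable" | "resolution_failed" | "timeout" | "killed" | "error";

type HostDnsProbeResult =
  | { ok: true; address: string }
  | { ok: false; reason: HostDnsFailure; detail: string };

// Subset of the spawnSync result that the classifier inspects (hypothetical).
interface ChildResult {
  status: number | null;
  signal: string | null;
  stdout: string;
  stderr: string;
  errorCode?: string; // set when the child could not run or timed out
}

type SpawnFn = (host: string, timeoutMs: number) => ChildResult;

// Child script: resolve argv[1] through dns.lookup (getaddrinfo) and print
// the address on stdout, or the error code on stderr with exit status 1.
const CHILD_SCRIPT = `
require("dns").lookup(process.argv[1], (err, address) => {
  if (err) { process.stderr.write(String(err.code)); process.exit(1); }
  process.stdout.write(address);
});`;

// Default spawner: a bounded, synchronous `node -e` child.
const realSpawn: SpawnFn = (host, timeoutMs) => {
  const r = spawnSync(process.execPath, ["-e", CHILD_SCRIPT, host], {
    encoding: "utf8",
    timeout: timeoutMs,
  });
  const err = r.error as (Error & { code?: string }) | undefined;
  return {
    status: r.status,
    signal: r.signal,
    stdout: r.stdout ?? "",
    stderr: r.stderr ?? "",
    errorCode: err ? err.code ?? "UNKNOWN" : undefined,
  };
};

function probeHostDns(
  host: string,
  timeoutMs = 5000,
  spawn: SpawnFn = realSpawn, // injectable so tests never start a process
): HostDnsProbeResult {
  const fail = (reason: HostDnsFailure, detail: string): HostDnsProbeResult =>
    ({ ok: false, reason, detail });
  // Assumption: reject anything that is not a plain hostname before it is
  // passed to the child as an argument.
  if (!/^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$/.test(host)) {
    return fail("error", "invalid hostname");
  }
  const r = spawn(host, timeoutMs);
  if (r.errorCode === "ETIMEDOUT") return fail("timeout", `no answer in ${timeoutMs} ms`);
  if (r.errorCode) return fail("error", r.errorCode); // e.g. child could not start
  if (r.signal) return fail("killed", r.signal);
  const out = r.stdout.trim();
  if (r.status === 0 && out !== "") return { ok: true, address: out };
  const code = r.stderr.trim();
  if (code === "EAI_AGAIN") return fail("servers_unreachable", code);
  if (code === "ENOTFOUND") return fail("resolution_failed", code);
  return fail("error", code || `exit status ${r.status}`);
}

// Everything except the inconclusive "error" state counts as fatal.
function isFatalHostDnsProbeFailure(result: HostDnsProbeResult): boolean {
  return !result.ok && result.reason !== "error";
}
```

Running the lookup in a child with a `timeout` keeps the caller synchronous while still bounding a resolver that hangs, and the injected `spawn` parameter lets tests feed in canned child results.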

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Suggested labels

fix, onboarding, area: networking, bug-fix

Suggested reviewers

  • laitingsheng
  • cv
  • sandl99

Poem

🐰 A rabbit hops through DNS gates,
Checking hosts before it's late,
NVIDIA paths now verified fast,
No more Docker dreams unsurpassed!
Resolution flows, remediation shows—
One hop closer, on it goes! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix(onboard): catch host DNS block before provider validation (#4784)' directly and clearly summarizes the main change: a host DNS check added to the onboarding flow that detects DNS blocks before provider validation occurs, matching the PR's core objective. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

6189-6199: Run the recommended onboarding E2E slices for this wiring change.

Since this change threads provider context through preflight in src/lib/onboard.ts, run the targeted nightly E2E subset to validate resume/non-interactive provider-gating behavior end to end.

As per coding guidelines, "src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 6189 - 6199, Change threads provider context
through preflight; run targeted nightly E2E slices to validate
resume/non-interactive provider-gating behavior by exercising the wiring around
runPreflight (preflight), assertDockerBridgeAndContainerDnsHealthy,
rejectUnsupportedContainerRuntime, and assertCdiNvidiaGpuSpecPresent;
specifically run the subset that covers resume paths, non-interactive flows,
provider-recorded session reads (readRecordedProvider), and Docker/Podman
runtime gating to ensure the new session?.provider propagation behaves correctly
end-to-end.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 6189-6199: Change threads provider context through preflight; run
targeted nightly E2E slices to validate resume/non-interactive provider-gating
behavior by exercising the wiring around runPreflight (preflight),
assertDockerBridgeAndContainerDnsHealthy, rejectUnsupportedContainerRuntime, and
assertCdiNvidiaGpuSpecPresent; specifically run the subset that covers resume
paths, non-interactive flows, provider-recorded session reads
(readRecordedProvider), and Docker/Podman runtime gating to ensure the new
session?.provider propagation behaves correctly end-to-end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15a037c5-ceb5-401a-b25a-b7dec35677d4

📥 Commits

Reviewing files that changed from the base of the PR and between 784f573 and 55417d9.

📒 Files selected for processing (4)
  • src/lib/onboard.ts
  • src/lib/onboard/bridge-dns-preflight.ts
  • src/lib/onboard/host-dns-preflight.test.ts
  • src/lib/onboard/preflight.ts

@yimoj

yimoj commented Jun 9, 2026

Contributor Author

Re: CodeRabbit nitpick on the src/lib/onboard.ts preflight wiring (run onboarding E2E for resume/non-interactive provider gating) — validated end-to-end:

  • Real-CLI E2E under a live host DNS block (sudo iptables -I OUTPUT -p {udp,tcp} --dport 53 -j DROP), going through runPreflight → preflight() → assertDockerBridgeAndContainerDnsHealthy(host, isNonInteractive(), recordedProviderForHostDns):
    • Default --non-interactive (NVIDIA Endpoints default): aborts at the new ✗ Host DNS resolution failed (could not resolve integrate.api.nvidia.com) gate at step [1/8], with remediation — before the misleading container-DNS line and before later provider validation.
    • NEMOCLAW_PROVIDER=ollama: host DNS gate is skipped, onboarding proceeds past preflight into Ollama setup (no NVIDIA-domain DNS dependency).
    • A /etc/hosts-resolvable name resolves OK even under the block, confirming the probe uses dns.lookup/getaddrinfo (same path as the later curl validation) and won't false-fail proxy or /etc/hosts setups.
    • Firewall rules trap-cleaned after each run (verified none left behind).
  • Unit coverage for the gating matrix in src/lib/onboard/host-dns-preflight.test.ts: non-interactive default → run; interactive/unset → skip; explicit + recorded providers (build/cloud/routed/nvidia-prod/nvidia-nim/nvidia-router → run; ollama-local/vllm-local/nim-local → skip); proxy/NO_PROXY; and the NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT escape hatch.
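
The proxy/NO_PROXY branch of that matrix could be modeled along these lines. This is a simplified sketch under stated assumptions: `firstSet`, `matchesNoProxy`, and `isReachedViaProxy` are hypothetical helpers, only suffix-style NO_PROXY entries and `*` are handled, and ports or CIDR entries are ignored.

```typescript
// Hypothetical helpers; only the variable names HTTPS_PROXY, ALL_PROXY, and
// NO_PROXY come from the PR text.
type ProxyEnv = Record<string, string | undefined>;

// First non-empty value among the given variables, checking both the
// upper-case and lower-case spellings.
function firstSet(env: ProxyEnv, names: string[]): string | undefined {
  for (const name of names) {
    const value = env[name] ?? env[name.toLowerCase()];
    if (value && value.trim() !== "") return value.trim();
  }
  return undefined;
}

// True when a NO_PROXY entry exempts `host` from proxying.
function matchesNoProxy(host: string, noProxy: string | undefined): boolean {
  if (!noProxy) return false;
  const target = host.toLowerCase();
  return noProxy
    .split(",")
    .map((entry) => entry.trim().toLowerCase())
    .filter((entry) => entry !== "")
    .some((entry) => {
      if (entry === "*") return true;
      // ".nvidia.com" and "*.nvidia.com" both reduce to "nvidia.com".
      const bare = entry.replace(/^\*?\./, "");
      return target === bare || target.endsWith("." + bare);
    });
}

// The host DNS check is irrelevant when the provider host goes through a
// proxy, because the proxy resolves the name instead of the host.
function isReachedViaProxy(host: string, env: ProxyEnv): boolean {
  const proxy = firstSet(env, ["HTTPS_PROXY", "ALL_PROXY"]);
  if (!proxy) return false;
  return !matchesNoProxy(host, firstSet(env, ["NO_PROXY"]));
}
```

With `HTTPS_PROXY` set and `NO_PROXY=.nvidia.com`, the sketch reports that `integrate.api.nvidia.com` is not proxied, so the host DNS check would still apply.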

src/lib/onboard.ts itself is net-neutral (4 single-line modifications) per the entrypoint budget; all logic lives under src/lib/onboard/.

@wscurran wscurran added the area: networking (DNS, proxy, TLS, ports, host aliases, or connectivity) and bug-fix (PR fixes a bug or regression) labels Jun 9, 2026
@wscurran

wscurran commented Jun 9, 2026

Contributor

@prekshivyas prekshivyas self-assigned this Jun 10, 2026

@prekshivyas prekshivyas left a comment

Contributor

LGTM. Correctly addresses #4784 by catching a blocked host DNS in preflight instead of late during provider validation — and it's appropriately fail-safe for a blocking check:

  • Only a fatal probe result aborts; inconclusive/transient failures warn-and-continue (!isFatalHostDnsProbeFailure → console.warn).
  • Skips when an HTTPS proxy is configured (provider reached via proxy, host DNS irrelevant) and via the NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT escape hatch.
  • Gated to NVIDIA-endpoint providers, so local/Ollama onboards aren't subjected to it.

The error path names the skip environment variable. CI is green and tests are present. Given the size (~833 LOC), the probe internals (probeHostDns/isFatalHostDnsProbeFailure) are worth a maintainer eyeball, but the decision-boundary behavior is sound and won't over-block.
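
The fail-safe decision boundary described in this review can be illustrated with a small sketch. It is not the PR's assertHostDnsHealthy: `ProbeOutcome`, `Reporter`, and `decideHostDns` are hypothetical, and the message strings are illustrative apart from the failure line quoted earlier in the conversation.

```typescript
// Hypothetical model of the abort / warn / pass boundary.
type ProbeOutcome =
  | { ok: true }
  | { ok: false; fatal: boolean; detail: string };

interface Reporter {
  log: (line: string) => void;
  warn: (line: string) => void;
}

type Decision = "pass" | "warn" | "abort";

function decideHostDns(outcome: ProbeOutcome, out: Reporter): Decision {
  if (outcome.ok) {
    out.log("✓ Host DNS resolution works");
    return "pass";
  }
  if (!outcome.fatal) {
    // Inconclusive or transient probe failures never block onboarding.
    out.warn(`Host DNS probe inconclusive (${outcome.detail}); continuing`);
    return "warn";
  }
  // Only a fatal result aborts, and the message names the escape hatch.
  out.log(`✗ Host DNS resolution failed (${outcome.detail})`);
  out.log("  Check host DNS, VPN, firewall, and iptables OUTPUT rules for port 53.");
  out.log("  Set NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT=1 to bypass this check.");
  return "abort";
}
```

Returning a decision instead of calling process.exit directly keeps the branch logic testable; a caller would translate "abort" into an early exit.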

@prekshivyas prekshivyas added the v0.0.63 Release target label Jun 10, 2026
prekshivyas and others added 3 commits June 9, 2026 18:34
Biome was collapsing wrapped console.log/warn calls; apply it so static-checks' formatter hook leaves the files unchanged. Formatting only.

Co-authored-by: yimoj <yimoj@nvidia.com>
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…IA#4784)

The biome-format pass expanded the threaded provider context into
multi-line calls, pushing src/lib/onboard.ts net +19 and failing the
codebase-growth-guardrails entrypoint budget.

Keep onboard.ts net-neutral: pass only isNonInteractive() to the host
DNS gate (one short line) and drop the recorded-provider threading
through preflight() / the resume backstop. The host DNS check now gates
purely on non-interactive NEMOCLAW_PROVIDER (unset → NVIDIA Endpoints
default), which still catches the reporter's `nemoclaw onboard
--non-interactive` repro and skips explicit local/non-NVIDIA providers
and interactive runs. Simplify the gate accordingly (no providerKey
param / recorded-name classification), since nothing wires it now.

The fresh, no-recorded-state repro is unaffected (recorded provider was
null there anyway); non-interactive local reruns set NEMOCLAW_PROVIDER,
and NEMOCLAW_SKIP_HOST_DNS_PREFLIGHT remains the escape hatch.

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@cv cv merged commit 708020c into NVIDIA:main Jun 10, 2026
37 checks passed
miyoungc added a commit that referenced this pull request Jun 11, 2026
## Summary
- Add the v0.0.63 release-note section using the published development
note as source context.
- Update source docs for sandbox recovery, OpenClaw config restore
safety, managed vLLM selection, Slack Socket Mode conflict handling, and
host diagnostics.
- Refresh generated `nemoclaw-user-*` skills from the updated Fern MDX
docs.
- Update the release-doc refresh skill so post-release docs for version
`n` look up the matching announcement discussion and use the `n+1` patch
release label.
- Fix CLI/docs parity by avoiding a `--from <Dockerfile>` flag mention
inside the `upgrade-sandboxes` command section.

## Source summary
- #5034 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document safer stale-sandbox recovery
through `rebuild --yes` before recreating from scratch.
- #5091 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Docker-driver post-reboot
recovery from OpenShell container labels.
- #5101, #5174, #5177 -> `docs/manage-sandboxes/backup-restore.mdx`,
`docs/about/release-notes.mdx`: Document OpenClaw `openclaw.json`
preservation, merge behavior, and fail-safe restore handling.
- #5102 -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`,
`docs/manage-sandboxes/lifecycle.mdx`, `docs/about/release-notes.mdx`:
Document `upgrade-sandboxes` image-fingerprint drift detection.
- #4201 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document the installer diagnostic for
unexpected Docker daemon access outside the `docker` group.
- #5038 -> `docs/inference/inference-options.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/about/release-notes.mdx`: Document the interactive managed-vLLM
model picker and non-interactive override behavior.
- #5040, #5041 -> `docs/reference/troubleshooting.mdx`,
`docs/about/release-notes.mdx`: Document Ollama auth-proxy recovery and
host DNS preflight diagnostics.
- #4986, #5039 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`docs/about/release-notes.mdx`: Document Slack validation and duplicate
Slack Socket Mode sandbox handling.
- #4981, #5168 -> `docs/about/release-notes.mdx`: Capture Hermes gateway
secret-guard and wrapped-argv startup hardening in the release surface.
- Follow-up ->
`.agents/skills/nemoclaw-contributor-update-docs/SKILL.md`: Record the
post-release docs workflow, discussion-announcement lookup, and
next-patch release label rule.
- Follow-up -> `docs/reference/commands.mdx`,
`docs/reference/commands-nemohermes.mdx`: Reword custom Dockerfile
sandbox text so CLI parity does not treat `--from` as an
`upgrade-sandboxes` flag.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `npm run docs`
- `npm run build:cli`
- `bash test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
- Skip-term scan for `docs/.docs-skip` blocked terms across generated
user skills

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced local inference setup with interactive model selection
prompts and environment variable overrides
* Improved sandbox upgrade detection using build fingerprints and
version checks
* Clarified configuration restore behavior preserving user settings
during rebuild/restore
  * Added gateway authentication as fifth security layer
  * Expanded Slack messaging validation with live credential checking
* Enhanced troubleshooting guidance for Docker access, DNS issues, and
sandbox recovery
* Updated release notes for v0.0.63 featuring sandbox recovery and
inference improvements

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Labels

  • area: networking (DNS, proxy, TLS, ports, host aliases, or connectivity)
  • bug-fix (PR fixes a bug or regression)
  • v0.0.63 (Release target)

Development

Successfully merging this pull request may close these issues.

[All Platforms][Install] Host DNS blocking via iptables OUTPUT is not caught in preflight; failures only surface during provider validation

4 participants