fix(runtime): repair stale inference DNS routes #3267

Merged: ericksoa merged 6 commits into main from fix/inference-local-dns-recovery on May 9, 2026

@ericksoa ericksoa commented May 8, 2026

Summary

  • repair stale sandbox inference.local DNS proxy routes during connect/probe when the sandbox has a persisted inference provider/model
  • prefer the stable kube-dns service IP for the sandbox DNS proxy, with CoreDNS endpoint fallback
  • route nemohermes uninstall as a global uninstall command
  • omit the Brave policy preset when the selected sandbox image does not support NemoClaw's Brave web-search path, while preserving OpenClaw behavior and sandbox custom presets with colliding names
  • document NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR as the troubleshooting escape hatch for automatic DNS-proxy repair
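The upstream preference described in the second bullet can be sketched as follows. This is a minimal illustration, not the code in src/lib/actions/dns/index.ts: `pickDnsUpstream`, `usableIp`, and `PUBLIC_FALLBACK_DNS` are hypothetical names, the inputs are assumed to be the raw strings returned by the two kubectl lookups, and the final 8.8.8.8 fallback follows the preference order quoted in the review notes on this PR. The real implementation also rejects unsafe upstreams, which this sketch only approximates by requiring an IP literal.

```typescript
import { isIP } from "node:net";

// Assumption: last-resort public resolver, per the order quoted in the review notes.
const PUBLIC_FALLBACK_DNS = "8.8.8.8";

/** Returns the trimmed value when it is an IP literal, otherwise undefined. */
function usableIp(raw: string | undefined): string | undefined {
  const value = (raw ?? "").trim();
  // Headless services report the literal string "None" as their clusterIP.
  if (value === "" || value.toLowerCase() === "none") return undefined;
  return isIP(value) !== 0 ? value : undefined;
}

/** Prefers the stable service ClusterIP, then an endpoint IP, then the public fallback. */
function pickDnsUpstream(
  serviceClusterIp: string | undefined,
  endpointIps: readonly string[],
): string {
  const fromService = usableIp(serviceClusterIp);
  if (fromService !== undefined) return fromService;
  // Sketch-only choice: take the first endpoint entry that parses as an IP.
  for (const candidate of endpointIps) {
    const ip = usableIp(candidate);
    if (ip !== undefined) return ip;
  }
  return PUBLIC_FALLBACK_DNS;
}
```

The service ClusterIP is tried first because it stays stable across CoreDNS pod restarts, whereas an endpoint IP is tied to a specific pod.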

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/policy-tiers-onboard.test.ts test/onboard-preset-diff.test.ts
  • npx vitest run test/onboard.test.ts -t "computeSetupPresetSuggestions|agentSupportsWebSearch|configureWebSearch|prints numbered step headers"
  • npx vitest run test/sandbox-connect-inference.test.ts src/lib/actions/dns/index.test.ts test/nemohermes-alias.test.ts test/uninstall.test.ts src/lib/actions/sandbox/oclif-command-adapters.test.ts src/lib/commands/simple-global-oclif-adapters.test.ts
  • git diff --check

coderabbitai Bot commented May 8, 2026

Walkthrough

This PR prefers kube-dns service IP for DNS upstream with endpoint fallback, adds inference.local health probing and conditional DNS-proxy repair in sandbox connect, and makes onboarding policy preset selection agent-aware with exclusions and tests.

Changes

DNS Upstream Discovery and Inference Route Repair

| Layer | File(s) | Summary |
| --- | --- | --- |
| DNS Upstream Discovery Logic | src/lib/actions/dns/index.ts | runSetupDnsProxy now queries kube-dns service clusterIP first and falls back to the first endpoint IP if the service lookup is empty/none. |
| DNS Upstream Unit Tests | src/lib/actions/dns/index.test.ts | Tests updated to validate service-IP preference, endpoint fallback when service is empty, and unsafe upstream rejection sourced from the service lookup. |
| DNS Shell Integration Tests | test/dns-proxy.test.ts | E2E shell test now asserts kubectl get service kube-dns is called and kubectl get endpoints kube-dns is not called. |
| Inference Route Constants | src/lib/actions/sandbox/connect.ts | Adds NEMOCLAW_GATEWAY_NAME and INFERENCE_ROUTE_PROBE_TIMEOUT_MS constants. |
| Inference Route Verification and Repair Helpers | src/lib/actions/sandbox/connect.ts | New helpers probe inference.local/v1/models, optionally run DNS-proxy repair (env-disableable), reconcile live vs persisted inference config, and attempt inference set --no-verify; failures are non-fatal. |
| Inference Route Connect Integration | src/lib/actions/sandbox/connect.ts | Sandbox connect probe now calls ensureSandboxInferenceRoute(sandboxName) in both success and failure paths and replaces prior inline swapping logic. |
| Inference Route Integration Tests | test/sandbox-connect-inference.test.ts | Test harness extended with fake openshell/docker stubs and state tracking; new test verifies DNS-proxy repair triggers when inference-local returns 503 during connect. |

Agent-Aware Policy Preset Filtering

| Layer | File(s) | Summary |
| --- | --- | --- |
| Preset Filtering Logic and Exclusion Map | src/lib/onboard.ts | Introduces AGENT_POLICY_PRESET_EXCLUSIONS, filterPolicyPresetsForAgent, and resolvePolicyPresetAgentName to normalize agent name and filter presets. |
| Preset Suggestion Refinement | src/lib/onboard.ts | computeSetupPresetSuggestions accepts optional knownPresetNames allowlist and filters tier defaults accordingly. |
| Setup Flow Agent Integration | src/lib/onboard.ts | setupPoliciesWithSelection gains agentName option and uses an agent-filtered preset universe; onboarding resume passes agentName into setup. |
| Agent Filter Unit Tests | test/onboard.test.ts | Tests add filterPolicyPresetsForAgent helper and verify knownPresetNames and agent-specific exclusions (e.g., Hermes excludes brave). |
| Agent Filter Integration Tests | test/policy-tiers-onboard.test.ts | Integration tests validate Hermes excludes brave, removes it if present, and clamps resume selections to allowed presets; also test preserving custom brave. |
| Public API Export | src/lib/onboard.ts | filterPolicyPresetsForAgent exported from module. |

Uninstall Command Branding Fix

| Layer | File(s) | Summary |
| --- | --- | --- |
| Command Usage Metadata | src/commands/uninstall.ts | Uninstall command usage string hardcoded to "nemoclaw uninstall" (removed CLI_NAME usage). |
| Uninstall Help Text Routing Test | test/nemohermes-alias.test.ts | New test ensures nemohermes uninstall --help routes to Hermes uninstaller and shows expected help text. |

Documentation: Inference Route Repair Flag

| Layer | File(s) | Summary |
| --- | --- | --- |
| Commands Reference Docs | .agents/skills/nemoclaw-user-reference/references/commands.md, docs/reference/commands.md | Documents NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR to skip automatic DNS-proxy repair for inference.local during connect flows. |

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant CLI
  participant Sandbox
  participant Openshell
  participant Docker
  participant DNSProxy
  User->>CLI: run sandbox connect
  CLI->>Sandbox: probe gateway/status
  CLI->>Openshell: sandbox exec -> GET inference.local/v1/models
  Openshell-->>CLI: probe response (200/503)
  alt probe unhealthy
    CLI->>Docker: kubectl get service kube-dns
    Docker-->>CLI: service IP / empty
    alt service empty
      CLI->>Docker: kubectl get endpoints kube-dns
      Docker-->>CLI: endpoint IP
    end
    CLI->>DNSProxy: runSetupDnsProxy -> repair
    DNSProxy-->>CLI: repaired
  end
  CLI->>User: connect result (non-fatal on repair failures)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: repairing stale inference DNS routes during sandbox connect/probe operations, which is the primary technical objective across multiple file changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8603-8605: The code computes policyPresetAgentName, allPresets
(via filterPolicyPresetsForAgent(policies.listPresets(), policyPresetAgentName))
and applied (policies.getAppliedPresets(sandboxName)) but later logic can
reintroduce excluded presets into chosen via previously applied or
resume-selected state; update the selection flow so that any source of presets
(applied, resume-selected, or other previously persisted lists) is intersected
with allPresets before building chosen — specifically clamp
policies.getAppliedPresets(sandboxName) and any resume/restore logic to only
include items present in the allPresets set (use resolvePolicyPresetAgentName,
filterPolicyPresetsForAgent, policies.listPresets(), and the chosen assignment
points as anchors).

In `@test/onboard.test.ts`:
- Around line 55-58: The type guard isOnboardTestInternals currently doesn't
verify the newly added member filterPolicyPresetsForAgent, so objects can be
narrowed while that property is undefined; update isOnboardTestInternals to
check that typeof value.filterPolicyPresetsForAgent === "function" (in
addition to the existing checks) so that OnboardTestInternals-typed values
truly provide filterPolicyPresetsForAgent before any code calls it.
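
The guard fix requested for test/onboard.test.ts can be illustrated with a reduced example. This is a sketch under assumptions: the real OnboardTestInternals interface has more members than shown here, and listing computeSetupPresetSuggestions as one of them is an illustrative guess rather than something this thread confirms.

```typescript
// Reduced, hypothetical shape; the real interface in the test file is larger.
interface OnboardTestInternals {
  computeSetupPresetSuggestions: (...args: unknown[]) => unknown;
  filterPolicyPresetsForAgent: (...args: unknown[]) => unknown;
}

/** Narrows only when every member the tests call is actually a function. */
function isOnboardTestInternals(value: unknown): value is OnboardTestInternals {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.computeSetupPresetSuggestions === "function" &&
    // The check the review asks for: without it, a module that stops exporting
    // the helper still passes the guard and fails later at the call site.
    typeof candidate.filterPolicyPresetsForAgent === "function"
  );
}
```

With the extra check in place, a missing export makes the guard itself return false, so the test fails at the guard rather than at the first use of the helper.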

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a748ba7a-d217-4570-bb36-29163776932a

📥 Commits

Reviewing files that changed from the base of the PR and between b1320d5 and 9610b04.

📒 Files selected for processing (10)
  • src/commands/uninstall.ts
  • src/lib/actions/dns/index.test.ts
  • src/lib/actions/dns/index.ts
  • src/lib/actions/sandbox/connect.ts
  • src/lib/onboard.ts
  • test/dns-proxy.test.ts
  • test/nemohermes-alias.test.ts
  • test/onboard.test.ts
  • test/policy-tiers-onboard.test.ts
  • test/sandbox-connect-inference.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread test/onboard.test.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 8588-8607: clampPolicyPresetNames currently filters out presets
solely by name using isPolicyPresetExcludedForAgent, which wrongly drops sandbox
custom presets that share excluded names; update clampPolicyPresetNames to
accept the sandboxName (or otherwise detect custom origin) and build a set of
custom preset names (via policies.listCustomPresets(sandboxName)) so that if a
preset name is a sandbox custom preset it is preserved even if
isPolicyPresetExcludedForAgent(name, agentName) would exclude it; apply the same
change to the other occurrences noted (around the blocks at 8638-8645 and
10319-10326) and ensure callers are updated to pass sandboxName so
syncPresetSelection and any resume/non-interactive code keep user-selected
custom presets.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 40fd1d92-d6b4-44a5-9287-72de0ae91754

📥 Commits

Reviewing files that changed from the base of the PR and between 5c91b8d and 3df09bd.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated
@ericksoa ericksoa requested a review from jyaunches May 8, 2026 17:43
@jyaunches jyaunches self-assigned this May 8, 2026
@jyaunches

PR Review Notes

Triaged this PR in a fresh worktree. Overall the DNS repair work is solid, but there are a few blockers to resolve before merge — chiefly a supersession question against your own #3223, a rebase against current main, and one custom-preset correctness bug.

🔴 Blockers (must fix before merge)

  1. Merge conflicts with main. mergeStateStatus: DIRTY, branch is 4 commits behind origin/main. Rebase is needed, and in particular please verify that hasChatCompletionsToolCall / requireChatCompletionsToolCalling: true on the Ollama validation path (added by #2737, "fix: complete #2731, #2697, and #2727 reliability fixes") survives the rebase; the two-dot diff currently shows them being "removed" as an artifact of the stale base.

  2. Supersession vs. PR #3223 (fix: skip Brave policy preset for unsupported agents). #3223 targets the same Hermes-excludes-Brave problem with a different, more extensible, capability-based approach (webSearchSupported: boolean threaded through computeSetupPresetSuggestions / setupPoliciesWithSelection). This PR uses a hardcoded AGENT_POLICY_PRESET_EXCLUSIONS: { hermes: ["brave"] } table. Both cannot land. My recommendation: take #3223's capability-based API and drop the exclusion table from this PR, so adding a new agent that doesn't support Brave is "set agentSupportsWebSearch to false" rather than "edit a CLI-side agent-name table."

  3. clampPolicyPresetNames silently drops custom presets whose name collides with an exclusion. listPolicyPresetsForAgent() intentionally keeps policies.listCustomPresets(sandboxName), but clampPolicyPresetNames() filters purely by name — so a user's custom brave preset in a Hermes sandbox is interactively selectable, then wiped by resume / non-interactive paths. clampPolicyPresetNames needs to be source-aware (accept sandboxName and skip exclusion for names that are registered custom presets). CodeRabbit flagged the same issue.

🟡 Warnings (should fix)

  1. src/lib/onboard.ts monolith growth: +69 net lines (10,488 → 10,557). This is over the +20-line push-back threshold we're holding refactoring PRs to. The new helpers (AGENT_POLICY_PRESET_EXCLUSIONS, isPolicyPresetExcludedForAgent, filterPolicyPresetsForAgent, resolvePolicyPresetAgentName, listPolicyPresetsForAgent, clampPolicyPresetNames) are a coherent policy-domain concept and belong in src/lib/policies.ts or a new src/lib/policies/agent-exclusions.ts with co-located tests. They're extraction-ready. Ref #2306 (Architectural improvement: Extract rebuild recreate path from onboard monolith and canonicalize credential resolution).

  2. Exclusions can leak back in reuse / resume flows. allPresets is filtered at the top, but chosen can be re-populated from previously-applied state or resume-selected presets before the final syncPresetSelection call, so Hermes may end up re-applying brave across a resume. Either clamp at every mutation site of chosen, or clamp once at the bottom just before the sync call. (Also noted by CodeRabbit.)

  3. test/onboard.test.ts:58: isOnboardTestInternals runtime guard not updated for the new filterPolicyPresetsForAgent member. The type and destructuring were updated but the typeof value.filterPolicyPresetsForAgent === "function" check is missing from the narrowing guard. If the module ever stops exporting it, the test crashes at use-site instead of failing the guard clearly.

  4. src/lib/actions/sandbox/connect.ts: new sh -c "..." shell-string probe. Not a security regression (all args are literal), but it's a new shell-string callsite for #1889 (fix(security): migrate remaining shell-string callsites to argv arrays). Please either factor isSandboxInferenceRouteHealthy into an argv form or leave a comment explaining why the case / output capture makes sh -c unavoidable here.

  5. Scope creep. The PR title is DNS-only but the branch contains three concerns: DNS repair, Hermes/Brave preset clamping, and nemohermes uninstall routing. At minimum the uninstall commit is unrelated and would merge faster as its own PR. The DNS repair commit (9610b04a2) is the cleanest piece here and could land on its own tomorrow if separated from the policy-preset work.

  6. src/commands/uninstall.ts: CLI_NAME replaced with literal "nemoclaw uninstall". If the intent is to force NemoClaw branding even under the nemohermes alias, that should be justified in a comment. As written it reads like a regression against #3220's branding work (fix: brand NemoHermes uninstall goodbye).

🔵 Suggestions

  1. DNS upstream discovery — consider also accepting kube-dns / coredns in namespaces other than kube-system for future flexibility. Not blocking; matches existing endpoints-fallback scope.

  2. Document NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR. Good escape hatch; please mention it in the PR body and a docs file so operators can discover it.

  3. Colocate INFERENCE_ROUTE_PROBE_TIMEOUT_MS = 10_000 with OPENSHELL_PROBE_TIMEOUT_MS in openshell/timeouts.ts. Makes future timeout tuning one-file.

✅ What's Good

  • The DNS preference (kube-dns service ClusterIP → CoreDNS endpoint IP → 8.8.8.8) is a real, defensible improvement — service IPs survive pod rotation, endpoint IPs don't.
  • The behavioral test in test/sandbox-connect-inference.test.ts is excellent: it stubs docker + openshell, uses a real execFileSync of the connect path, and asserts on both docker-call sequencing (new path preferred, fallback not taken) and user-visible log strings. Right level of test for this logic short of full E2E.
  • ensureSandboxInferenceRoute extraction is clean — one function called at all three probe outcomes (running / recovered / failed), with consistent quiet handling. Good internal refactor even without the agent-exclusion work.
  • The repair → re-probe → succeed-or-warn flow fails open with a clear warning instead of blocking connect. Correct choice for a DNS-recovery fix.
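
The repair, re-probe, succeed-or-warn flow praised above can be sketched as a small fail-open helper. This is an illustration under assumptions, not the body of ensureSandboxInferenceRoute: the function name, the result shape, and the injected `probe` / `repair` callbacks are hypothetical, and treating the value "1" as the disabling value for NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR is inferred from the E2E scenario in this review rather than from the code.

```typescript
interface RouteCheckResult {
  healthy: boolean;
  repairAttempted: boolean;
  warning?: string; // set whenever connect should continue with a warning
}

function ensureRouteFailOpen(
  probe: () => boolean, // true when inference.local answers as expected
  repair: () => void, // re-runs the DNS proxy setup; may throw
  env: Record<string, string | undefined>,
): RouteCheckResult {
  if (probe()) return { healthy: true, repairAttempted: false };

  // Assumption: "1" disables the automatic repair.
  if (env.NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR === "1") {
    return { healthy: false, repairAttempted: false, warning: "route repair disabled by environment" };
  }

  try {
    repair();
  } catch (err) {
    // Fail open: a broken repair must not block the connect itself.
    return { healthy: false, repairAttempted: true, warning: `repair failed: ${String(err)}` };
  }

  // Re-probe so that success is only reported when the route really recovered.
  return probe()
    ? { healthy: true, repairAttempted: true }
    : { healthy: false, repairAttempted: true, warning: "route still unavailable after repair" };
}
```

Every branch returns a result instead of throwing, which is what lets the caller print a warning and carry on with the connect.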

Test depth recommendation: 🔴 E2E required

The DNS repair path runs a real-world sequence — openshell sandbox exec → sh -c → curl → inference.local → DNS proxy repair → re-probe. The stubbed behavioral test proves the CLI makes the right calls in the right order, but it cannot prove the repair actually fixes the broken DNS route on a real cluster — which is the whole point of the change.

Concrete suggested E2E scenario to add before merge:

  1. Create a sandbox, run nemoclaw sandbox connect once to confirm inference.local works.
  2. Artificially break the DNS proxy (e.g., rewrite /etc/resolv.conf inside the sandbox, or restart DNS proxy with a bogus upstream).
  3. Run nemoclaw sandbox connect again.
  4. Assert stdout contains inference.local is unavailable inside '<name>'. Repairing sandbox DNS proxy... followed by inference.local route repaired., and that a subsequent openshell sandbox exec -- curl -sk https://inference.local/v1/models returns HTTP 200.
  5. Negative case: set NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR=1 and verify repair is skipped.

Trigger via nightly-dispatch.yml with a test-filter targeting the new scenario, or run locally on sparky.

@jyaunches

🤖 E2E Advisor Recommendation

Ran the e2e-advisor on this PR from my fork (workflow not yet available upstream). Full run: jyaunches/NemoClaw#25580284587

Recommended E2E jobs to run before merge

High-priority (required):

  • sandbox-operations-e2e — exercises the new inference.local probe + DNS-proxy auto-repair on connect / --probe-only (this is the core change)
  • hermes-e2e — covers the new AGENT_POLICY_PRESET_EXCLUSIONS map (hermes drops brave)
  • cloud-onboard-e2e — policy preset validation + resume reapply path changed

Medium-priority:

  • inference-routing-e2e — DNS proxy upstream selection + inference route swap both affect routing/credential isolation
  • cloud-inference-e2e — the kube-dns service → endpoints change alters the forwarder's upstream; could silently break inference.local resolution

Suggested dispatch

Workflow: .github/workflows/nightly-e2e.yaml
jobs: sandbox-operations-e2e,inference-routing-e2e,cloud-inference-e2e,hermes-e2e,cloud-onboard-e2e

Coverage gaps flagged (worth adding)

  1. sandbox-connect DNS-repair path: no existing E2E exercises it. Suggested scenario: start a sandbox, disrupt the in-sandbox DNS proxy (kill /tmp/dns-proxy.py or clobber /etc/resolv.conf), run nemoclaw <name> connect --probe-only, and assert curl https://inference.local/v1/models recovers.
  2. hermes-onboard preset exclusion: AGENT_POLICY_PRESET_EXCLUSIONS has no dedicated test. A future agent added to the map or a regression removing the filter would silently re-allow disallowed presets. Suggested: non-interactive Hermes onboard with NEMOCLAW_POLICY_PRESETS=brave,pypi,npm and assert brave is dropped from the Hermes sandbox but still allowed on an openclaw onboard.

Optional (lower priority)

onboard-resume-e2e, sandbox-survival-e2e, issue-2478-crash-loop-recovery-e2e, hermes-discord-e2e, docs-validation-e2e


Generated by the E2E advisor prototype (deterministic + Pi semantic analysis). Static diff analysis only — no PR code executed.

ericksoa added 3 commits May 8, 2026 21:26
Preserve custom policy presets when clamping agent exclusions.
Move setup policy preset filtering into the policy domain and drive Brave omission from the web-search support signal instead of an agent-name exclusion table. Preserve sandbox custom presets whose names collide with unsupported built-ins across resume and non-interactive flows.

Document the intentional sandbox shell probe shape and the global nemohermes uninstall routing.
@ericksoa

ericksoa commented May 9, 2026

@jyaunches I went through both of your comments and addressed them point by point.

PR Review Notes

Blockers

  1. Merge conflicts with current main
    Resolved. I merged current origin/main into the PR branch in 9a3739336. The only conflict surface was the test/onboard.test.ts internals guard, and the resolved guard keeps all three relevant checks: hasChatCompletionsToolCall, hasChatCompletionsToolCallLeak, and filterSetupPolicyPresets. The PR is mergeable against current main after the merge.

  2. Supersession vs. #3223 (fix: skip Brave policy preset for unsupported agents)
    Addressed in a0426e414. I took the #3223 direction and removed the hardcoded AGENT_POLICY_PRESET_EXCLUSIONS / agent-name filtering table from this PR. Brave is now filtered from setup policy presets through the existing agentSupportsWebSearch(...) capability signal, threaded as webSearchSupported into computeSetupPresetSuggestions(...) and setupPoliciesWithSelection(...). This means the decision is capability-based rather than tied to a CLI-side agent-name table.

    I also moved the setup preset support helpers into the policy domain in src/lib/policies.ts: setupPolicyPresetSupported, filterSetupPolicyPresets, listSetupPolicyPresets, and clampSetupPolicyPresetNames. That keeps the policy filtering/clamping logic out of the onboard monolith.

  3. Custom preset collision in clampPolicyPresetNames
    Addressed. The clamp path is now source-aware: clampSetupPolicyPresetNames(...) receives the sandbox's custom preset names and preserves those names even when the same name would be unsupported as a built-in. I added coverage for both resume and non-interactive flows where a sandbox custom preset named brave must survive while the built-in brave preset is removed when web search is unsupported.
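
The source-aware clamp described in item 3 can be sketched like this. It is a minimal illustration, not the clampSetupPolicyPresetNames implementation in src/lib/policies.ts, whose signature is not shown in this thread: `clampPresetNames`, `ClampOptions`, and `WEB_SEARCH_PRESETS` are hypothetical names, and the sketch assumes "brave" is the only built-in preset gated on web-search support.

```typescript
// Assumption: the only built-in preset that depends on web-search support.
const WEB_SEARCH_PRESETS: ReadonlySet<string> = new Set(["brave"]);

interface ClampOptions {
  webSearchSupported: boolean;
  customPresetNames: readonly string[]; // presets registered on this sandbox
}

/** Drops unsupported built-ins but always keeps the sandbox's own custom presets. */
function clampPresetNames(names: readonly string[], options: ClampOptions): string[] {
  const custom = new Set(options.customPresetNames);
  return names.filter((name) => {
    // A custom preset wins even when its name collides with a gated built-in.
    if (custom.has(name)) return true;
    if (!options.webSearchSupported && WEB_SEARCH_PRESETS.has(name)) return false;
    return true;
  });
}
```

Checking the custom set before the capability gate is the whole fix: a purely name-based filter cannot tell a user's own brave preset from the built-in one.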

Warnings

  1. src/lib/onboard.ts growth / policy-domain extraction
    Addressed in a0426e414. The new setup preset helpers are now in src/lib/policies.ts, and src/lib/onboard.ts now calls into the policy module instead of carrying an agent-exclusion helper cluster inline.

  2. Exclusions leaking back through reuse / resume
    Addressed. Current applied presets, recorded resume presets, and explicit selected resume presets are clamped through the same support-aware helper before syncPresetSelection(...). I added a specific unsupported-only resume test so selectedPresets: ["brave"] with webSearchSupported: false removes Brave instead of falling back to tier defaults.

  3. isOnboardTestInternals guard
    Addressed. The guard now checks the exported preset-filter helper (filterSetupPolicyPresets) before tests destructure/use it.

  4. New sh -c inference probe
    Addressed. I left it as sh -c because the curl write-out, response body capture, and status classification need to execute inside the sandbox as one bounded probe; I added a comment explaining that and noting that sandboxName remains an argv value, so user input is not interpolated into the shell script.

  5. Scope creep
    I did not split the PR at this point. The branch still contains DNS route repair, policy preset capability filtering, and nemohermes uninstall routing, but each path now has focused tests and the policy work has been aligned with #3223's capability-based shape. The uninstall change is intentionally small and directly covered by test/nemohermes-alias.test.ts / test/uninstall.test.ts.

  6. src/commands/uninstall.ts literal usage string
    Addressed. I added an inline comment explaining that the usage is intentionally global under the nemohermes alias because nemohermes uninstall is the package uninstaller, not a sandbox-scoped action.
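
The argument made under warning 4, that a fixed sh -c script is safe as long as variable values travel as argv, can be demonstrated in isolation. This sketch is not the probe from connect.ts: it runs a local `sh` rather than `openshell sandbox exec`, and `PROBE_SCRIPT` and `runFixedScript` are hypothetical names. It assumes a POSIX `sh` is on the PATH.

```typescript
import { execFileSync } from "node:child_process";

// The script text is a constant; nothing user-controlled is concatenated into it.
const PROBE_SCRIPT = 'printf "probing %s" "$1"';

function runFixedScript(value: string): string {
  // argv layout for `sh -c`: the script, then $0, then $1.
  // The value reaches the shell as data in "$1", so it is never parsed as shell syntax.
  return execFileSync("sh", ["-c", PROBE_SCRIPT, "sh", value], { encoding: "utf8" });
}
```

Passing a value such as `demo; echo pwned` returns it verbatim inside the output instead of executing the second command, which is the property the inline comment in the PR relies on.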

Suggestions

  1. Alternate CoreDNS namespaces
    Not changed in this patch. The current PR keeps the existing kube-system/kube-dns contract and only changes the preference order from endpoint IP first to service ClusterIP first with endpoint fallback.

  2. Document NEMOCLAW_DISABLE_INFERENCE_ROUTE_REPAIR
    Addressed in 5c25526a and reflected in the PR body. The env var is now documented in docs/reference/commands.md and the generated user-reference skill mirror.

  3. Move inference route timeout constant
    Addressed in a0426e414. OPENSHELL_INFERENCE_ROUTE_PROBE_TIMEOUT_MS now lives next to the other OpenShell timeout constants in src/lib/adapters/openshell/timeouts.ts.

E2E Advisor Recommendation

I dispatched the exact recommended targeted nightly run on this PR branch:

nightly-e2e jobs: sandbox-operations-e2e,inference-routing-e2e,cloud-inference-e2e,hermes-e2e,cloud-onboard-e2e

Run: https://github.com/NVIDIA/NemoClaw/actions/runs/25592421054

Local validation after the reviewer-response patch

  • npm run build:cli
  • npm run typecheck:cli
  • npx vitest run test/policy-tiers-onboard.test.ts test/onboard-preset-diff.test.ts
  • npx vitest run test/onboard.test.ts -t "computeSetupPresetSuggestions|agentSupportsWebSearch|configureWebSearch|prints numbered step headers"
  • npx vitest run test/sandbox-connect-inference.test.ts src/lib/actions/dns/index.test.ts test/nemohermes-alias.test.ts test/uninstall.test.ts src/lib/actions/sandbox/oclif-command-adapters.test.ts src/lib/commands/simple-global-oclif-adapters.test.ts
  • git diff --check

@github-actions

github-actions Bot commented May 9, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 25592421054
Branch: fix/inference-local-dns-recovery
Requested jobs: sandbox-operations-e2e,inference-routing-e2e,cloud-inference-e2e,hermes-e2e,cloud-onboard-e2e
Summary: 5 passed, 0 failed, 0 skipped

| Job | Result |
| --- | --- |
| cloud-inference-e2e | ✅ success |
| cloud-onboard-e2e | ✅ success |
| hermes-e2e | ✅ success |
| inference-routing-e2e | ✅ success |
| sandbox-operations-e2e | ✅ success |

@ericksoa ericksoa merged commit c4aaec3 into main May 9, 2026
64 checks passed
@miyoungc miyoungc mentioned this pull request May 12, 2026
12 tasks
miyoungc added a commit that referenced this pull request May 12, 2026
## Summary
Refreshes the release-prep docs for v0.0.39 based on changes merged
since the Friday 4pm doc refresh. Updates the source docs, bumps the
docs version metadata, and regenerates the NemoClaw user skills from the
refreshed docs.

## Changes
- #3314 -> `docs/get-started/prerequisites.md`,
`docs/get-started/quickstart.md`, `docs/reference/troubleshooting.md`:
Documents installer Docker setup, Docker group activation, and retry
guidance.
- #3317 -> `docs/get-started/quickstart.md`,
`docs/reference/commands.md`: Documents the DGX Spark and DGX Station
express install prompt and `NEMOCLAW_NO_EXPRESS`.
- #3328 and #3329 -> `docs/security/best-practices.md`,
`docs/deployment/sandbox-hardening.md`: Updates sandbox capability
hardening docs for the stricter bounding-set and `setpriv` step-down
behavior.
- #3330, #3335, and #3346 -> `docs/inference/use-local-inference.md`:
Documents Windows-host Ollama relaunch behavior, NIM key passthrough,
early health-fail diagnostics, and mixed-GPU preflight detail.
- #2406, #2883, #3001, #3244, #3267, #3318, #3320, and #3354 ->
`docs/about/release-notes.md`: Adds the v0.0.39 release-prep section
while keeping the v0.0.38 release notes intact.
- Advances the release-prep docs metadata from v0.0.38 to v0.0.39.
- Regenerates `.agents/skills/nemoclaw-user-*` from the updated source
docs.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes v0.0.39

* **New Features**
  * Host alias management commands for easier configuration
  * Sandbox GPU control options during onboarding
  * Update command with check and confirmation modes

* **Documentation**
  * Enhanced Linux installer guidance with Docker and group membership handling
  * Expanded troubleshooting for permission and connectivity issues
  * Improved capability-dropping security documentation
  * Updated inference model switching commands
  * Brev environment-specific troubleshooting

* **Improvements**
  * DGX Spark/Station express install flow
  * Windows Ollama relay and health-check enhancements
  * NVIDIA NIM preflight GPU reporting

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@wscurran wscurran added the bug-fix, feature, and area: performance labels and removed the fix and feature labels on Jun 3, 2026
Labels

area: performance, bug-fix

3 participants