Skip to content

refactor(cli): replace race-condition sleeps with readiness probes#2442

Merged
ericksoa merged 23 commits into
NVIDIA:mainfrom
ksapru:fix/replace-race-condition-sleeps
May 13, 2026
Merged

refactor(cli): replace race-condition sleeps with readiness probes#2442
ericksoa merged 23 commits into
NVIDIA:mainfrom
ksapru:fix/replace-race-condition-sleeps

Conversation

@ksapru

@ksapru ksapru commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Replaced arbitrary sleep() delays with deterministic readiness probes (waitForPort, waitForHttp, waitUntil) across the NemoClaw CLI onboarding flow. This improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems.

Refs #2001

Changes

  • Added synchronous wait utilities (waitForPort, waitForHttp, waitUntil) to src/lib/wait.ts.
  • Replaced sleep(1) with waitForPort(OLLAMA_PROXY_PORT, 2) during proxy initialization.
  • Replaced sleep(2) with waitForHttp(http://127.0.0.1:${OLLAMA_PORT}/, 10) when starting the Ollama server.
  • Replaced sleep(5) with waitUntil polling CoreDNS readiness via kubectl get pods.
  • Replaced hardcoded retry loops and sleep(2) calls with waitUntil for applying and removing policy presets.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

  • AI-assisted — tool: Antigravity

Signed-off-by: Krish Sapru ksapru@bu.edu

Summary by CodeRabbit

  • New Features

    • Added active readiness checks for ports and HTTP endpoints to improve service startup detection.
  • Bug Fixes

    • Replaced fixed delays with polling-based waits across startup and configuration flows, improving reliability and clearer error reporting on timeouts.
    • Policy application/removal now retries intelligently for transient "sandbox not found" cases and fails fast for other errors.
  • Tests

    • Specified an explicit timeout for the sandbox readiness test.

@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces fixed sleep delays with polling-based readiness utilities. Adds waitForPort, waitForHttp, and waitUntil, and updates startup/retry flows (auth proxy, CoreDNS patching, Ollama serve, policy preset apply/remove) to poll for readiness and handle timeouts/errors accordingly.

Changes

Cohort / File(s) Summary
Readiness Utilities
src/lib/wait.ts
Adds waitUntil(conditionFn, timeoutSeconds, pollIntervalMs), waitForPort(port, timeoutSeconds) (uses nc), and waitForHttp(url, timeoutSeconds) (uses curl) — synchronous polling helpers that return boolean on success/failure.
Startup & Recovery Flows
src/lib/onboard.ts
Replaces numerous fixed sleep() calls with readiness polling: auth proxy startup now blocks until port reachable (returns/aborts on timeout), CoreDNS patch step polls kube-dns pod readiness (warns on timeout but continues), Ollama serve waits for HTTP root endpoint (errors out or returns to provider selection on timeout), and policy preset apply/remove use waitUntil with selective retry only for sandbox not found errors.
Tests
test/onboard.test.ts
Specifies an explicit 10s Vitest timeout for one test (it(..., { timeout: 10000 })).

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Onboard CLI
    participant Proxy as Auth Proxy
    participant K8s as CoreDNS / kube-dns
    participant Ollama as Ollama Server
    participant Policy as Policy Service

    CLI->>Proxy: spawn auth-proxy process
    CLI->>Proxy: waitForPort(port)
    alt port reachable
        Proxy-->>CLI: success
    else timeout
        Proxy-->>CLI: log error, return/abort
    end

    CLI->>K8s: run fix-coredns.sh
    CLI->>K8s: waitUntil(kube-dns ready)
    alt ready
        K8s-->>CLI: continue
    else timeout
        K8s-->>CLI: warn, continue startup
    end

    CLI->>Ollama: start `ollama serve`
    CLI->>Ollama: waitForHttp(root URL)
    alt http OK
        Ollama-->>CLI: continue
    else timeout
        Ollama-->>CLI: log error, exit or return to selection
    end

    CLI->>Policy: apply/remove preset
    CLI->>Policy: waitUntil(apply/remove succeeds, retry only on "sandbox not found")
    alt success
        Policy-->>CLI: done
    else timeout or non-retry error
        Policy-->>CLI: fail/throw
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐇 A rabbit hums while cables wake,

Polling ports by moonlit lake,
No blind sleeps, just patient cheer,
Services rise when pathways clear,
Hops of code — startup is near! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title accurately describes the main change: replacing arbitrary sleep delays with readiness probes to eliminate race conditions.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

@ksapru ksapru force-pushed the fix/replace-race-condition-sleeps branch from b248aa5 to 1961db8 Compare April 24, 2026 19:56
@ksapru ksapru marked this pull request as ready for review April 24, 2026 19:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

5631-5647: Extract shared retry wrapper for sandbox-not-found transient handling.

The same waitUntil + lastErr + sandbox not found pattern is repeated in three places. A small helper would reduce drift and simplify maintenance.

Also applies to: 6303-6321, 6326-6348

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5631 - 5647, Extract the repeated "waitUntil
+ lastErr + sandbox not found" retry logic into a small helper (e.g.,
retryUntilSandboxReady or runWithSandboxRetry) that accepts an action function
and retries using the existing waitUntil semantics; inside the helper call the
provided action (the body that currently calls policies.applyPreset(sandboxName,
name)), catch errors, store the last error, treat errors whose message includes
"sandbox not found" as transient (return false for retry) and re-throw other
errors, and finally re-throw lastErr if retries fail; then replace the three
duplicated blocks (the code around policies.applyPreset(...) and the similar
blocks at the other locations) with calls to this helper passing the appropriate
parameters (sandboxName, name or other action closures).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 2064: The waitForPort readiness checks are currently invoked without
checking their boolean result, allowing execution to continue on timeout; update
every call site of waitForPort (e.g., the calls in src/lib/onboard.ts where
OLLAMA_PROXY_PORT and the other ports are waited on) to handle failure: await
the boolean result and if it is false, log a clear error and abort the
onboarding flow (throw or return a failure status) instead of proceeding, so the
flow does not race ahead when services aren't ready. Ensure you apply this
pattern consistently to all occurrences of waitForPort in the file.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5631-5647: Extract the repeated "waitUntil + lastErr + sandbox not
found" retry logic into a small helper (e.g., retryUntilSandboxReady or
runWithSandboxRetry) that accepts an action function and retries using the
existing waitUntil semantics; inside the helper call the provided action (the
body that currently calls policies.applyPreset(sandboxName, name)), catch
errors, store the last error, treat errors whose message includes "sandbox not
found" as transient (return false for retry) and re-throw other errors, and
finally re-throw lastErr if retries fail; then replace the three duplicated
blocks (the code around policies.applyPreset(...) and the similar blocks at the
other locations) with calls to this helper passing the appropriate parameters
(sandboxName, name or other action closures).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a9fc8c1c-769e-4fdc-ae2a-a41cfca3a64b

📥 Commits

Reviewing files that changed from the base of the PR and between f7dff7b and 1961db8.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/wait.ts
  • test/onboard.test.ts

Comment thread src/lib/onboard.ts Outdated
Readiness probes (waitForPort, waitForHttp, waitUntil) were returning boolean outcomes that were ignored at their call sites. This allowed the onboarding flow to race ahead even if services like Ollama or CoreDNS failed to become ready within their timeout windows.

This adds explicit error handling to abort initialization or surface warnings if these components fail to start, matching the expected fail-fast behavior.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

2098-2102: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure instead of only logging it.

On Lines 2098-2102 this helper returns early on timeout, but setupInference() still continues with the persisted token. That leaves onboarding configured for ollama-local even though nothing is listening on OLLAMA_PROXY_PORT.

Suggested fix
-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;

   const pid = loadPersistedProxyPid();
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }

   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
   ollamaProxyToken = token;
   spawnOllamaAuthProxy(token);
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    ollamaProxyToken = null;
+    return false;
   }
+  return true;
 }
if (!ensureOllamaAuthProxy()) {
  process.exit(1);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2098 - 2102, The current check using
waitForPort(OLLAMA_PROXY_PORT, 2) only logs and returns early, allowing
setupInference() to continue with a persisted token despite the proxy not
running; change the helper so the failure is propagated (either throw an error
or return a boolean) and ensure callers handle it—e.g., implement/replace with
ensureOllamaAuthProxy() that returns false on timeout (or throws), update the
caller of this check to call ensureOllamaAuthProxy() and call process.exit(1)
(or rethrow) when it returns false to prevent setupInference() from proceeding
without a running Ollama auth proxy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3342-3348: The current coredns readiness check (corednsReady using
waitUntil and runCaptureOpenshell with "kubectl get pods" looking for the string
"coredns") only detects pod listing and can be true while the pod is not Ready;
move this check inside the branch that runs when shouldPatchCoredns(runtime) is
true (so it only runs when we actually patched CoreDNS), and change the kubectl
invocation to assert actual readiness (e.g., query pod status/conditions or
containerStatuses via "kubectl get pods -n kube-system -l <coredns-label> -o
json" or use jsonpath to ensure status.phase == Running and all
containerStatuses[].ready == true) and have waitUntil rely on that more precise
readiness condition instead of a substring match from runCaptureOpenshell.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2098-2102: The current check using waitForPort(OLLAMA_PROXY_PORT,
2) only logs and returns early, allowing setupInference() to continue with a
persisted token despite the proxy not running; change the helper so the failure
is propagated (either throw an error or return a boolean) and ensure callers
handle it—e.g., implement/replace with ensureOllamaAuthProxy() that returns
false on timeout (or throws), update the caller of this check to call
ensureOllamaAuthProxy() and call process.exit(1) (or rethrow) when it returns
false to prevent setupInference() from proceeding without a running Ollama auth
proxy.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f00ad4ce-2dfc-456b-92d6-6b88e596fdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 1961db8 and ef67b43.

📒 Files selected for processing (2)
  • .agents/skills/nemoclaw-user-reference/references/commands.md
  • src/lib/onboard.ts
✅ Files skipped from review due to trivial changes (1)
  • .agents/skills/nemoclaw-user-reference/references/commands.md

Comment thread src/lib/onboard.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

3341-3360: ⚠️ Potential issue | 🟠 Major

This CoreDNS probe can still pass before all pods are ready.

The combined jsonpath string only proves that at least one matched pod is Running, at least one container reported true, and no literal false appeared. If one CoreDNS pod is ready and another is still Pending with no containerStatuses, this returns true and the flow moves on too early. Parse the pod list structurally and require every matched pod to be Running with all containers ready, or switch to a direct kubectl wait --for=condition=Ready check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3341 - 3360, The current coredns readiness
probe in corednsReady (using waitUntil and runCaptureOpenshell) can return true
prematurely because the jsonpath check only verifies at-least-one-running and
presence/absence of literal strings; change the probe to either (A) run kubectl
get pods -n kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and
ensure every item has .status.phase === "Running" and every container in
.status.containerStatuses has ready === true before returning success, or (B)
replace the check with a direct kubectl wait --for=condition=Ready pod -n
kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use
runCaptureOpenshell to evaluate its success; update corednsReady accordingly.

2098-2102: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure to the caller.

On Line 2098, ensureOllamaAuthProxy() only logs and returns, but it leaves ollamaProxyToken populated. setupInference() can then continue configuring ollama-local against a dead proxy, so resume/reconnect flows fail later and less clearly. Make this helper return a boolean like startOllamaAuthProxy() and abort the caller on false.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2098 - 2102, ensureOllamaAuthProxy currently
logs and returns on proxy start timeout but leaves ollamaProxyToken set,
allowing setupInference to proceed against a dead proxy; change
ensureOllamaAuthProxy to return a boolean (e.g., rename or implement as
startOllamaAuthProxy/ensureOllamaAuthProxy returning true on success, false on
failure), ensure it clears ollamaProxyToken when failing, and update callers
(notably setupInference) to abort/return early when the helper returns false so
the caller does not continue configuring ollama-local against a nonfunctional
proxy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5665-5680: The waitUntil loop assumes
policies.applyPreset(sandboxName, name) either throws or succeeds, but
applyPreset can return false; update the waitUntil callback to capture the
return value of policies.applyPreset(sandboxName, name) (call it e.g. applied),
and if applied is falsy set lastErr = new Error(`applyPreset returned false for
preset ${name}`) and return false so the loop will retry and ultimately trigger
the existing failure path; keep the existing catch behavior for thrown errors
unchanged.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3341-3360: The current coredns readiness probe in corednsReady
(using waitUntil and runCaptureOpenshell) can return true prematurely because
the jsonpath check only verifies at-least-one-running and presence/absence of
literal strings; change the probe to either (A) run kubectl get pods -n
kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and ensure
every item has .status.phase === "Running" and every container in
.status.containerStatuses has ready === true before returning success, or (B)
replace the check with a direct kubectl wait --for=condition=Ready pod -n
kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use
runCaptureOpenshell to evaluate its success; update corednsReady accordingly.
- Around line 2098-2102: ensureOllamaAuthProxy currently logs and returns on
proxy start timeout but leaves ollamaProxyToken set, allowing setupInference to
proceed against a dead proxy; change ensureOllamaAuthProxy to return a boolean
(e.g., rename or implement as startOllamaAuthProxy/ensureOllamaAuthProxy
returning true on success, false on failure), ensure it clears ollamaProxyToken
when failing, and update callers (notably setupInference) to abort/return early
when the helper returns false so the caller does not continue configuring
ollama-local against a nonfunctional proxy.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21700ac2-f12a-4a5a-bd5f-6bab1ffa07bf

📥 Commits

Reviewing files that changed from the base of the PR and between ef67b43 and d3c5e5c.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

Comment thread src/lib/onboard.ts Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
src/lib/onboard.ts (3)

2314-2318: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure out of ensureOllamaAuthProxy().

Returning from this void helper only stops the helper. Line 5572 still continues with the persisted token, so onboarding can configure ollama-local even though the proxy never came back. Make this helper return a boolean and abort the caller on false.

Suggested fix
-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;
@@
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }
@@
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    return false;
   }
+
+  return true;
 }
if (!isWsl() && !ensureOllamaAuthProxy()) {
  process.exit(1);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2314 - 2318, The helper
ensureOllamaAuthProxy currently returns void and simply returns early when
waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing the caller to continue with a
persisted token; change ensureOllamaAuthProxy to return a boolean (true on
success, false on failure), have the waitForPort failure branch return false
instead of void, and update callers (e.g., the code path that checks isWsl() and
later configures ollama-local) to check the boolean result and abort/exit when
false to prevent proceeding without a ready proxy.

3589-3611: ⚠️ Potential issue | 🟠 Major

This CoreDNS predicate can still pass before all pods are ready.

The aggregated substring check succeeds for states like Running true Pending—one old pod ready, one replacement pod still pending—because there's no literal false to trip the guard. That reopens the DNS race window, and the recovery path at Lines 3690-3709 drops the timeout result entirely. Parse pod JSON per item and require every matched pod to be Running with all containerStatuses[].ready === true, then surface timeout consistently in both call sites.

Also applies to: 3690-3709


6047-6062: ⚠️ Potential issue | 🟠 Major

Handle applyPreset() returning false in the non-interactive path.

waitUntil() currently treats any non-throwing applyPreset() call as success. If applyPreset() returns false, this loop exits immediately and onboarding reports success even though the preset was never applied. syncPresetSelection() already handles this correctly later in the file; this path should match it.

Suggested fix
       let lastErr: Error | null = null;
       const success = waitUntil(() => {
         try {
-          policies.applyPreset(sandboxName, name);
+          if (!policies.applyPreset(sandboxName, name)) {
+            lastErr = new Error(`applyPreset returned false for preset '${name}'.`);
+            return false;
+          }
           return true;
         } catch (err) {
           lastErr = err as Error;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6047 - 6062, The waitUntil predicate
currently treats any non-throwing call to policies.applyPreset(sandboxName,
name) as success; change it so you call policies.applyPreset(...) and inspect
its boolean return: return true only when the call returns true, and when it
returns false set lastErr to an Error describing that the preset was not applied
for the given sandboxName/name and return false so the loop continues; keep the
existing catch-path that captures thrown errors and rethrows non-"sandbox not
found" errors, and preserve the final if (!success && lastErr) throw lastErr
behavior to surface failures (this makes the non-interactive path match
syncPresetSelection).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2314-2318: The helper ensureOllamaAuthProxy currently returns void
and simply returns early when waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing
the caller to continue with a persisted token; change ensureOllamaAuthProxy to
return a boolean (true on success, false on failure), have the waitForPort
failure branch return false instead of void, and update callers (e.g., the code
path that checks isWsl() and later configures ollama-local) to check the boolean
result and abort/exit when false to prevent proceeding without a ready proxy.
- Around line 6047-6062: The waitUntil predicate currently treats any
non-throwing call to policies.applyPreset(sandboxName, name) as success; change
it so you call policies.applyPreset(...) and inspect its boolean return: return
true only when the call returns true, and when it returns false set lastErr to
an Error describing that the preset was not applied for the given
sandboxName/name and return false so the loop continues; keep the existing
catch-path that captures thrown errors and rethrows non-"sandbox not found"
errors, and preserve the final if (!success && lastErr) throw lastErr behavior
to surface failures (this makes the non-interactive path match
syncPresetSelection).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da63b43f-6a28-4d43-8dc9-f2f503a3effc

📥 Commits

Reviewing files that changed from the base of the PR and between d3c5e5c and be73a0b.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/onboard.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

2310-2318: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure instead of silently continuing.

ensureOllamaAuthProxy() still can't tell its caller that restart failed. On Line 2312 it publishes the persisted token before the new proxy is verified, and this path never checks that the spawned PID is actually ollama-auth-proxy.js. If another process is already listening on :${OLLAMA_PROXY_PORT}, waitForPort() returns true and setupInference() continues with a stale token against the wrong service. Return a boolean here, verify the spawned PID like startOllamaAuthProxy(), and abort the Ollama setup when it returns false.

Proposed fix
-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;

   const pid = loadPersistedProxyPid();
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }

   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
-  ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    return false;
   }
+  if (!isOllamaProxyProcess(restartedPid)) {
+    console.error(`  Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`);
+    return false;
+  }
+  ollamaProxyToken = token;
+  return true;
 }
-      ensureOllamaAuthProxy();
+      if (!ensureOllamaAuthProxy()) {
+        process.exit(1);
+      }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2310 - 2318, ensureOllamaAuthProxy()
currently publishes ollamaProxyToken and spawns a proxy without verifying the
new process or port owner; change it to return a boolean indicating success,
move assignment of ollamaProxyToken until after verification, and replace the
spawn/verify sequence with a call to startOllamaAuthProxy() (which should return
the spawned PID or process identity) so you can check the PID/process matches
the expected ollama-auth-proxy; only set ollamaProxyToken and return true if
waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches
the expected binary, otherwise kill any started process, log the failure, return
false so setupInference() can abort when ensureOllamaAuthProxy() returns false.

3589-3611: ⚠️ Potential issue | 🟠 Major

Tighten the CoreDNS readiness predicate before treating the patch as complete.

These probes can still pass too early. The current check only requires some "Running" and some "true" with no "false", so a rollout with one ready pod and another Pending pod that has no containerStatuses yet will be treated as ready. In the recovery path, the timeout result is also discarded entirely. Parse the pod JSON and require every matched pod to be Running with every container ready === true, then handle timeout consistently in both call sites.

Proposed fix
+function areCorednsPodsReady(): boolean {
+  const raw = runCaptureOpenshell(
+    [
+      "doctor",
+      "exec",
+      "--",
+      "kubectl",
+      "get",
+      "pods",
+      "-n",
+      "kube-system",
+      "-l",
+      "k8s-app=kube-dns",
+      "-o",
+      "json",
+    ],
+    { ignoreError: true },
+  );
+  try {
+    const parsed = JSON.parse(raw || "{}");
+    const items = Array.isArray(parsed.items) ? parsed.items : [];
+    return (
+      items.length > 0 &&
+      items.every((item) => {
+        const statuses = Array.isArray(item?.status?.containerStatuses)
+          ? item.status.containerStatuses
+          : [];
+        return (
+          item?.status?.phase === "Running" &&
+          statuses.length > 0 &&
+          statuses.every((status) => status?.ready === true)
+        );
+      })
+    );
+  } catch {
+    return false;
+  }
+}
-    const corednsReady = waitUntil(() => {
-      const check = runCaptureOpenshell(...);
-      return check.includes("Running") && !check.includes("false") && check.includes("true");
-    }, 10);
+    const corednsReady = waitUntil(areCorednsPodsReady, 10);
     if (!corednsReady) {
       console.warn("  CoreDNS did not report ready within timeout; continuing may cause DNS flakiness.");
     }
-        waitUntil(() => {
-          const check = runCaptureOpenshell(...);
-          return check.includes("Running") && !check.includes("false") && check.includes("true");
-        }, 10);
+        const corednsReady = waitUntil(areCorednsPodsReady, 10);
+        if (!corednsReady) {
+          console.warn("  CoreDNS did not report ready within timeout during recovery; continuing may cause DNS flakiness.");
+        }

Also applies to: 3690-3709

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3589 - 3611, The current CoreDNS readiness
predicate (corednsReady) uses string checks on runCaptureOpenshell output and
can falsely report ready; update the predicate in waitUntil to call kubectl with
'-o json' (via runCaptureOpenshell), JSON.parse the output, and require that
every returned pod has status.phase === "Running" and that every
pod.status.containerStatuses exists and every containerStatus.ready === true;
ensure waitUntil returns false on timeout and use that false value consistently
at both call sites (the corednsReady check and the duplicate site) to log a
warning or abort as appropriate rather than discarding the timeout result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2310-2318: ensureOllamaAuthProxy() currently publishes
ollamaProxyToken and spawns a proxy without verifying the new process or port
owner; change it to return a boolean indicating success, move assignment of
ollamaProxyToken until after verification, and replace the spawn/verify sequence
with a call to startOllamaAuthProxy() (which should return the spawned PID or
process identity) so you can check the PID/process matches the expected
ollama-auth-proxy; only set ollamaProxyToken and return true if
waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches
the expected binary, otherwise kill any started process, log the failure, return
false so setupInference() can abort when ensureOllamaAuthProxy() returns false.
- Around line 3589-3611: The current CoreDNS readiness predicate (corednsReady)
uses string checks on runCaptureOpenshell output and can falsely report ready;
update the predicate in waitUntil to call kubectl with '-o json' (via
runCaptureOpenshell), JSON.parse the output, and require that every returned pod
has status.phase === "Running" and that every pod.status.containerStatuses
exists and every containerStatus.ready === true; ensure waitUntil returns false
on timeout and use that false value consistently at both call sites (the
corednsReady check and the duplicate site) to log a warning or abort as
appropriate rather than discarding the timeout result.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 67ec12f5-c28b-43bc-b699-7eb1fda0245c

📥 Commits

Reviewing files that changed from the base of the PR and between be73a0b and 761374d.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

@wscurran wscurran added NemoClaw CLI refactor PR restructures code without intended behavior change labels Apr 27, 2026
@wscurran

Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this pull request that proposes a way to refactor the NemoClaw CLI by replacing arbitrary sleep delays with readiness probes. This change improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems.

@cv cv closed this May 12, 2026
@cv cv reopened this May 12, 2026
@ksapru ksapru force-pushed the fix/replace-race-condition-sleeps branch from bf098d1 to c0b43c2 Compare May 12, 2026 16:18

@cv cv left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it!

@cv cv added the v0.0.40 label May 12, 2026
@ksapru ksapru force-pushed the fix/replace-race-condition-sleeps branch from 76c9220 to c0b43c2 Compare May 12, 2026 20:49
jyaunches and others added 7 commits May 12, 2026 16:54
…recover modules (NVIDIA#2398)

## Summary

Extract the scattered dashboard delivery chain logic — URL derivation,
CORS origins, port forwarding target, health probing, and recovery —
from the monolith files (onboard.ts, nemoclaw.ts) into three focused
modules that serve as the single source of truth for dashboard
reachability. Fixes NVIDIA#2342 false "Offline" reading when device auth is
enabled.

## Related Issue

Fixes NVIDIA#2390
Fixes NVIDIA#2342

## Changes

- **New `src/lib/dashboard-contract.ts`**: Pure `buildChain()` function
— single source of truth for all dashboard config (access URL, CORS
origins, forward target, health endpoint, port, bind address). No
`process.env` reads.
- **New `src/lib/dashboard-health.ts`**: `verifyDashboardChain()` checks
all 3 delivery chain links (gateway, forward, CORS) with injected deps.
Accepts HTTP 200 and 401 as "alive" (fixes NVIDIA#2342).
- **New `src/lib/dashboard-recover.ts`**: `recoverDashboardChain()` —
link-aware, idempotent recovery that only fixes broken links (gateway →
forward, CORS diagnose-only).
- **Deleted `src/lib/dashboard.ts`** — replaced by
dashboard-contract.ts.
- **Deleted 7 dashboard helper functions from `src/lib/onboard.ts`**
(~90 lines removed): `getDashboardForwardPort`,
`getDashboardForwardTarget`, `getDashboardForwardStartCommand`,
`buildAuthenticatedDashboardUrl`, `getWslHostAddress`,
`getDashboardAccessInfo`, `getDashboardGuidanceLines`.
- **Rewritten `ensureDashboardForward`** in onboard.ts to use
`buildChain()` instead of re-deriving config.
- **Fixed `isSandboxGatewayRunning()`** in nemoclaw.ts to probe
`/health` and accept 401 as alive.
- **Deleted `recoverSandboxProcesses()` and
`ensureSandboxPortForward()`** from nemoclaw.ts —
`checkAndRecoverSandboxProcesses()` delegates to
`recoverDashboardChain()`.
- **Updated `test/onboard.test.ts`** — imports from new modules,
assertions rewritten for contract API.
- **52 new tests** across 3 test files (32 contract + 13 health + 7
recovery).

## Type of Change
- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Improved dashboard health checks with automated, link-aware recovery
and better WSL URL/forwarding support.

* **Bug Fixes**
* More accurate gateway liveness detection and more reliable
port/forwarding behavior across loopback, IPv6, and remote hosts.

* **Tests**
* Added comprehensive test suites covering dashboard configuration,
health verification, and recovery flows.

* **Refactor**
* Consolidated dashboard URL/forwarding logic into a dedicated contract
for clearer, consistent behavior.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…VIDIA#2343) (NVIDIA#2408)

## Summary

Follow-up to NVIDIA#2356. The adversarial PR review found that `procps` tools
(`ps`, `top`, `free`, `uptime`, `vmstat`) were still missing inside a
real sandbox runtime even after NVIDIA#2356 added `procps` to
`Dockerfile.base`. Root cause: the GHCR base image may not have been
rebuilt yet, and the `Dockerfile` hardening step (`apt-get autoremove
--purge`) could strip it.

This PR adds a three-layer defense in the production `Dockerfile`:
- **`apt-mark manual procps`** before the autoremove step, so apt never
considers it auto-removable
- **Conditional `apt-get install procps`** if `ps` is still missing
after hardening (handles stale GHCR base images that predate NVIDIA#2356)
- **Build-time `ps --version`** smoke test that fails the Docker build
if procps didn't make it through

Plus regression tests:
- Static guards in `sandbox-provisioning.test.ts` verifying both
Dockerfiles contain procps provisions
- Runtime E2E test in `e2e-test.sh` verifying all five procps tools are
executable inside the sandbox container

## Test plan

- [ ] `hadolint Dockerfile` passes
- [ ] `hadolint Dockerfile.base` passes
- [ ] `vitest run test/sandbox-provisioning.test.ts` passes
- [ ] CI `test-e2e-sandbox` job passes (runs `e2e-test.sh` inside
container, including new test 11)
- [ ] Docker build succeeds with both fresh base image and stale GHCR
base

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added end-to-end checks that verify common system debug tools (ps,
top, free, uptime, vmstat) run in the sandbox.
* Added regression tests ensuring Docker provisioning includes the debug
tooling.

* **Chores**
* Improved Docker provisioning to ensure procps/debug tools are present
at runtime, with a fallback install path for older base images.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
This PR completes the repo-wide type-safety cleanup by removing the
remaining
`// @ts-nocheck` markers and tightening broad escape hatches across the
CLI,
plugin, blueprint, scripts, and tests. It replaces
`any`/`unknown`/assertion-
heavy paths with narrower local domains while keeping the full
correctness gate
green.

- remove all remaining `// @ts-nocheck` markers and retire explicit
`any`,
`Record<string, unknown>`, type assertions, and non-null assertions that
were
  still inflating the hotspot report
- tighten high-churn CLI/onboarding paths in `src/lib/onboard.ts`,
`src/lib/onboard-session.ts`, `src/lib/runner.ts`,
`src/lib/credentials.ts`,
`src/lib/deploy.ts`, and related helpers with narrower
persisted-session,
  errno, config, and subprocess types
- tighten plugin and blueprint parsing paths in
  `nemoclaw/src/blueprint/runner.ts`, `snapshot.ts`, `state.ts`,
`onboard/config.ts`, and `commands/migration-state.ts` with concrete
source
  shapes around YAML and JSON boundaries
- update scripts and tests, including `scripts/type-safety-hotspots.ts`,
`scripts/bump-version.ts`, and many `test/*.ts` files, so the stricter
typing
  is exercised without weakening lint, TypeScript, or test gates
- reduce the tracked type-safety hotspots from
`59 @ts-nocheck / 22 explicit any / 83 record unknown / 229 type
assertions /
  50 non-null assertions / 251 unknowns` to `0 / 0 / 0 / 0 / 0 / 32`

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

- [x] AI-assisted — tool: pi coding agent

---
Signed-off-by: Carlos Villela <cvillela@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

* **Improvements**
* Stronger runtime validation and safer parsing across configs,
snapshots, and state—reduces crashes on corrupt or malformed files and
yields clearer error messages.
* Tighter input handling for plugins, onboarding, and runtime commands
to improve reliability and predictability.

* **Bug Fixes**
* More robust network/proxy and TLS handling during websocket/https
upgrades to avoid misrouted requests and fallback safely.

* **Tests**
* Expanded and hardened test coverage to prevent regressions and improve
stability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
) (NVIDIA#2446)

## Summary

Create a typed command registry (`src/lib/command-registry.ts`) as the
single source of truth for all 46 NemoClaw CLI commands. Derive help
output, dispatch arrays, and docs-parity checking from it, replacing
three independent hand-maintained sources. Re-enable the
`cloud-experimental-e2e` nightly job now that the check-docs parity
check passes reliably.

## Related Issue

Fixes NVIDIA#2388

## Changes

- **New `src/lib/command-registry.ts`**: Defines `CommandDef` interface,
`CommandGroup` union type, and `COMMANDS` array with all 46 CLI commands
(23 global + 23 sandbox). Exports helper functions: `globalCommands`,
`sandboxCommands`, `visibleCommands`, `commandsByGroup`,
`canonicalUsageList`, `globalCommandTokens`, `sandboxActionTokens`.
- **New `src/lib/command-registry.test.ts`**: 21 unit tests covering all
registry invariants — counts, deduplication, ordering, token extraction,
group membership.
- **Rewritten `help()` in `src/nemoclaw.ts`**: Now iterates
`commandsByGroup()` from registry instead of hand-written console.log
lines. Reconfiguration section rewritten with bullet points to prevent
phantom `nemoclaw`-prefixed parser matches.
- **Derived dispatch arrays**: `GLOBAL_COMMANDS` now calls
`globalCommandTokens()` and `sandboxActions` now calls
`sandboxActionTokens()` — no more hand-maintained sets/arrays.
- **Added `--dump-commands` internal flag**: Prints canonical command
list for check-docs.sh parity checks, replacing the fragile perl
help-parsing pipeline.
- **Simplified `check-docs.sh`**: Phase 1 now uses `--dump-commands`
instead of perl regex extraction. Phase 2 strips `<arg>` and
`[optional]` placeholders from doc headings for clean comparison.
- **Added missing doc headings**: `nemoclaw onboard --from`, `nemoclaw
setup`, `nemoclaw start`, `nemoclaw stop` in
`docs/reference/commands.md`.
- **Re-enabled `cloud-experimental-e2e`**: Removed DISABLED comment,
`CLOUD_EXPERIMENTAL_E2E_ENABLED` variable gate, and `SKIP_CHECK_DOCS`
flag.
- **Updated `test/image-cleanup.test.ts`**: Adapted to verify `gc` via
registry source instead of regex-matching the old `GLOBAL_COMMANDS` Set
literal.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [x] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [ ] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code (pi agent)

---
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added `nemoclaw onboard --from <Dockerfile>` usage guide and
documented deprecated aliases (`nemoclaw start`, `nemoclaw stop`,
`nemoclaw setup`) with warnings.

* **New Features**
* Added a developer-facing `--dump-commands` option to emit the
canonical command list.

* **Improvements**
* Overhauled help/command organization for clearer grouping and more
accurate, stable command listings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Carlos Villela <cvillela@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
)

Add tokenFormat/tokenFormatHint and appTokenFormat/appTokenFormatHint to ChannelDef
interface. Wire format checks into setupMessagingChannels() so invalid tokens are
rejected at the prompt instead of silently saved and surfacing as cryptic auth
failures after sandbox build.

- Slack bot token: ^xoxb-[A-Za-z0-9_-]+$
- Slack app token: ^xapp-[A-Za-z0-9_-]+$
- Non-interactive mode intentionally unchanged (env-var tokens assumed intentional)
- Future channels can opt in by adding tokenFormat/appTokenFormat to their definition

Closes NVIDIA#1912

Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
…VIDIA#2447)

## Summary

Follow-up to PR NVIDIA#2422 (`refactor(types): tighten repo-wide type
boundaries`). Addresses the review warnings and suggestions from triage.

## Changes

### 1. Shared `isErrnoException` / `isPermissionError` —
`src/lib/errno.ts`

PR NVIDIA#2422 removed all `@ts-nocheck` markers, which required each module
to define its own `isErrnoException` + `ErrnoLike` type locally. This
resulted in 6 slightly different copies across `config-io`,
`credentials`, `http-probe`, `onboard`, `onboard-session`, and
`registry`.

This PR extracts a single canonical version into `src/lib/errno.ts`
that:
- Accepts `unknown` so callers never need an `instanceof Error` pre-cast
- Checks for both `code` and `errno` properties (covering all Node.js
error shapes)
- Includes a convenience `isPermissionError` for the common EACCES/EPERM
guard

This also simplifies many catch blocks that previously did `error
instanceof Error && isErrnoException(error)` — the redundant `instanceof
Error` check is now handled inside the shared function.

### 2. Shared JSON recursive types — `src/lib/json-types.ts`

PR NVIDIA#2422 introduced identical Scalar/Value/Object type triples in
multiple modules:
- `LooseScalar` / `LooseValue` / `LooseObject` in `onboard.ts` and
`agent-onboard.ts`
- `PolicyScalar` / `PolicyValue` / `PolicyObject` in `policies.ts`
- `SessionJsonPrimitive` / `SessionJsonValue` / `UnknownRecord` in
`onboard-session.ts`

This PR extracts canonical `JsonScalar` / `JsonValue` / `JsonObject`
types into `src/lib/json-types.ts`. Each module re-aliases them under
its domain names for readability.

> **Note:** The plugin types
(`PluginScalar`/`PluginValue`/`PluginRecord` in `nemoclaw/src/index.ts`)
are intentionally kept in-place because the plugin compiles separately
from the CLI and cannot share imports.

### 3. `Reflect.get` / `Reflect.set` convention comment

Adds a clarifying comment at the first usage site (`usage-notice.ts`)
explaining why `Reflect.get` is preferred over `as Record<…>` casts
throughout the codebase — it avoids widening the target type and
satisfies `no-unsafe-member-access` lint rules.

### 4. `ws-proxy-fix.ts` behavioral guard documentation

Documents the `!host` null guard in the Discord WebSocket tunnel path,
explaining that the previous code would have attempted a tunnel with an
undefined host (which would fail downstream anyway). The guard now
explicitly falls through to the original `https.request` unchanged.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `tsc -p tsconfig.cli.json --noEmit` passes
- [x] `npm test` passes (only pre-existing failures in install-preflight
and ssrf-parity)
- [x] Tests added for new `errno.ts` module (15 test cases)
- [x] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)

## AI Disclosure
- [x] AI-assisted — tool: pi coding agent

---
Signed-off-by: Jess Yaunches <jyaunches@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved WebSocket proxy handling to avoid creating tunnels when the
request target host is undefined.

* **Tests**
  * Added tests validating errno/permission detection utilities.

* **Documentation**
  * Added explanatory comment clarifying property-access rationale.

* **Chores**
* Centralized JSON-type aliases and errno helpers for consistent error
handling across modules.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jess Yaunches <jyaunches@nvidia.com>
…VIDIA#2458)

## Summary

Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in
`test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token
validation tests) don't satisfy the `strict: true` setting introduced by
NVIDIA#2422. Every PR opened against main inherits these errors, which makes
the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454).

## Related

Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only
breakage caught by CI on downstream PRs.

## Changes

`test/onboard.test.ts` — 8 annotations, zero behavior change:

- **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each
site is immediately preceded by an `assert.equal(result.status, 0)` that
guarantees non-empty stdout, so the non-null assertion is safe.
- **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries
only read `.key`.
- **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS`
entry lookup only reads `.name`.

The tests themselves still exercise the same cases; these changes are
purely to make the TypeScript compiler happy.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors)
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Tests
* Improved reliability of JSON payload extraction in onboarding tests
with defensive null-checks before parsing.
* Enhanced type safety in test callbacks and iterators with explicit
TypeScript type annotations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cjagwani and others added 11 commits May 12, 2026 16:54
## Summary

Prevents the class of regression that required NVIDIA#2458. Adds an
unconditional `typecheck:cli` step to the shared CI action so future
hook-filter drift can't hide a tsc error.

## Related

Follow-up to NVIDIA#2458 (which fixed the symptom — this addresses the root
cause).

## Root cause recap

- `.pre-commit-config.yaml` had `files: ^(bin|scripts)/` on the
`tsc-cli` hook.
- PRs that touched only `test/` or `src/` never triggered the check.
- NVIDIA#2130 (test-only Slack validation) slipped through that gap.
- NVIDIA#2422 later turned on `strict: true`, and the latent errors surfaced —
but for every downstream PR, not the PR that introduced them.
- Six open PRs ended up blocked on the same error cluster before anyone
noticed.

## Changes

`.github/actions/basic-checks/action.yaml` — new **Typecheck CLI + tests
(strict)** step running `npm run typecheck:cli` unconditionally.
Independent of prek hook configuration, so future filter drift can't
hide a tsc error from CI.

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] YAML parses
- [x] `npm run typecheck:cli` exits 0 on this branch (requires NVIDIA#2458
merged to main first)
- [x] Tests added or updated for new or changed behavior (N/A — infra
change, existing suite covers)
- [x] No secrets, API keys, or credentials committed

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Enhanced continuous integration pipeline with additional TypeScript
typechecking to improve code quality assurance and detect type-related
issues earlier in the development process.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NVIDIA#2436)

## Summary

- Adds `lstatSync`-based symlink rejection to `createSnapshot()` and
`rollbackFromSnapshot()` in `nemoclaw/src/blueprint/snapshot.ts`,
closing two attack surfaces reported by an external contributor doing
Yocto integration:
1. **`createSnapshot()`** calls `cpSync(OPENCLAW_DIR, dest)` with no
`lstatSync` check on the source — if `~/.openclaw` is a symlink (e.g. to
`/etc`), `cpSync` follows it and copies arbitrary files into the
snapshot
2. **`rollbackFromSnapshot()`** calls `cpSync(source, OPENCLAW_DIR)`
without verifying the destination isn't a symlink — an attacker-planted
symlink could redirect restored content to an arbitrary directory
- Updates `collectFiles()` to detect symlinks via `isSymbolicLink()` and
record them in the snapshot manifest (previously symlinks were silently
omitted)
- Follows the `rejectSymlinksOnPath()` pattern established in PR NVIDIA#2290
for the `~/.nemoclaw` credentials directory
- **Also guards the CLI code path**
(`src/lib/sandbox-state.ts::backupSandboxState()`) that `nemoclaw
<sandbox> snapshot create` actually uses — the blueprint helper is a
separate implementation not called from the CLI
- `rollbackFromSnapshot()` symlink rejection returns `false` (preserving
the boolean-return contract) rather than throwing

### Severity

Low-Medium. Both attacks require the attacker to already have user-level
write access to `$HOME` to plant the symlink, and the sandbox's Landlock
/ `chattr +i` / read-only mount protections prevent in-sandbox
exploitation. However, this is the same symlink attack class fixed in
NVIDIA#2290 (different code paths: `cpSync` + directory walker vs.
`mkdirSync`), and defense-in-depth warrants closing the gap.

## Test plan

- [x] New test: `createSnapshot` rejects when `~/.openclaw` is a symlink
- [x] New test: `createSnapshot` rejects when an ancestor of
`~/.nemoclaw` is a symlink
- [x] New test: `createSnapshot` records symlinks in manifest when
present in tree
- [x] New test: `rollbackFromSnapshot` returns false (not throws) when
`~/.openclaw` is a symlink
- [x] All existing snapshot tests pass (no regressions)
- [x] Full plugin test suite: 362 passed
- [x] CLI test suite: 1959/1961 passed (2 pre-existing flaky installer
timeouts)
- [x] TypeScript build clean (`tsc --noEmit`)
- [x] All pre-commit and pre-push linters/formatters pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Snapshots now detect and record symbolic links in the snapshot
manifest and refuse snapshot creation or restoration when critical
config directories or their ancestor paths are symlinks.
* Backup destination paths are validated to prevent writing into
locations with symlinked ancestor paths.

* **Tests**
* Added tests covering symlink detection, manifest recording, rejection
scenarios, and rollback behavior when config paths are symlinks.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…VIDIA#2440)

## Summary
- When `CHAT_UI_URL` specifies a custom port, `NEMOCLAW_DASHBOARD_PORT`
was only injected into the sandbox if the host env var was explicitly
set, causing the gateway to listen on the wrong port
- Always derive the effective port from `chatUiUrl` via
`getDashboardForwardPort()` and inject it unconditionally
- Bump vitest timeouts on flaky openshell gateway tests

Closes NVIDIA#2267

## Test plan
- [x] All 2302 existing tests pass
- [ ] Verify sandbox starts with correct `NEMOCLAW_DASHBOARD_PORT` when
`CHAT_UI_URL` uses a non-default port (e.g. `:18790`)
- [ ] Verify default port (`18789`) still works when `CHAT_UI_URL` is
unset

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Dashboard forwarding now consistently derives and injects the
dashboard port from CHAT_UI_URL (including custom ports), preventing
mismatches that could block dashboard access.

* **New Features**
* Deterministic dashboard port derivation from CHAT_UI_URL for sandbox
startup and health checks.

* **Documentation**
* Updated onboarding, quickstart, reference, and troubleshooting guides
and examples to prefer CHAT_UI_URL with auto-derived ports.

* **Tests**
* Added/adjusted tests and extended timeouts to validate port injection
and improve CLI/onboard reliability.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Ubuntu <ubuntu@nemoclaw-f6a691-inst-3cs11f3ytmmzubymi5nzdncnwia.us-west1-a.c.brevdevprod.internal>
Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
…DIA#2445)

## Summary

Reorganize the Get Started section, document the full runtime topology
in the Architecture reference, and clarify how to uninstall NemoClaw.
Prerequisites now live on their own page (with the tested-platform
matrix included directly from the README), Windows pre-setup is renamed
to `windows-preparation.md`, and DGX Spark readers are pointed to the
NVIDIA Spark playbook rather than an in-repo tutorial.

## Changes

**Get Started reorganization**
- New `docs/get-started/prerequisites.md`. Extracts Hardware, Software,
and the tested-platform matrix from Quickstart into its own page. The
matrix is pulled in via Sphinx `{include}` from the README so there is a
single source of truth.
- `docs/get-started/windows-setup.md` →
`docs/get-started/windows-preparation.md` (renamed to "Prepare Windows
for NemoClaw"). Nested under Prerequisites in the toctree.
- `docs/get-started/quickstart.md` no longer duplicates the
Prerequisites section; it links to the new page instead.
- `docs/index.md` Get Started toctree is now `Prerequisites` then
`Quickstart`.
- `docs/conf.py` redirects `get-started/windows-setup` →
`windows-preparation.html` so the pre-reorg URL still resolves.

**Architecture reference: Deployment Topology**
- New `## Deployment Topology` section in
`docs/reference/architecture.md` with a mermaid diagram showing the
runtime substrate — host CLI → Docker daemon → OpenShell gateway
container → embedded k3s cluster → sandbox pod → OpenClaw agent + plugin
— plus the inference router's role on the path to external providers.
- Per-layer role table (host CLI, Docker daemon, gateway container, k3s,
sandbox pod, inference router).
- Points DGX Spark readers to the [NVIDIA Spark
playbook](https://build.nvidia.com/spark/nemoclaw) for the
Spark-specific variant.

**Uninstall documentation**
- `docs/reference/commands.md` `nemoclaw uninstall` section corrects a
drift bug (the CLI does *not* auto-fall-back to a remote script; it
prints the versioned URL for manual review) and adds a comparison table
between `nemoclaw uninstall` and the hosted `curl … | bash` form.
- README and Quickstart Uninstall sections now lead with `nemoclaw
uninstall` (version-pinned, no network fetch) and keep the `curl … |
bash` form as a fallback for when the CLI is missing or broken. Both
link to the comparison table.

**Troubleshooting reference**
- New `## DGX Spark` section in `docs/reference/troubleshooting.md`
covering CoreDNS CrashLoop, `k3s` image-pull failure after a local
build, GPU passthrough (not CI-tested on Spark), `pip install`
system-packages errors, and the AI Workbench port 3000 conflict.
- Section header points to the Spark playbook for end-to-end
walkthroughs.
- Windows cross-links retargeted to
`../get-started/windows-preparation.md`.

**Platform matrix source of truth**
- `ci/platform-matrix.json` DGX Spark notes now link the Spark playbook.
README matrix regenerated.
- README Documentation table gains a Prerequisites row and updates the
pointer below the matrix to point at Prerequisites instead of a
now-removed hub.
- `scripts/generate-platform-docs.py` drops
`docs/get-started/quickstart.md` from its patch targets — the matrix
moved to `prerequisites.md` and is pulled via `{include}` from README,
so only README needs patching.
- `.pre-commit-config.yaml` `platform-matrix-sync` hook updated to match
(removed the quickstart path from both `git add` and `files`).

**Agent skill generator (`scripts/docs-to-skills.py`)**

Goal: skill files are self-contained. They never reference repository
paths like `../../../docs/…`, and they never couple the skill tree to
the repo layout.

- Rewrite precedence in `rewrite_doc_paths`:
1. If the link target maps to a generated skill, emit ``text (use the
`<skill>` skill)``.
2. Otherwise, if the target is a page inside `docs/`, emit an absolute
`[text](<html_baseurl><page>.html#frag)`.
3. Otherwise (target outside `docs/`, or no base URL), strip the
hyperlink and keep the link text. Self-containment wins over
navigability in the fallback.
- New `load_html_baseurl(docs_dir)` reads the `html_baseurl` assignment
from Sphinx's `conf.py` via `ast` (no `exec` side effects), so the
script is project-agnostic: any docs tree with that assignment works. A
warning is printed (and the script continues) if the base URL is
missing.
- Include placeholders (`> *Content included from … — see the original
doc for full text.*`) follow the same precedence: emit a published URL
if available, else drop the breadcrumb since the content is already
inlined.
- Fragment anchors are now preserved across the rewrite (the
pre-existing rewrite silently stripped them).
- Scope the `index.md` exclusion to the top-level `docs/index.md` only.
Subdirectory index pages (for example
`docs/get-started/platform-setup/index.md`) are legitimate hub content
and can now participate in skill grouping — which is what makes rule 1
above kick in for links to those hubs.
- Because emitted content no longer depends on where it is written, the
earlier `output_file` parameter plumbing and the multi-output-dir
invariant guard are gone. `output_dirs` can be any mirrors (symlinks,
bind mounts, independent trees) without caveat.

**Repo-root stub**
- `spark-install.md` rewritten as a short redirect paragraph pointing to
the NVIDIA Spark playbook and the Quickstart.

**Agent skills**
- `.agents/skills/` regenerated from the new doc tree.
`nemoclaw-user-get-started` now covers `prerequisites.md` +
`windows-preparation.md`; `nemoclaw-user-reference` picks up the revised
architecture, commands, and troubleshooting pages.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

The only non-doc code changes are to the skill-generator script
(`scripts/docs-to-skills.py`), the platform-matrix-generator target list
(`scripts/generate-platform-docs.py`), and the pre-commit hook entry.
Everything else is docs or generated docs.

## Verification
- [x] `check-docs.sh --only-links` phase 1 (local links) passes. Phase 2
flags `prerequisites.html` as "unreachable" because it is a new page not
yet published; it will resolve after this PR merges and the docs site
rebuilds.
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [x] New doc pages include SPDX header and frontmatter (new pages only)

Note on unchecked boxes: prek's CLI vitest suite surfaces pre-existing
environmental failures on my machine
(`nemoclaw/dist/blueprint/private-networks.js` missing and integration
tests that try to reach a live gateway). No source file under `src/`,
`nemoclaw/src/`, `bin/`, or `test/` is touched by this diff, so those
failures are unrelated. `make docs` wasn't run locally.

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a dedicated Prerequisites page with minimum/recommended
hardware, software versions, and platform notes.
* Added DGX Spark troubleshooting and an end-to-end local Ollama
walkthrough reference.

* **Documentation**
* Reorganized Get Started to surface Prerequisites and updated Windows
preparation wording.
* Reworked uninstall guidance to prefer the CLI `nemoclaw uninstall`
(local, version‑pinned) and document a hosted-script fallback and
comparison.
* Replaced a lengthy Spark install guide with an external playbook
pointer.

* **Reference**
* Expanded architecture and commands docs to clarify deployment
topology, credential flow, and uninstall behavior.

* **Chores**
* Updated docs generation/redirect tooling to emit published links and
target platform docs appropriately.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

The squash merge of NVIDIA#2351 skipped pre-commit hooks, leaving shfmt
formatting violations that broke the `main` workflow checks. This
applies `shfmt -w` to fix them.

## Test plan

- [x] `shfmt -d` produces no diff on both files

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added documentation for `nemoclaw onboard --from <Dockerfile>` option
for building sandbox images from custom Dockerfiles.
* Added documentation for deprecated CLI aliases `nemoclaw
start`/`nemoclaw stop` (use `nemoclaw tunnel start`/`nemoclaw tunnel
stop` instead).
* Added documentation for deprecated `nemoclaw setup` command (use
`nemoclaw onboard` instead).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…NVIDIA#2435)

## Summary

Tests in `install-preflight.test.ts` and
`install-openshell-version-check.test.ts` flake persistently under the
default 5s vitest timeout. Each test spawns `bash install.sh` — these
are integration tests, not unit tests, and belong in CI.

## Changes

- Set `{ timeout: 15_000 }` on the `describe` block in both test files
(keeps them passing when run explicitly)
- Add a new `installer-integration` vitest project enabled only by
`CI=1` or `NEMOCLAW_RUN_INSTALLER_TESTS=1`
- Exclude the two files from the `cli` project so pre-push hooks skip
them

Run explicitly: `npx vitest run --project installer-integration`

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npx prek run --all-files` passes
- [x] Both test files pass: 65/65 tests
- [x] No secrets, API keys, or credentials committed

---
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added explicit 15s timeouts to installer test suites to reduce flaky
runs.
* Moved installer-specific tests into a separate CI-gated project so
they run only in CI or when explicitly enabled, keeping regular test
runs lean.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Catches up the user-facing docs for changes merged since v0.0.24 and
bumps `docs/versions1.json` so the next docs build publishes as 0.0.25.

## Changes
- `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T
39.x) entry to reflect the new conditional `br_netfilter` auto-load on
R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution
from inside docker fails (corporate firewall)" section covering the new
container-DNS preflight probe and platform-specific remediation paths
(from NVIDIA#2349).
- `docs/get-started/quickstart.md`: added a "Review the Configuration
Before the Sandbox Build" subsection documenting the new `[Y/n]` review
gate that fires after the sandbox-name prompt (from NVIDIA#2221).
- `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted
0.0.24, and trimmed to ten entries to match what
`scripts/bump-version.ts` writes. `docs/project.json` regenerates from
`versions1.json` at build time and now reports `0.0.25`.
- `.agents/skills/nemoclaw-user-get-started/` and
`.agents/skills/nemoclaw-user-reference/`: regenerated via
`scripts/docs-to-skills.py`.

## Type of Change
- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [x] `npx prek run --all-files` passes
- [x] `npm test` passes
- [x] Tests added or updated for new or changed behavior
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Onboarding now shows a configuration review and requires explicit
confirmation before building a sandbox; non-interactive mode prints the
summary without prompting.

* **Documentation**
* Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and
conditional br_netfilter guidance.
* Added corporate-firewall DNS troubleshooting with platform-specific
remediation and a verification nslookup step.
  * Quickstart updated to document the new confirmation flow.

* **Chores**
  * Documentation version index updated.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…VIDIA#2471)

## Summary

- Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw
status` to hang indefinitely
- Restores the original `isSandboxGatewayRunning`,
`recoverSandboxProcesses`, `ensureSandboxPortForward`, and
`checkAndRecoverSandboxProcesses` implementations
- Leaves the dashboard-contract/health/recover modules in place
(onboard.ts depends on them)

## Root cause

PR NVIDIA#2398 replaced the simple gateway recovery path with
`recoverDashboardChain()` which introduced unbounded calls that hang in
CI. The bisect confirmed NVIDIA#2398 as the sole culprit:

| Run | Commit | Result |
|-----|--------|--------|
| Last good | `de97a00d` (Apr 24 16:06) | pass |
| Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** |
| First bad | `79c8e2a9` (Apr 25 00:10) | hang |

sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e
all hang at `nemoclaw <name> status` after NVIDIA#2398.

## What this reopens

- NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe
returns failure on 401. This needs a targeted fix, not a refactor.
- NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be
re-attempted with proper E2E validation.

## Test plan

- [ ] `npx tsc -p tsconfig.src.json --noEmit` passes
- [ ] Dashboard unit tests pass
- [ ] Nightly E2E: sandbox-survival, skip-permissions,
sandbox-operations, cloud-e2e pass

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved gateway health check reliability by using explicit status
outputs instead of HTTP code interpretation.
* Enhanced automatic recovery mechanism for sandbox gateway outages with
better restart and port forwarding restoration.
* Updated diagnostic logging to better reflect gateway restart and
connectivity restoration activities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…IA#2437)

<!-- markdownlint-disable MD041 -->
## Summary

The installer used to auto-attach `--resume` to `nemoclaw onboard`
whenever a non-complete session file existed on disk. A run that failed
at step 3 or 4 (for example, Ollama proxy unreachable) writes
`status="failed"` with `resumable=true`, which matched that predicate,
so re-running `curl | bash` silently replayed the same failed choice.
The user had no way to pick a different provider short of deleting
`~/.nemoclaw/onboard-session.json`.

Fixed by tightening the installer's session classifier and adding an
end-to-end `--fresh` opt-in across both the installer and the `nemoclaw
onboard` CLI.

## Related Issue

Resolves NVIDIA#2430.

## Changes

**Installer (`scripts/install.sh`)**

The install script now classifies the on-disk session into one of four
states and branches on it instead of treating "non-complete" as one
bucket:

- `resume` — `status === "in_progress"` only. Auto-attaches `--resume`
(the legitimate Ctrl-C / interrupted-mid-step case).
- `failed` — `status === "failed"` or a populated `failure` field. In a
TTY, prompts `[R/f]`; in non-interactive mode, refuses with a non-zero
exit so scripted re-runs cannot loop. The refusal message points at
`--fresh` and `nemoclaw onboard --resume` as the two explicit opt-ins.
- `skip` — complete / `resumable: false` / missing. No `--resume`
attached.
- `corrupt` — file exists but cannot be parsed, or status does not match
any value `onboard-session.ts` actually writes. Warns and starts fresh
rather than auto-resuming an unreadable file.

A new `--fresh` flag (and `NEMOCLAW_FRESH=1` env var) short-circuits the
classifier entirely and is forwarded to the downstream CLI invocation.

**CLI (`nemoclaw onboard --fresh`)**

`onboard-command.ts` now accepts `--fresh`, exposed through the parser,
the help text, and the `OnboardCommandOptions` type. When set,
`onboard.ts` calls `onboardSession.clearSession()` before `createSession
+ saveSession`, so the previous session file is unlinked outright rather
than overwritten in place. `--fresh` and `--resume` are mutually
exclusive, validated at both the parser and the runtime entry point.

**Docs**

New troubleshooting entry under *Onboarding* (`Previous onboarding
session failed`) explaining the interactive / non-interactive /
`--fresh` / `nemoclaw onboard --resume` paths and the
`~/.nemoclaw/onboard-session.json` last-resort delete.

**Tests**

- `test/install-preflight.test.ts`: two new cases — non-interactive
refusal of a failed session, and `--fresh` short-circuiting auto-resume
+ forwarding to the CLI.
- `src/lib/onboard-command.test.ts`: `--fresh` parsing and the
`--resume` / `--fresh` mutual-exclusion error.

## Type of Change

- [X] Code change (feature, bug fix, or refactor)
- [X] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
<!-- Check each item you ran and confirmed. Leave unchecked items you
skipped. -->
- [X] `npx prek run --all-files` passes
- [X] `npm test` passes
- [X] Tests added or updated for new or changed behavior
- [X] No secrets, API keys, or credentials committed
- [X] Docs updated for user-facing behavior changes
- [X] `make docs` builds without warnings (doc changes only)
- [X] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

## AI Disclosure
<!-- If an AI agent authored or co-authored this PR, check the box and
name the tool. Remove this section for fully human-authored PRs. -->
- [X] AI-assisted — tool: Claude Code<!-- e.g., Claude Code, Cursor,
GitHub Copilot -->

---
<!-- DCO sign-off required by CI. Run: git config user.name && git
config user.email -->
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a `--fresh` option to start onboarding without resuming previous
sessions; onboarding now avoids silent automatic retries after failures.
* Interactive runs prompt to resume (`R`/Enter) or discard (`f`) a
recorded failed session; non-interactive runs refuse to retry and exit
non-zero unless explicitly opted in.

* **Documentation**
* Help text and docs updated to explain session persistence, recovery
options, and how to use `--fresh` or `--resume`.

* **Tests**
* Added integration and unit tests covering fresh/resume behavior and
mutual-exclusion of options.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@ksapru ksapru force-pushed the fix/replace-race-condition-sleeps branch from c0b43c2 to 9507a85 Compare May 12, 2026 20:55
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa merged commit d4ae22d into NVIDIA:main May 13, 2026
54 of 58 checks passed
@ksapru

ksapru commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @cv, @ericksoa, and everyone who helped get this merged. Really appreciate the reviews and fixes.

Glad this landed, and I’ll keep the follow-ups in mind.

@wscurran wscurran added area: cli Command line interface, flags, terminal UX, or output feature PR adds or expands user-visible functionality area: performance Latency, throughput, resource use, benchmarks, or scaling and removed NemoClaw CLI feature PR adds or expands user-visible functionality labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: cli Command line interface, flags, terminal UX, or output area: performance Latency, throughput, resource use, benchmarks, or scaling refactor PR restructures code without intended behavior change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants