refactor(cli): replace race-condition sleeps with readiness probes by ksapru · Pull Request #2442 · NVIDIA/NemoClaw

ksapru · 2026-04-24T17:11:28Z

Summary

Replaced arbitrary sleep() delays with deterministic readiness probes (waitForPort, waitForHttp, waitUntil) across the NemoClaw CLI onboarding flow. This improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems.

Refs #2001

Changes

Added synchronous wait utilities (waitForPort, waitForHttp, waitUntil) to src/lib/wait.ts.
Replaced sleep(1) with waitForPort(OLLAMA_PROXY_PORT, 2) during proxy initialization.
Replaced sleep(2) with waitForHttp(http://127.0.0.1:${OLLAMA_PORT}/, 10) when starting the Ollama server.
Replaced sleep(5) with waitUntil polling CoreDNS readiness via kubectl get pods.
Replaced hardcoded retry loops and sleep(2) calls with waitUntil for applying and removing policy presets.

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
make docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

AI Disclosure

AI-assisted — tool: Antigravity

Signed-off-by: Krish Sapru ksapru@bu.edu

Summary by CodeRabbit

New Features
- Added active readiness checks for ports and HTTP endpoints to improve service startup detection.
Bug Fixes
- Replaced fixed delays with polling-based waits across startup and configuration flows, improving reliability and clearer error reporting on timeouts.
- Policy application/removal now retries intelligently for transient "sandbox not found" cases and fails fast for other errors.
Tests
- Specified an explicit timeout for the sandbox readiness test.

copy-pr-bot · 2026-04-24T17:11:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-04-24T17:11:35Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Replaces fixed sleep delays with polling-based readiness utilities. Adds waitForPort, waitForHttp, and waitUntil, and updates startup/retry flows (auth proxy, CoreDNS patching, Ollama serve, policy preset apply/remove) to poll for readiness and handle timeouts/errors accordingly.

Changes

Cohort / File(s)	Summary
Readiness Utilities `src/lib/wait.ts`	Adds `waitUntil(conditionFn, timeoutSeconds, pollIntervalMs)`, `waitForPort(port, timeoutSeconds)` (uses `nc`), and `waitForHttp(url, timeoutSeconds)` (uses `curl`) — synchronous polling helpers that return boolean on success/failure.
Startup & Recovery Flows `src/lib/onboard.ts`	Replaces numerous fixed `sleep()` calls with readiness polling: auth proxy startup now blocks until port reachable (returns/aborts on timeout), CoreDNS patch step polls kube-dns pod readiness (warns on timeout but continues), Ollama `serve` waits for HTTP root endpoint (errors out or returns to provider selection on timeout), and policy preset apply/remove use `waitUntil` with selective retry only for `sandbox not found` errors.
Tests `test/onboard.test.ts`	Specifies an explicit 10s Vitest timeout for one test (`it(..., { timeout: 10000 })`).

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Onboard CLI
    participant Proxy as Auth Proxy
    participant K8s as CoreDNS / kube-dns
    participant Ollama as Ollama Server
    participant Policy as Policy Service

    CLI->>Proxy: spawn auth-proxy process
    CLI->>Proxy: waitForPort(port)
    alt port reachable
        Proxy-->>CLI: success
    else timeout
        Proxy-->>CLI: log error, return/abort
    end

    CLI->>K8s: run fix-coredns.sh
    CLI->>K8s: waitUntil(kube-dns ready)
    alt ready
        K8s-->>CLI: continue
    else timeout
        K8s-->>CLI: warn, continue startup
    end

    CLI->>Ollama: start `ollama serve`
    CLI->>Ollama: waitForHttp(root URL)
    alt http OK
        Ollama-->>CLI: continue
    else timeout
        Ollama-->>CLI: log error, exit or return to selection
    end

    CLI->>Policy: apply/remove preset
    CLI->>Policy: waitUntil(apply/remove succeeds, retry only on "sandbox not found")
    alt success
        Policy-->>CLI: done
    else timeout or non-retry error
        Policy-->>CLI: fail/throw
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

🐇 A rabbit hums while cables wake,

Polling ports by moonlit lake,
No blind sleeps, just patient cheer,
Services rise when pathways clear,
Hops of code — startup is near! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title accurately describes the main change: replacing arbitrary sleep delays with readiness probes to eliminate race conditions.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…x flaky test timeout

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/lib/onboard.ts (1)

5631-5647: Extract shared retry wrapper for sandbox-not-found transient handling.

The same waitUntil + lastErr + sandbox not found pattern is repeated in three places. A small helper would reduce drift and simplify maintenance.

Also applies to: 6303-6321, 6326-6348

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5631 - 5647, Extract the repeated "waitUntil
+ lastErr + sandbox not found" retry logic into a small helper (e.g.,
retryUntilSandboxReady or runWithSandboxRetry) that accepts an action function
and retries using the existing waitUntil semantics; inside the helper call the
provided action (the body that currently calls policies.applyPreset(sandboxName,
name)), catch errors, store the last error, treat errors whose message includes
"sandbox not found" as transient (return false for retry) and re-throw other
errors, and finally re-throw lastErr if retries fail; then replace the three
duplicated blocks (the code around policies.applyPreset(...) and the similar
blocks at the other locations) with calls to this helper passing the appropriate
parameters (sandboxName, name or other action closures).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Line 2064: The waitForPort readiness checks are currently invoked without
checking their boolean result, allowing execution to continue on timeout; update
every call site of waitForPort (e.g., the calls in src/lib/onboard.ts where
OLLAMA_PROXY_PORT and the other ports are waited on) to handle failure: await
the boolean result and if it is false, log a clear error and abort the
onboarding flow (throw or return a failure status) instead of proceeding, so the
flow does not race ahead when services aren't ready. Ensure you apply this
pattern consistently to all occurrences of waitForPort in the file.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5631-5647: Extract the repeated "waitUntil + lastErr + sandbox not
found" retry logic into a small helper (e.g., retryUntilSandboxReady or
runWithSandboxRetry) that accepts an action function and retries using the
existing waitUntil semantics; inside the helper call the provided action (the
body that currently calls policies.applyPreset(sandboxName, name)), catch
errors, store the last error, treat errors whose message includes "sandbox not
found" as transient (return false for retry) and re-throw other errors, and
finally re-throw lastErr if retries fail; then replace the three duplicated
blocks (the code around policies.applyPreset(...) and the similar blocks at the
other locations) with calls to this helper passing the appropriate parameters
(sandboxName, name or other action closures).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a9fc8c1c-769e-4fdc-ae2a-a41cfca3a64b

📥 Commits

Reviewing files that changed from the base of the PR and between f7dff7b and 1961db8.

📒 Files selected for processing (3)

src/lib/onboard.ts
src/lib/wait.ts
test/onboard.test.ts

Readiness probes (waitForPort, waitForHttp, waitUntil) were returning boolean outcomes that were ignored at their call sites. This allowed the onboarding flow to race ahead even if services like Ollama or CoreDNS failed to become ready within their timeout windows. This adds explicit error handling to abort initialization or surface warnings if these components fail to start, matching the expected fail-fast behavior.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/lib/onboard.ts (1)

2098-2102: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure instead of only logging it.

On Lines 2098-2102 this helper returns early on timeout, but setupInference() still continues with the persisted token. That leaves onboarding configured for ollama-local even though nothing is listening on OLLAMA_PROXY_PORT.

Suggested fix

-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;

   const pid = loadPersistedProxyPid();
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }

   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
   ollamaProxyToken = token;
   spawnOllamaAuthProxy(token);
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    ollamaProxyToken = null;
+    return false;
   }
+  return true;
 }

if (!ensureOllamaAuthProxy()) {
  process.exit(1);
}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2098 - 2102, The current check using
waitForPort(OLLAMA_PROXY_PORT, 2) only logs and returns early, allowing
setupInference() to continue with a persisted token despite the proxy not
running; change the helper so the failure is propagated (either throw an error
or return a boolean) and ensure callers handle it—e.g., implement/replace with
ensureOllamaAuthProxy() that returns false on timeout (or throws), update the
caller of this check to call ensureOllamaAuthProxy() and call process.exit(1)
(or rethrow) when it returns false to prevent setupInference() from proceeding
without a running Ollama auth proxy.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3342-3348: The current coredns readiness check (corednsReady using
waitUntil and runCaptureOpenshell with "kubectl get pods" looking for the string
"coredns") only detects pod listing and can be true while the pod is not Ready;
move this check inside the branch that runs when shouldPatchCoredns(runtime) is
true (so it only runs when we actually patched CoreDNS), and change the kubectl
invocation to assert actual readiness (e.g., query pod status/conditions or
containerStatuses via "kubectl get pods -n kube-system -l <coredns-label> -o
json" or use jsonpath to ensure status.phase == Running and all
containerStatuses[].ready == true) and have waitUntil rely on that more precise
readiness condition instead of a substring match from runCaptureOpenshell.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2098-2102: The current check using waitForPort(OLLAMA_PROXY_PORT,
2) only logs and returns early, allowing setupInference() to continue with a
persisted token despite the proxy not running; change the helper so the failure
is propagated (either throw an error or return a boolean) and ensure callers
handle it—e.g., implement/replace with ensureOllamaAuthProxy() that returns
false on timeout (or throws), update the caller of this check to call
ensureOllamaAuthProxy() and call process.exit(1) (or rethrow) when it returns
false to prevent setupInference() from proceeding without a running Ollama auth
proxy.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f00ad4ce-2dfc-456b-92d6-6b88e596fdfe

📥 Commits

Reviewing files that changed from the base of the PR and between 1961db8 and ef67b43.

📒 Files selected for processing (2)

.agents/skills/nemoclaw-user-reference/references/commands.md
src/lib/onboard.ts

✅ Files skipped from review due to trivial changes (1)

.agents/skills/nemoclaw-user-reference/references/commands.md

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

src/lib/onboard.ts (2)
3341-3360: ⚠️ Potential issue | 🟠 Major

This CoreDNS probe can still pass before all pods are ready.

The combined jsonpath string only proves that at least one matched pod is Running, at least one container reported true, and no literal false appeared. If one CoreDNS pod is ready and another is still Pending with no containerStatuses, this returns true and the flow moves on too early. Parse the pod list structurally and require every matched pod to be Running with all containers ready, or switch to a direct kubectl wait --for=condition=Ready check.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3341 - 3360, The current coredns readiness
probe in corednsReady (using waitUntil and runCaptureOpenshell) can return true
prematurely because the jsonpath check only verifies at-least-one-running and
presence/absence of literal strings; change the probe to either (A) run kubectl
get pods -n kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and
ensure every item has .status.phase === "Running" and every container in
.status.containerStatuses has ready === true before returning success, or (B)
replace the check with a direct kubectl wait --for=condition=Ready pod -n
kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use
runCaptureOpenshell to evaluate its success; update corednsReady accordingly.
2098-2102: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure to the caller.

On Line 2098, ensureOllamaAuthProxy() only logs and returns, but it leaves ollamaProxyToken populated. setupInference() can then continue configuring ollama-local against a dead proxy, so resume/reconnect flows fail later and less clearly. Make this helper return a boolean like startOllamaAuthProxy() and abort the caller on false.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2098 - 2102, ensureOllamaAuthProxy currently
logs and returns on proxy start timeout but leaves ollamaProxyToken set,
allowing setupInference to proceed against a dead proxy; change
ensureOllamaAuthProxy to return a boolean (e.g., rename or implement as
startOllamaAuthProxy/ensureOllamaAuthProxy returning true on success, false on
failure), ensure it clears ollamaProxyToken when failing, and update callers
(notably setupInference) to abort/return early when the helper returns false so
the caller does not continue configuring ollama-local against a nonfunctional
proxy.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5665-5680: The waitUntil loop assumes
policies.applyPreset(sandboxName, name) either throws or succeeds, but
applyPreset can return false; update the waitUntil callback to capture the
return value of policies.applyPreset(sandboxName, name) (call it e.g. applied),
and if applied is falsy set lastErr = new Error(`applyPreset returned false for
preset ${name}`) and return false so the loop will retry and ultimately trigger
the existing failure path; keep the existing catch behavior for thrown errors
unchanged.

---

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3341-3360: The current coredns readiness probe in corednsReady
(using waitUntil and runCaptureOpenshell) can return true prematurely because
the jsonpath check only verifies at-least-one-running and presence/absence of
literal strings; change the probe to either (A) run kubectl get pods -n
kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and ensure
every item has .status.phase === "Running" and every container in
.status.containerStatuses has ready === true before returning success, or (B)
replace the check with a direct kubectl wait --for=condition=Ready pod -n
kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use
runCaptureOpenshell to evaluate its success; update corednsReady accordingly.
- Around line 2098-2102: ensureOllamaAuthProxy currently logs and returns on
proxy start timeout but leaves ollamaProxyToken set, allowing setupInference to
proceed against a dead proxy; change ensureOllamaAuthProxy to return a boolean
(e.g., rename or implement as startOllamaAuthProxy/ensureOllamaAuthProxy
returning true on success, false on failure), ensure it clears ollamaProxyToken
when failing, and update callers (notably setupInference) to abort/return early
when the helper returns false so the caller does not continue configuring
ollama-local against a nonfunctional proxy.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 21700ac2-f12a-4a5a-bd5f-6bab1ffa07bf

📥 Commits

Reviewing files that changed from the base of the PR and between ef67b43 and d3c5e5c.

📒 Files selected for processing (1)

src/lib/onboard.ts

coderabbitai

♻️ Duplicate comments (3)

src/lib/onboard.ts (3)
2314-2318: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure out of ensureOllamaAuthProxy().

Returning from this void helper only stops the helper. Line 5572 still continues with the persisted token, so onboarding can configure ollama-local even though the proxy never came back. Make this helper return a boolean and abort the caller on false.
Suggested fix
-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;
@@
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }
@@
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    return false;
   }
+
+  return true;
 }
if (!isWsl() && !ensureOllamaAuthProxy()) {
  process.exit(1);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2314 - 2318, The helper
ensureOllamaAuthProxy currently returns void and simply returns early when
waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing the caller to continue with a
persisted token; change ensureOllamaAuthProxy to return a boolean (true on
success, false on failure), have the waitForPort failure branch return false
instead of void, and update callers (e.g., the code path that checks isWsl() and
later configures ollama-local) to check the boolean result and abort/exit when
false to prevent proceeding without a ready proxy.
3589-3611: ⚠️ Potential issue | 🟠 Major

This CoreDNS predicate can still pass before all pods are ready.

The aggregated substring check succeeds for states like Running true Pending—one old pod ready, one replacement pod still pending—because there's no literal false to trip the guard. That reopens the DNS race window, and the recovery path at Lines 3690-3709 drops the timeout result entirely. Parse pod JSON per item and require every matched pod to be Running with all containerStatuses[].ready === true, then surface timeout consistently in both call sites.

Also applies to: 3690-3709

6047-6062: ⚠️ Potential issue | 🟠 Major

Handle applyPreset() returning false in the non-interactive path.

waitUntil() currently treats any non-throwing applyPreset() call as success. If applyPreset() returns false, this loop exits immediately and onboarding reports success even though the preset was never applied. syncPresetSelection() already handles this correctly later in the file; this path should match it.
Suggested fix
       let lastErr: Error | null = null;
       const success = waitUntil(() => {
         try {
-          policies.applyPreset(sandboxName, name);
+          if (!policies.applyPreset(sandboxName, name)) {
+            lastErr = new Error(`applyPreset returned false for preset '${name}'.`);
+            return false;
+          }
           return true;
         } catch (err) {
           lastErr = err as Error;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 6047 - 6062, The waitUntil predicate
currently treats any non-throwing call to policies.applyPreset(sandboxName,
name) as success; change it so you call policies.applyPreset(...) and inspect
its boolean return: return true only when the call returns true, and when it
returns false set lastErr to an Error describing that the preset was not applied
for the given sandboxName/name and return false so the loop continues; keep the
existing catch-path that captures thrown errors and rethrows non-"sandbox not
found" errors, and preserve the final if (!success && lastErr) throw lastErr
behavior to surface failures (this makes the non-interactive path match
syncPresetSelection).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2314-2318: The helper ensureOllamaAuthProxy currently returns void
and simply returns early when waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing
the caller to continue with a persisted token; change ensureOllamaAuthProxy to
return a boolean (true on success, false on failure), have the waitForPort
failure branch return false instead of void, and update callers (e.g., the code
path that checks isWsl() and later configures ollama-local) to check the boolean
result and abort/exit when false to prevent proceeding without a ready proxy.
- Around line 6047-6062: The waitUntil predicate currently treats any
non-throwing call to policies.applyPreset(sandboxName, name) as success; change
it so you call policies.applyPreset(...) and inspect its boolean return: return
true only when the call returns true, and when it returns false set lastErr to
an Error describing that the preset was not applied for the given
sandboxName/name and return false so the loop continues; keep the existing
catch-path that captures thrown errors and rethrows non-"sandbox not found"
errors, and preserve the final if (!success && lastErr) throw lastErr behavior
to surface failures (this makes the non-interactive path match
syncPresetSelection).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da63b43f-6a28-4d43-8dc9-f2f503a3effc

📥 Commits

Reviewing files that changed from the base of the PR and between d3c5e5c and be73a0b.

📒 Files selected for processing (2)

src/lib/onboard.ts
test/onboard.test.ts

✅ Files skipped from review due to trivial changes (1)

test/onboard.test.ts

coderabbitai

♻️ Duplicate comments (2)

src/lib/onboard.ts (2)

2310-2318: ⚠️ Potential issue | 🟠 Major

Propagate proxy restart failure instead of silently continuing.

ensureOllamaAuthProxy() still can't tell its caller that restart failed. On Line 2312 it publishes the persisted token before the new proxy is verified, and this path never checks that the spawned PID is actually ollama-auth-proxy.js. If another process is already listening on :${OLLAMA_PROXY_PORT}, waitForPort() returns true and setupInference() continues with a stale token against the wrong service. Return a boolean here, verify the spawned PID like startOllamaAuthProxy(), and abort the Ollama setup when it returns false.

Proposed fix

-function ensureOllamaAuthProxy(): void {
+function ensureOllamaAuthProxy(): boolean {
   // Try to load persisted token first — if none, this isn't an Ollama setup.
   const token = loadPersistedProxyToken();
-  if (!token) return;
+  if (!token) return true;

   const pid = loadPersistedProxyPid();
   if (isOllamaProxyProcess(pid)) {
     ollamaProxyToken = token;
-    return;
+    return true;
   }

   // Proxy not running — restart it with the persisted token.
   killStaleProxy();
-  ollamaProxyToken = token;
-  spawnOllamaAuthProxy(token);
+  const restartedPid = spawnOllamaAuthProxy(token);
   if (!waitForPort(OLLAMA_PROXY_PORT, 2)) {
     console.error(`  Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`);
-    // Void function, return early
-    return;
+    return false;
   }
+  if (!isOllamaProxyProcess(restartedPid)) {
+    console.error(`  Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`);
+    return false;
+  }
+  ollamaProxyToken = token;
+  return true;
 }

-      ensureOllamaAuthProxy();
+      if (!ensureOllamaAuthProxy()) {
+        process.exit(1);
+      }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 2310 - 2318, ensureOllamaAuthProxy()
currently publishes ollamaProxyToken and spawns a proxy without verifying the
new process or port owner; change it to return a boolean indicating success,
move assignment of ollamaProxyToken until after verification, and replace the
spawn/verify sequence with a call to startOllamaAuthProxy() (which should return
the spawned PID or process identity) so you can check the PID/process matches
the expected ollama-auth-proxy; only set ollamaProxyToken and return true if
waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches
the expected binary, otherwise kill any started process, log the failure, return
false so setupInference() can abort when ensureOllamaAuthProxy() returns false.

3589-3611: ⚠️ Potential issue | 🟠 Major

Tighten the CoreDNS readiness predicate before treating the patch as complete.

These probes can still pass too early. The current check only requires some "Running" and some "true" with no "false", so a rollout with one ready pod and another Pending pod that has no containerStatuses yet will be treated as ready. In the recovery path, the timeout result is also discarded entirely. Parse the pod JSON and require every matched pod to be Running with every container ready === true, then handle timeout consistently in both call sites.

Proposed fix

+function areCorednsPodsReady(): boolean {
+  const raw = runCaptureOpenshell(
+    [
+      "doctor",
+      "exec",
+      "--",
+      "kubectl",
+      "get",
+      "pods",
+      "-n",
+      "kube-system",
+      "-l",
+      "k8s-app=kube-dns",
+      "-o",
+      "json",
+    ],
+    { ignoreError: true },
+  );
+  try {
+    const parsed = JSON.parse(raw || "{}");
+    const items = Array.isArray(parsed.items) ? parsed.items : [];
+    return (
+      items.length > 0 &&
+      items.every((item) => {
+        const statuses = Array.isArray(item?.status?.containerStatuses)
+          ? item.status.containerStatuses
+          : [];
+        return (
+          item?.status?.phase === "Running" &&
+          statuses.length > 0 &&
+          statuses.every((status) => status?.ready === true)
+        );
+      })
+    );
+  } catch {
+    return false;
+  }
+}

-    const corednsReady = waitUntil(() => {
-      const check = runCaptureOpenshell(...);
-      return check.includes("Running") && !check.includes("false") && check.includes("true");
-    }, 10);
+    const corednsReady = waitUntil(areCorednsPodsReady, 10);
     if (!corednsReady) {
       console.warn("  CoreDNS did not report ready within timeout; continuing may cause DNS flakiness.");
     }

-        waitUntil(() => {
-          const check = runCaptureOpenshell(...);
-          return check.includes("Running") && !check.includes("false") && check.includes("true");
-        }, 10);
+        const corednsReady = waitUntil(areCorednsPodsReady, 10);
+        if (!corednsReady) {
+          console.warn("  CoreDNS did not report ready within timeout during recovery; continuing may cause DNS flakiness.");
+        }

Also applies to: 3690-3709

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 3589 - 3611, The current CoreDNS readiness
predicate (corednsReady) uses string checks on runCaptureOpenshell output and
can falsely report ready; update the predicate in waitUntil to call kubectl with
'-o json' (via runCaptureOpenshell), JSON.parse the output, and require that
every returned pod has status.phase === "Running" and that every
pod.status.containerStatuses exists and every containerStatus.ready === true;
ensure waitUntil returns false on timeout and use that false value consistently
at both call sites (the corednsReady check and the duplicate site) to log a
warning or abort as appropriate rather than discarding the timeout result.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2310-2318: ensureOllamaAuthProxy() currently publishes
ollamaProxyToken and spawns a proxy without verifying the new process or port
owner; change it to return a boolean indicating success, move assignment of
ollamaProxyToken until after verification, and replace the spawn/verify sequence
with a call to startOllamaAuthProxy() (which should return the spawned PID or
process identity) so you can check the PID/process matches the expected
ollama-auth-proxy; only set ollamaProxyToken and return true if
waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches
the expected binary, otherwise kill any started process, log the failure, return
false so setupInference() can abort when ensureOllamaAuthProxy() returns false.
- Around line 3589-3611: The current CoreDNS readiness predicate (corednsReady)
uses string checks on runCaptureOpenshell output and can falsely report ready;
update the predicate in waitUntil to call kubectl with '-o json' (via
runCaptureOpenshell), JSON.parse the output, and require that every returned pod
has status.phase === "Running" and that every pod.status.containerStatuses
exists and every containerStatus.ready === true; ensure waitUntil returns false
on timeout and use that false value consistently at both call sites (the
corednsReady check and the duplicate site) to log a warning or abort as
appropriate rather than discarding the timeout result.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 67ec12f5-c28b-43bc-b699-7eb1fda0245c

📥 Commits

Reviewing files that changed from the base of the PR and between be73a0b and 761374d.

📒 Files selected for processing (1)

src/lib/onboard.ts

wscurran · 2026-04-27T15:21:34Z

✨ Thanks for submitting this pull request that proposes a way to refactor the NemoClaw CLI by replacing arbitrary sleep delays with readiness probes. This change improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems.

cv

Love it!

…recover modules (NVIDIA#2398) ## Summary Extract the scattered dashboard delivery chain logic — URL derivation, CORS origins, port forwarding target, health probing, and recovery — from the monolith files (onboard.ts, nemoclaw.ts) into three focused modules that serve as the single source of truth for dashboard reachability. Fixes NVIDIA#2342 false "Offline" reading when device auth is enabled. ## Related Issue Fixes NVIDIA#2390 Fixes NVIDIA#2342 ## Changes - **New `src/lib/dashboard-contract.ts`**: Pure `buildChain()` function — single source of truth for all dashboard config (access URL, CORS origins, forward target, health endpoint, port, bind address). No `process.env` reads. - **New `src/lib/dashboard-health.ts`**: `verifyDashboardChain()` checks all 3 delivery chain links (gateway, forward, CORS) with injected deps. Accepts HTTP 200 and 401 as "alive" (fixes NVIDIA#2342). - **New `src/lib/dashboard-recover.ts`**: `recoverDashboardChain()` — link-aware, idempotent recovery that only fixes broken links (gateway → forward, CORS diagnose-only). - **Deleted `src/lib/dashboard.ts`** — replaced by dashboard-contract.ts. - **Deleted 7 dashboard helper functions from `src/lib/onboard.ts`** (~90 lines removed): `getDashboardForwardPort`, `getDashboardForwardTarget`, `getDashboardForwardStartCommand`, `buildAuthenticatedDashboardUrl`, `getWslHostAddress`, `getDashboardAccessInfo`, `getDashboardGuidanceLines`. - **Rewritten `ensureDashboardForward`** in onboard.ts to use `buildChain()` instead of re-deriving config. - **Fixed `isSandboxGatewayRunning()`** in nemoclaw.ts to probe `/health` and accept 401 as alive. - **Deleted `recoverSandboxProcesses()` and `ensureSandboxPortForward()`** from nemoclaw.ts — `checkAndRecoverSandboxProcesses()` delegates to `recoverDashboardChain()`. - **Updated `test/onboard.test.ts`** — imports from new modules, assertions rewritten for contract API. - **52 new tests** across 3 test files (32 contract + 13 health + 7 recovery). ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Improved dashboard health checks with automated, link-aware recovery and better WSL URL/forwarding support. * **Bug Fixes** * More accurate gateway liveness detection and more reliable port/forwarding behavior across loopback, IPv6, and remote hosts. * **Tests** * Added comprehensive test suites covering dashboard configuration, health verification, and recovery flows. * **Refactor** * Consolidated dashboard URL/forwarding logic into a dedicated contract for clearer, consistent behavior.  Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

…VIDIA#2343) (NVIDIA#2408) ## Summary Follow-up to NVIDIA#2356. The adversarial PR review found that `procps` tools (`ps`, `top`, `free`, `uptime`, `vmstat`) were still missing inside a real sandbox runtime even after NVIDIA#2356 added `procps` to `Dockerfile.base`. Root cause: the GHCR base image may not have been rebuilt yet, and the `Dockerfile` hardening step (`apt-get autoremove --purge`) could strip it. This PR adds a three-layer defense in the production `Dockerfile`: - **`apt-mark manual procps`** before the autoremove step, so apt never considers it auto-removable - **Conditional `apt-get install procps`** if `ps` is still missing after hardening (handles stale GHCR base images that predate NVIDIA#2356) - **Build-time `ps --version`** smoke test that fails the Docker build if procps didn't make it through Plus regression tests: - Static guards in `sandbox-provisioning.test.ts` verifying both Dockerfiles contain procps provisions - Runtime E2E test in `e2e-test.sh` verifying all five procps tools are executable inside the sandbox container ## Test plan - [ ] `hadolint Dockerfile` passes - [ ] `hadolint Dockerfile.base` passes - [ ] `vitest run test/sandbox-provisioning.test.ts` passes - [ ] CI `test-e2e-sandbox` job passes (runs `e2e-test.sh` inside container, including new test 11) - [ ] Docker build succeeds with both fresh base image and stale GHCR base 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Tests** * Added end-to-end checks that verify common system debug tools (ps, top, free, uptime, vmstat) run in the sandbox. * Added regression tests ensuring Docker provisioning includes the debug tooling. * **Chores** * Improved Docker provisioning to ensure procps/debug tools are present at runtime, with a fallback install path for older base images.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>

@ts-nocheck

This PR completes the repo-wide type-safety cleanup by removing the remaining `// @ts-nocheck` markers and tightening broad escape hatches across the CLI, plugin, blueprint, scripts, and tests. It replaces `any`/`unknown`/assertion- heavy paths with narrower local domains while keeping the full correctness gate green. - remove all remaining `// @ts-nocheck` markers and retire explicit `any`, `Record<string, unknown>`, type assertions, and non-null assertions that were still inflating the hotspot report - tighten high-churn CLI/onboarding paths in `src/lib/onboard.ts`, `src/lib/onboard-session.ts`, `src/lib/runner.ts`, `src/lib/credentials.ts`, `src/lib/deploy.ts`, and related helpers with narrower persisted-session, errno, config, and subprocess types - tighten plugin and blueprint parsing paths in `nemoclaw/src/blueprint/runner.ts`, `snapshot.ts`, `state.ts`, `onboard/config.ts`, and `commands/migration-state.ts` with concrete source shapes around YAML and JSON boundaries - update scripts and tests, including `scripts/type-safety-hotspots.ts`, `scripts/bump-version.ts`, and many `test/*.ts` files, so the stricter typing is exercised without weakening lint, TypeScript, or test gates - reduce the tracked type-safety hotspots from `59 @ts-nocheck / 22 explicit any / 83 record unknown / 229 type assertions / 50 non-null assertions / 251 unknowns` to `0 / 0 / 0 / 0 / 0 / 32` - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) - [x] AI-assisted — tool: pi coding agent --- Signed-off-by: Carlos Villela <cvillela@nvidia.com>  * **Improvements** * Stronger runtime validation and safer parsing across configs, snapshots, and state—reduces crashes on corrupt or malformed files and yields clearer error messages. * Tighter input handling for plugins, onboarding, and runtime commands to improve reliability and predictability. * **Bug Fixes** * More robust network/proxy and TLS handling during websocket/https upgrades to avoid misrouted requests and fallback safely. * **Tests** * Expanded and hardened test coverage to prevent regressions and improve stability.

) (NVIDIA#2446) ## Summary Create a typed command registry (`src/lib/command-registry.ts`) as the single source of truth for all 46 NemoClaw CLI commands. Derive help output, dispatch arrays, and docs-parity checking from it, replacing three independent hand-maintained sources. Re-enable the `cloud-experimental-e2e` nightly job now that the check-docs parity check passes reliably. ## Related Issue Fixes NVIDIA#2388 ## Changes - **New `src/lib/command-registry.ts`**: Defines `CommandDef` interface, `CommandGroup` union type, and `COMMANDS` array with all 46 CLI commands (23 global + 23 sandbox). Exports helper functions: `globalCommands`, `sandboxCommands`, `visibleCommands`, `commandsByGroup`, `canonicalUsageList`, `globalCommandTokens`, `sandboxActionTokens`. - **New `src/lib/command-registry.test.ts`**: 21 unit tests covering all registry invariants — counts, deduplication, ordering, token extraction, group membership. - **Rewritten `help()` in `src/nemoclaw.ts`**: Now iterates `commandsByGroup()` from registry instead of hand-written console.log lines. Reconfiguration section rewritten with bullet points to prevent phantom `nemoclaw`-prefixed parser matches. - **Derived dispatch arrays**: `GLOBAL_COMMANDS` now calls `globalCommandTokens()` and `sandboxActions` now calls `sandboxActionTokens()` — no more hand-maintained sets/arrays. - **Added `--dump-commands` internal flag**: Prints canonical command list for check-docs.sh parity checks, replacing the fragile perl help-parsing pipeline. - **Simplified `check-docs.sh`**: Phase 1 now uses `--dump-commands` instead of perl regex extraction. Phase 2 strips `<arg>` and `[optional]` placeholders from doc headings for clean comparison. - **Added missing doc headings**: `nemoclaw onboard --from`, `nemoclaw setup`, `nemoclaw start`, `nemoclaw stop` in `docs/reference/commands.md`. - **Re-enabled `cloud-experimental-e2e`**: Removed DISABLED comment, `CLOUD_EXPERIMENTAL_E2E_ENABLED` variable gate, and `SKIP_CHECK_DOCS` flag. - **Updated `test/image-cleanup.test.ts`**: Adapted to verify `gc` via registry source instead of regex-matching the old `GLOBAL_COMMANDS` Set literal. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [x] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Documentation** * Added `nemoclaw onboard --from <Dockerfile>` usage guide and documented deprecated aliases (`nemoclaw start`, `nemoclaw stop`, `nemoclaw setup`) with warnings. * **New Features** * Added a developer-facing `--dump-commands` option to emit the canonical command list. * **Improvements** * Overhauled help/command organization for clearer grouping and more accurate, stable command listings.  --------- Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>

) Add tokenFormat/tokenFormatHint and appTokenFormat/appTokenFormatHint to ChannelDef interface. Wire format checks into setupMessagingChannels() so invalid tokens are rejected at the prompt instead of silently saved and surfacing as cryptic auth failures after sandbox build. - Slack bot token: ^xoxb-[A-Za-z0-9_-]+$ - Slack app token: ^xapp-[A-Za-z0-9_-]+$ - Non-interactive mode intentionally unchanged (env-var tokens assumed intentional) - Future channels can opt in by adding tokenFormat/appTokenFormat to their definition Closes NVIDIA#1912 Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

…VIDIA#2447) ## Summary Follow-up to PR NVIDIA#2422 (`refactor(types): tighten repo-wide type boundaries`). Addresses the review warnings and suggestions from triage. ## Changes ### 1. Shared `isErrnoException` / `isPermissionError` — `src/lib/errno.ts` PR NVIDIA#2422 removed all `@ts-nocheck` markers, which required each module to define its own `isErrnoException` + `ErrnoLike` type locally. This resulted in 6 slightly different copies across `config-io`, `credentials`, `http-probe`, `onboard`, `onboard-session`, and `registry`. This PR extracts a single canonical version into `src/lib/errno.ts` that: - Accepts `unknown` so callers never need an `instanceof Error` pre-cast - Checks for both `code` and `errno` properties (covering all Node.js error shapes) - Includes a convenience `isPermissionError` for the common EACCES/EPERM guard This also simplifies many catch blocks that previously did `error instanceof Error && isErrnoException(error)` — the redundant `instanceof Error` check is now handled inside the shared function. ### 2. Shared JSON recursive types — `src/lib/json-types.ts` PR NVIDIA#2422 introduced identical Scalar/Value/Object type triples in multiple modules: - `LooseScalar` / `LooseValue` / `LooseObject` in `onboard.ts` and `agent-onboard.ts` - `PolicyScalar` / `PolicyValue` / `PolicyObject` in `policies.ts` - `SessionJsonPrimitive` / `SessionJsonValue` / `UnknownRecord` in `onboard-session.ts` This PR extracts canonical `JsonScalar` / `JsonValue` / `JsonObject` types into `src/lib/json-types.ts`. Each module re-aliases them under its domain names for readability. > **Note:** The plugin types (`PluginScalar`/`PluginValue`/`PluginRecord` in `nemoclaw/src/index.ts`) are intentionally kept in-place because the plugin compiles separately from the CLI and cannot share imports. ### 3. `Reflect.get` / `Reflect.set` convention comment Adds a clarifying comment at the first usage site (`usage-notice.ts`) explaining why `Reflect.get` is preferred over `as Record<…>` casts throughout the codebase — it avoids widening the target type and satisfies `no-unsafe-member-access` lint rules. ### 4. `ws-proxy-fix.ts` behavioral guard documentation Documents the `!host` null guard in the Discord WebSocket tunnel path, explaining that the previous code would have attempted a tunnel with an undefined host (which would fail downstream anyway). The guard now explicitly falls through to the original `https.request` unchanged. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `tsc -p tsconfig.cli.json --noEmit` passes - [x] `npm test` passes (only pre-existing failures in install-preflight and ssrf-parity) - [x] Tests added for new `errno.ts` module (15 test cases) - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) ## AI Disclosure - [x] AI-assisted — tool: pi coding agent --- Signed-off-by: Jess Yaunches <jyaunches@nvidia.com>  ## Summary by CodeRabbit * **Bug Fixes** * Improved WebSocket proxy handling to avoid creating tunnels when the request target host is undefined. * **Tests** * Added tests validating errno/permission detection utilities. * **Documentation** * Added explanatory comment clarifying property-access rationale. * **Chores** * Centralized JSON-type aliases and errno helpers for consistent error handling across modules.  --------- Signed-off-by: Jess Yaunches <jyaunches@nvidia.com>

…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>  ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations.  Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary Prevents the class of regression that required NVIDIA#2458. Adds an unconditional `typecheck:cli` step to the shared CI action so future hook-filter drift can't hide a tsc error. ## Related Follow-up to NVIDIA#2458 (which fixed the symptom — this addresses the root cause). ## Root cause recap - `.pre-commit-config.yaml` had `files: ^(bin|scripts)/` on the `tsc-cli` hook. - PRs that touched only `test/` or `src/` never triggered the check. - NVIDIA#2130 (test-only Slack validation) slipped through that gap. - NVIDIA#2422 later turned on `strict: true`, and the latent errors surfaced — but for every downstream PR, not the PR that introduced them. - Six open PRs ended up blocked on the same error cluster before anyone noticed. ## Changes `.github/actions/basic-checks/action.yaml` — new **Typecheck CLI + tests (strict)** step running `npm run typecheck:cli` unconditionally. Independent of prek hook configuration, so future filter drift can't hide a tsc error from CI. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] YAML parses - [x] `npm run typecheck:cli` exits 0 on this branch (requires NVIDIA#2458 merged to main first) - [x] Tests added or updated for new or changed behavior (N/A — infra change, existing suite covers) - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com>  ## Summary by CodeRabbit * **Chores** * Enhanced continuous integration pipeline with additional TypeScript typechecking to improve code quality assurance and detect type-related issues earlier in the development process.  Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

NVIDIA#2436) ## Summary - Adds `lstatSync`-based symlink rejection to `createSnapshot()` and `rollbackFromSnapshot()` in `nemoclaw/src/blueprint/snapshot.ts`, closing two attack surfaces reported by an external contributor doing Yocto integration: 1. **`createSnapshot()`** calls `cpSync(OPENCLAW_DIR, dest)` with no `lstatSync` check on the source — if `~/.openclaw` is a symlink (e.g. to `/etc`), `cpSync` follows it and copies arbitrary files into the snapshot 2. **`rollbackFromSnapshot()`** calls `cpSync(source, OPENCLAW_DIR)` without verifying the destination isn't a symlink — an attacker-planted symlink could redirect restored content to an arbitrary directory - Updates `collectFiles()` to detect symlinks via `isSymbolicLink()` and record them in the snapshot manifest (previously symlinks were silently omitted) - Follows the `rejectSymlinksOnPath()` pattern established in PR NVIDIA#2290 for the `~/.nemoclaw` credentials directory - **Also guards the CLI code path** (`src/lib/sandbox-state.ts::backupSandboxState()`) that `nemoclaw <sandbox> snapshot create` actually uses — the blueprint helper is a separate implementation not called from the CLI - `rollbackFromSnapshot()` symlink rejection returns `false` (preserving the boolean-return contract) rather than throwing ### Severity Low-Medium. Both attacks require the attacker to already have user-level write access to `$HOME` to plant the symlink, and the sandbox's Landlock / `chattr +i` / read-only mount protections prevent in-sandbox exploitation. However, this is the same symlink attack class fixed in NVIDIA#2290 (different code paths: `cpSync` + directory walker vs. `mkdirSync`), and defense-in-depth warrants closing the gap. ## Test plan - [x] New test: `createSnapshot` rejects when `~/.openclaw` is a symlink - [x] New test: `createSnapshot` rejects when an ancestor of `~/.nemoclaw` is a symlink - [x] New test: `createSnapshot` records symlinks in manifest when present in tree - [x] New test: `rollbackFromSnapshot` returns false (not throws) when `~/.openclaw` is a symlink - [x] All existing snapshot tests pass (no regressions) - [x] Full plugin test suite: 362 passed - [x] CLI test suite: 1959/1961 passed (2 pre-existing flaky installer timeouts) - [x] TypeScript build clean (`tsc --noEmit`) - [x] All pre-commit and pre-push linters/formatters pass 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **New Features** * Snapshots now detect and record symbolic links in the snapshot manifest and refuse snapshot creation or restoration when critical config directories or their ancestor paths are symlinks. * Backup destination paths are validated to prevent writing into locations with symlinked ancestor paths. * **Tests** * Added tests covering symlink detection, manifest recording, rejection scenarios, and rollback behavior when config paths are symlinks.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…VIDIA#2440) ## Summary - When `CHAT_UI_URL` specifies a custom port, `NEMOCLAW_DASHBOARD_PORT` was only injected into the sandbox if the host env var was explicitly set, causing the gateway to listen on the wrong port - Always derive the effective port from `chatUiUrl` via `getDashboardForwardPort()` and inject it unconditionally - Bump vitest timeouts on flaky openshell gateway tests Closes NVIDIA#2267 ## Test plan - [x] All 2302 existing tests pass - [ ] Verify sandbox starts with correct `NEMOCLAW_DASHBOARD_PORT` when `CHAT_UI_URL` uses a non-default port (e.g. `:18790`) - [ ] Verify default port (`18789`) still works when `CHAT_UI_URL` is unset 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Bug Fixes** * Dashboard forwarding now consistently derives and injects the dashboard port from CHAT_UI_URL (including custom ports), preventing mismatches that could block dashboard access. * **New Features** * Deterministic dashboard port derivation from CHAT_UI_URL for sandbox startup and health checks. * **Documentation** * Updated onboarding, quickstart, reference, and troubleshooting guides and examples to prefer CHAT_UI_URL with auto-derived ports. * **Tests** * Added/adjusted tests and extended timeouts to validate port injection and improve CLI/onboard reliability.  --------- Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Ubuntu <ubuntu@nemoclaw-f6a691-inst-3cs11f3ytmmzubymi5nzdncnwia.us-west1-a.c.brevdevprod.internal> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>

…DIA#2445) ## Summary Reorganize the Get Started section, document the full runtime topology in the Architecture reference, and clarify how to uninstall NemoClaw. Prerequisites now live on their own page (with the tested-platform matrix included directly from the README), Windows pre-setup is renamed to `windows-preparation.md`, and DGX Spark readers are pointed to the NVIDIA Spark playbook rather than an in-repo tutorial. ## Changes **Get Started reorganization** - New `docs/get-started/prerequisites.md`. Extracts Hardware, Software, and the tested-platform matrix from Quickstart into its own page. The matrix is pulled in via Sphinx `{include}` from the README so there is a single source of truth. - `docs/get-started/windows-setup.md` → `docs/get-started/windows-preparation.md` (renamed to "Prepare Windows for NemoClaw"). Nested under Prerequisites in the toctree. - `docs/get-started/quickstart.md` no longer duplicates the Prerequisites section; it links to the new page instead. - `docs/index.md` Get Started toctree is now `Prerequisites` then `Quickstart`. - `docs/conf.py` redirects `get-started/windows-setup` → `windows-preparation.html` so the pre-reorg URL still resolves. **Architecture reference: Deployment Topology** - New `## Deployment Topology` section in `docs/reference/architecture.md` with a mermaid diagram showing the runtime substrate — host CLI → Docker daemon → OpenShell gateway container → embedded k3s cluster → sandbox pod → OpenClaw agent + plugin — plus the inference router's role on the path to external providers. - Per-layer role table (host CLI, Docker daemon, gateway container, k3s, sandbox pod, inference router). - Points DGX Spark readers to the [NVIDIA Spark playbook](https://build.nvidia.com/spark/nemoclaw) for the Spark-specific variant. **Uninstall documentation** - `docs/reference/commands.md` `nemoclaw uninstall` section corrects a drift bug (the CLI does *not* auto-fall-back to a remote script; it prints the versioned URL for manual review) and adds a comparison table between `nemoclaw uninstall` and the hosted `curl … | bash` form. - README and Quickstart Uninstall sections now lead with `nemoclaw uninstall` (version-pinned, no network fetch) and keep the `curl … | bash` form as a fallback for when the CLI is missing or broken. Both link to the comparison table. **Troubleshooting reference** - New `## DGX Spark` section in `docs/reference/troubleshooting.md` covering CoreDNS CrashLoop, `k3s` image-pull failure after a local build, GPU passthrough (not CI-tested on Spark), `pip install` system-packages errors, and the AI Workbench port 3000 conflict. - Section header points to the Spark playbook for end-to-end walkthroughs. - Windows cross-links retargeted to `../get-started/windows-preparation.md`. **Platform matrix source of truth** - `ci/platform-matrix.json` DGX Spark notes now link the Spark playbook. README matrix regenerated. - README Documentation table gains a Prerequisites row and updates the pointer below the matrix to point at Prerequisites instead of a now-removed hub. - `scripts/generate-platform-docs.py` drops `docs/get-started/quickstart.md` from its patch targets — the matrix moved to `prerequisites.md` and is pulled via `{include}` from README, so only README needs patching. - `.pre-commit-config.yaml` `platform-matrix-sync` hook updated to match (removed the quickstart path from both `git add` and `files`). **Agent skill generator (`scripts/docs-to-skills.py`)** Goal: skill files are self-contained. They never reference repository paths like `../../../docs/…`, and they never couple the skill tree to the repo layout. - Rewrite precedence in `rewrite_doc_paths`: 1. If the link target maps to a generated skill, emit ``text (use the `<skill>` skill)``. 2. Otherwise, if the target is a page inside `docs/`, emit an absolute `[text](<html_baseurl><page>.html#frag)`. 3. Otherwise (target outside `docs/`, or no base URL), strip the hyperlink and keep the link text. Self-containment wins over navigability in the fallback. - New `load_html_baseurl(docs_dir)` reads the `html_baseurl` assignment from Sphinx's `conf.py` via `ast` (no `exec` side effects), so the script is project-agnostic: any docs tree with that assignment works. A warning is printed (and the script continues) if the base URL is missing. - Include placeholders (`> *Content included from … — see the original doc for full text.*`) follow the same precedence: emit a published URL if available, else drop the breadcrumb since the content is already inlined. - Fragment anchors are now preserved across the rewrite (the pre-existing rewrite silently stripped them). - Scope the `index.md` exclusion to the top-level `docs/index.md` only. Subdirectory index pages (for example `docs/get-started/platform-setup/index.md`) are legitimate hub content and can now participate in skill grouping — which is what makes rule 1 above kick in for links to those hubs. - Because emitted content no longer depends on where it is written, the earlier `output_file` parameter plumbing and the multi-output-dir invariant guard are gone. `output_dirs` can be any mirrors (symlinks, bind mounts, independent trees) without caveat. **Repo-root stub** - `spark-install.md` rewritten as a short redirect paragraph pointing to the NVIDIA Spark playbook and the Quickstart. **Agent skills** - `.agents/skills/` regenerated from the new doc tree. `nemoclaw-user-get-started` now covers `prerequisites.md` + `windows-preparation.md`; `nemoclaw-user-reference` picks up the revised architecture, commands, and troubleshooting pages. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) The only non-doc code changes are to the skill-generator script (`scripts/docs-to-skills.py`), the platform-matrix-generator target list (`scripts/generate-platform-docs.py`), and the pre-commit hook entry. Everything else is docs or generated docs. ## Verification - [x] `check-docs.sh --only-links` phase 1 (local links) passes. Phase 2 flags `prerequisites.html` as "unreachable" because it is a new page not yet published; it will resolve after this PR merges and the docs site rebuilds. - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [x] New doc pages include SPDX header and frontmatter (new pages only) Note on unchecked boxes: prek's CLI vitest suite surfaces pre-existing environmental failures on my machine (`nemoclaw/dist/blueprint/private-networks.js` missing and integration tests that try to reach a live gateway). No source file under `src/`, `nemoclaw/src/`, `bin/`, or `test/` is touched by this diff, so those failures are unrelated. `make docs` wasn't run locally. ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Added a dedicated Prerequisites page with minimum/recommended hardware, software versions, and platform notes. * Added DGX Spark troubleshooting and an end-to-end local Ollama walkthrough reference. * **Documentation** * Reorganized Get Started to surface Prerequisites and updated Windows preparation wording. * Reworked uninstall guidance to prefer the CLI `nemoclaw uninstall` (local, version‑pinned) and document a hosted-script fallback and comparison. * Replaced a lengthy Spark install guide with an external playbook pointer. * **Reference** * Expanded architecture and commands docs to clarify deployment topology, credential flow, and uninstall behavior. * **Chores** * Updated docs generation/redirect tooling to emit published links and target platform docs appropriately.  --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary The squash merge of NVIDIA#2351 skipped pre-commit hooks, leaving shfmt formatting violations that broke the `main` workflow checks. This applies `shfmt -w` to fix them. ## Test plan - [x] `shfmt -d` produces no diff on both files  ## Summary by CodeRabbit * **Documentation** * Added documentation for `nemoclaw onboard --from <Dockerfile>` option for building sandbox images from custom Dockerfiles. * Added documentation for deprecated CLI aliases `nemoclaw start`/`nemoclaw stop` (use `nemoclaw tunnel start`/`nemoclaw tunnel stop` instead). * Added documentation for deprecated `nemoclaw setup` command (use `nemoclaw onboard` instead).  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…NVIDIA#2435) ## Summary Tests in `install-preflight.test.ts` and `install-openshell-version-check.test.ts` flake persistently under the default 5s vitest timeout. Each test spawns `bash install.sh` — these are integration tests, not unit tests, and belong in CI. ## Changes - Set `{ timeout: 15_000 }` on the `describe` block in both test files (keeps them passing when run explicitly) - Add a new `installer-integration` vitest project enabled only by `CI=1` or `NEMOCLAW_RUN_INSTALLER_TESTS=1` - Exclude the two files from the `cli` project so pre-push hooks skip them Run explicitly: `npx vitest run --project installer-integration` ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx prek run --all-files` passes - [x] Both test files pass: 65/65 tests - [x] No secrets, API keys, or credentials committed --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com>  ## Summary by CodeRabbit * **Tests** * Added explicit 15s timeouts to installer test suites to reduce flaky runs. * Moved installer-specific tests into a separate CI-gated project so they run only in CI or when explicitly enabled, keeping regular test runs lean.  --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary Catches up the user-facing docs for changes merged since v0.0.24 and bumps `docs/versions1.json` so the next docs build publishes as 0.0.25. ## Changes - `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T 39.x) entry to reflect the new conditional `br_netfilter` auto-load on R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution from inside docker fails (corporate firewall)" section covering the new container-DNS preflight probe and platform-specific remediation paths (from NVIDIA#2349). - `docs/get-started/quickstart.md`: added a "Review the Configuration Before the Sandbox Build" subsection documenting the new `[Y/n]` review gate that fires after the sandbox-name prompt (from NVIDIA#2221). - `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted 0.0.24, and trimmed to ten entries to match what `scripts/bump-version.ts` writes. `docs/project.json` regenerates from `versions1.json` at build time and now reports `0.0.25`. - `.agents/skills/nemoclaw-user-get-started/` and `.agents/skills/nemoclaw-user-reference/`: regenerated via `scripts/docs-to-skills.py`. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Onboarding now shows a configuration review and requires explicit confirmation before building a sandbox; non-interactive mode prints the summary without prompting. * **Documentation** * Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and conditional br_netfilter guidance. * Added corporate-firewall DNS troubleshooting with platform-specific remediation and a verification nslookup step. * Quickstart updated to document the new confirmation flow. * **Chores** * Documentation version index updated.  --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…VIDIA#2471) ## Summary - Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw status` to hang indefinitely - Restores the original `isSandboxGatewayRunning`, `recoverSandboxProcesses`, `ensureSandboxPortForward`, and `checkAndRecoverSandboxProcesses` implementations - Leaves the dashboard-contract/health/recover modules in place (onboard.ts depends on them) ## Root cause PR NVIDIA#2398 replaced the simple gateway recovery path with `recoverDashboardChain()` which introduced unbounded calls that hang in CI. The bisect confirmed NVIDIA#2398 as the sole culprit: | Run | Commit | Result | |-----|--------|--------| | Last good | `de97a00d` (Apr 24 16:06) | pass | | Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** | | First bad | `79c8e2a9` (Apr 25 00:10) | hang | sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at `nemoclaw <name> status` after NVIDIA#2398. ## What this reopens - NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe returns failure on 401. This needs a targeted fix, not a refactor. - NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be re-attempted with proper E2E validation. ## Test plan - [ ] `npx tsc -p tsconfig.src.json --noEmit` passes - [ ] Dashboard unit tests pass - [ ] Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass  ## Summary by CodeRabbit * **Bug Fixes** * Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation. * Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration. * Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities.

…IA#2437)  ## Summary The installer used to auto-attach `--resume` to `nemoclaw onboard` whenever a non-complete session file existed on disk. A run that failed at step 3 or 4 (for example, Ollama proxy unreachable) writes `status="failed"` with `resumable=true`, which matched that predicate, so re-running `curl | bash` silently replayed the same failed choice. The user had no way to pick a different provider short of deleting `~/.nemoclaw/onboard-session.json`. Fixed by tightening the installer's session classifier and adding an end-to-end `--fresh` opt-in across both the installer and the `nemoclaw onboard` CLI. ## Related Issue Resolves NVIDIA#2430. ## Changes **Installer (`scripts/install.sh`)** The install script now classifies the on-disk session into one of four states and branches on it instead of treating "non-complete" as one bucket: - `resume` — `status === "in_progress"` only. Auto-attaches `--resume` (the legitimate Ctrl-C / interrupted-mid-step case). - `failed` — `status === "failed"` or a populated `failure` field. In a TTY, prompts `[R/f]`; in non-interactive mode, refuses with a non-zero exit so scripted re-runs cannot loop. The refusal message points at `--fresh` and `nemoclaw onboard --resume` as the two explicit opt-ins. - `skip` — complete / `resumable: false` / missing. No `--resume` attached. - `corrupt` — file exists but cannot be parsed, or status does not match any value `onboard-session.ts` actually writes. Warns and starts fresh rather than auto-resuming an unreadable file. A new `--fresh` flag (and `NEMOCLAW_FRESH=1` env var) short-circuits the classifier entirely and is forwarded to the downstream CLI invocation. **CLI (`nemoclaw onboard --fresh`)** `onboard-command.ts` now accepts `--fresh`, exposed through the parser, the help text, and the `OnboardCommandOptions` type. When set, `onboard.ts` calls `onboardSession.clearSession()` before `createSession + saveSession`, so the previous session file is unlinked outright rather than overwritten in place. `--fresh` and `--resume` are mutually exclusive, validated at both the parser and the runtime entry point. **Docs** New troubleshooting entry under *Onboarding* (`Previous onboarding session failed`) explaining the interactive / non-interactive / `--fresh` / `nemoclaw onboard --resume` paths and the `~/.nemoclaw/onboard-session.json` last-resort delete. **Tests** - `test/install-preflight.test.ts`: two new cases — non-interactive refusal of a failed session, and `--fresh` short-circuiting auto-resume + forwarding to the CLI. - `src/lib/onboard-command.test.ts`: `--fresh` parsing and the `--resume` / `--fresh` mutual-exclusion error. ## Type of Change - [X] Code change (feature, bug fix, or refactor) - [X] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification  - [X] `npx prek run --all-files` passes - [X] `npm test` passes - [X] Tests added or updated for new or changed behavior - [X] No secrets, API keys, or credentials committed - [X] Docs updated for user-facing behavior changes - [X] `make docs` builds without warnings (doc changes only) - [X] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure  - [X] AI-assisted — tool: Claude Code ---  Signed-off-by: Tinson Lai <tinsonl@nvidia.com>  ## Summary by CodeRabbit * **New Features** * Added a `--fresh` option to start onboarding without resuming previous sessions; onboarding now avoids silent automatic retries after failures. * Interactive runs prompt to resume (`R`/Enter) or discard (`f`) a recorded failed session; non-interactive runs refuse to retry and exit non-zero unless explicitly opted in. * **Documentation** * Help text and docs updated to explain session persistence, recovery options, and how to use `--fresh` or `--resume`. * **Tests** * Added integration and unit tests covering fresh/resume behavior and mutual-exclusion of options.  --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

…obes

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ksapru · 2026-05-14T15:54:31Z

Thanks @cv, @ericksoa, and everyone who helped get this merged. Really appreciate the reviews and fixes.

Glad this landed, and I’ll keep the follow-ups in mind.

refactor(cli): replace race-condition sleeps with readiness probes

c6414af

fix(cli): bubble up errors when policy preset wait loop times out, fi…

1961db8

…x flaky test timeout

ksapru force-pushed the fix/replace-race-condition-sleeps branch from b248aa5 to 1961db8 Compare April 24, 2026 19:56

ksapru marked this pull request as ready for review April 24, 2026 19:57

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

fix(cli): add strict readiness check

d3c5e5c

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread src/lib/onboard.ts Outdated

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

wscurran added NemoClaw CLI refactor PR restructures code without intended behavior change labels Apr 27, 2026

cv closed this May 12, 2026

cv reopened this May 12, 2026

ksapru force-pushed the fix/replace-race-condition-sleeps branch from bf098d1 to c0b43c2 Compare May 12, 2026 16:18

cv approved these changes May 12, 2026

View reviewed changes

cv added the v0.0.40 label May 12, 2026

ksapru force-pushed the fix/replace-race-condition-sleeps branch from 76c9220 to c0b43c2 Compare May 12, 2026 20:49

jyaunches and others added 7 commits May 12, 2026 16:54

cjagwani and others added 11 commits May 12, 2026 16:54

fix(cli): resolve merge conflicts and improve onboarding readiness pr…

1e69692

…obes

refactor(cli): extract streamGatewayStart to reduce onboard.ts size

9507a85

ksapru force-pushed the fix/replace-race-condition-sleeps branch from c0b43c2 to 9507a85 Compare May 12, 2026 20:55

Merge current main into fix/replace-race-condition-sleeps

98b0b30

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>

ericksoa merged commit d4ae22d into NVIDIA:main May 13, 2026
54 of 58 checks passed

This was referenced May 13, 2026

fix(onboard): probe Ollama proxy reachability from sandbox network on Linux #3465

Closed

perf(policy): batch onboarding preset application #3502

Merged

cv mentioned this pull request May 14, 2026

perf: investigate and reduce networking latency during onboard and validation #2001

Closed

9 tasks

This was referenced May 17, 2026

Ggg #3666

Closed

fix: resolve silent output and remove nc dependency #3669

Closed

Conversation

ksapru commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Type of Change

Verification

AI Disclosure

Summary by CodeRabbit

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

wscurran commented Apr 27, 2026

Uh oh!

cv left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ksapru commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

ksapru commented Apr 24, 2026 •

edited

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading