refactor(cli): replace race-condition sleeps with readiness probes#2442
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces fixed sleep delays with polling-based readiness utilities. Adds Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as Onboard CLI
participant Proxy as Auth Proxy
participant K8s as CoreDNS / kube-dns
participant Ollama as Ollama Server
participant Policy as Policy Service
CLI->>Proxy: spawn auth-proxy process
CLI->>Proxy: waitForPort(port)
alt port reachable
Proxy-->>CLI: success
else timeout
Proxy-->>CLI: log error, return/abort
end
CLI->>K8s: run fix-coredns.sh
CLI->>K8s: waitUntil(kube-dns ready)
alt ready
K8s-->>CLI: continue
else timeout
K8s-->>CLI: warn, continue startup
end
CLI->>Ollama: start `ollama serve`
CLI->>Ollama: waitForHttp(root URL)
alt http OK
Ollama-->>CLI: continue
else timeout
Ollama-->>CLI: log error, exit or return to selection
end
CLI->>Policy: apply/remove preset
CLI->>Policy: waitUntil(apply/remove succeeds, retry only on "sandbox not found")
alt success
Policy-->>CLI: done
else timeout or non-retry error
Policy-->>CLI: fail/throw
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~40 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
…x flaky test timeout
b248aa5 to
1961db8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
5631-5647: Extract shared retry wrapper for sandbox-not-found transient handling.The same
waitUntil + lastErr + sandbox not foundpattern is repeated in three places. A small helper would reduce drift and simplify maintenance.Also applies to: 6303-6321, 6326-6348
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 5631 - 5647, Extract the repeated "waitUntil + lastErr + sandbox not found" retry logic into a small helper (e.g., retryUntilSandboxReady or runWithSandboxRetry) that accepts an action function and retries using the existing waitUntil semantics; inside the helper call the provided action (the body that currently calls policies.applyPreset(sandboxName, name)), catch errors, store the last error, treat errors whose message includes "sandbox not found" as transient (return false for retry) and re-throw other errors, and finally re-throw lastErr if retries fail; then replace the three duplicated blocks (the code around policies.applyPreset(...) and the similar blocks at the other locations) with calls to this helper passing the appropriate parameters (sandboxName, name or other action closures).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 2064: The waitForPort readiness checks are currently invoked without
checking their boolean result, allowing execution to continue on timeout; update
every call site of waitForPort (e.g., the calls in src/lib/onboard.ts where
OLLAMA_PROXY_PORT and the other ports are waited on) to handle failure: await
the boolean result and if it is false, log a clear error and abort the
onboarding flow (throw or return a failure status) instead of proceeding, so the
flow does not race ahead when services aren't ready. Ensure you apply this
pattern consistently to all occurrences of waitForPort in the file.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 5631-5647: Extract the repeated "waitUntil + lastErr + sandbox not
found" retry logic into a small helper (e.g., retryUntilSandboxReady or
runWithSandboxRetry) that accepts an action function and retries using the
existing waitUntil semantics; inside the helper call the provided action (the
body that currently calls policies.applyPreset(sandboxName, name)), catch
errors, store the last error, treat errors whose message includes "sandbox not
found" as transient (return false for retry) and re-throw other errors, and
finally re-throw lastErr if retries fail; then replace the three duplicated
blocks (the code around policies.applyPreset(...) and the similar blocks at the
other locations) with calls to this helper passing the appropriate parameters
(sandboxName, name or other action closures).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a9fc8c1c-769e-4fdc-ae2a-a41cfca3a64b
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/wait.tstest/onboard.test.ts
Readiness probes (waitForPort, waitForHttp, waitUntil) were returning boolean outcomes that were ignored at their call sites. This allowed the onboarding flow to race ahead even if services like Ollama or CoreDNS failed to become ready within their timeout windows. This adds explicit error handling to abort initialization or surface warnings if these components fail to start, matching the expected fail-fast behavior.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
2098-2102:⚠️ Potential issue | 🟠 MajorPropagate proxy restart failure instead of only logging it.
On Lines 2098-2102 this helper returns early on timeout, but
setupInference()still continues with the persisted token. That leaves onboarding configured forollama-localeven though nothing is listening onOLLAMA_PROXY_PORT.Suggested fix
-function ensureOllamaAuthProxy(): void { +function ensureOllamaAuthProxy(): boolean { // Try to load persisted token first — if none, this isn't an Ollama setup. const token = loadPersistedProxyToken(); - if (!token) return; + if (!token) return true; const pid = loadPersistedProxyPid(); if (isOllamaProxyProcess(pid)) { ollamaProxyToken = token; - return; + return true; } // Proxy not running — restart it with the persisted token. killStaleProxy(); ollamaProxyToken = token; spawnOllamaAuthProxy(token); if (!waitForPort(OLLAMA_PROXY_PORT, 2)) { console.error(` Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`); - // Void function, return early - return; + ollamaProxyToken = null; + return false; } + return true; }if (!ensureOllamaAuthProxy()) { process.exit(1); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2098 - 2102, The current check using waitForPort(OLLAMA_PROXY_PORT, 2) only logs and returns early, allowing setupInference() to continue with a persisted token despite the proxy not running; change the helper so the failure is propagated (either throw an error or return a boolean) and ensure callers handle it—e.g., implement/replace with ensureOllamaAuthProxy() that returns false on timeout (or throws), update the caller of this check to call ensureOllamaAuthProxy() and call process.exit(1) (or rethrow) when it returns false to prevent setupInference() from proceeding without a running Ollama auth proxy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3342-3348: The current coredns readiness check (corednsReady using
waitUntil and runCaptureOpenshell with "kubectl get pods" looking for the string
"coredns") only detects pod listing and can be true while the pod is not Ready;
move this check inside the branch that runs when shouldPatchCoredns(runtime) is
true (so it only runs when we actually patched CoreDNS), and change the kubectl
invocation to assert actual readiness (e.g., query pod status/conditions or
containerStatuses via "kubectl get pods -n kube-system -l <coredns-label> -o
json" or use jsonpath to ensure status.phase == Running and all
containerStatuses[].ready == true) and have waitUntil rely on that more precise
readiness condition instead of a substring match from runCaptureOpenshell.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2098-2102: The current check using waitForPort(OLLAMA_PROXY_PORT,
2) only logs and returns early, allowing setupInference() to continue with a
persisted token despite the proxy not running; change the helper so the failure
is propagated (either throw an error or return a boolean) and ensure callers
handle it—e.g., implement/replace with ensureOllamaAuthProxy() that returns
false on timeout (or throws), update the caller of this check to call
ensureOllamaAuthProxy() and call process.exit(1) (or rethrow) when it returns
false to prevent setupInference() from proceeding without a running Ollama auth
proxy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f00ad4ce-2dfc-456b-92d6-6b88e596fdfe
📒 Files selected for processing (2)
.agents/skills/nemoclaw-user-reference/references/commands.mdsrc/lib/onboard.ts
✅ Files skipped from review due to trivial changes (1)
- .agents/skills/nemoclaw-user-reference/references/commands.md
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
3341-3360:⚠️ Potential issue | 🟠 MajorThis CoreDNS probe can still pass before all pods are ready.
The combined
jsonpathstring only proves that at least one matched pod isRunning, at least one container reportedtrue, and no literalfalseappeared. If one CoreDNS pod is ready and another is stillPendingwith nocontainerStatuses, this returnstrueand the flow moves on too early. Parse the pod list structurally and require every matched pod to beRunningwith all containers ready, or switch to a directkubectl wait --for=condition=Readycheck.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 3341 - 3360, The current coredns readiness probe in corednsReady (using waitUntil and runCaptureOpenshell) can return true prematurely because the jsonpath check only verifies at-least-one-running and presence/absence of literal strings; change the probe to either (A) run kubectl get pods -n kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and ensure every item has .status.phase === "Running" and every container in .status.containerStatuses has ready === true before returning success, or (B) replace the check with a direct kubectl wait --for=condition=Ready pod -n kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use runCaptureOpenshell to evaluate its success; update corednsReady accordingly.
2098-2102:⚠️ Potential issue | 🟠 MajorPropagate proxy restart failure to the caller.
On Line 2098,
ensureOllamaAuthProxy()only logs and returns, but it leavesollamaProxyTokenpopulated.setupInference()can then continue configuringollama-localagainst a dead proxy, so resume/reconnect flows fail later and less clearly. Make this helper return abooleanlikestartOllamaAuthProxy()and abort the caller onfalse.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2098 - 2102, ensureOllamaAuthProxy currently logs and returns on proxy start timeout but leaves ollamaProxyToken set, allowing setupInference to proceed against a dead proxy; change ensureOllamaAuthProxy to return a boolean (e.g., rename or implement as startOllamaAuthProxy/ensureOllamaAuthProxy returning true on success, false on failure), ensure it clears ollamaProxyToken when failing, and update callers (notably setupInference) to abort/return early when the helper returns false so the caller does not continue configuring ollama-local against a nonfunctional proxy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 5665-5680: The waitUntil loop assumes
policies.applyPreset(sandboxName, name) either throws or succeeds, but
applyPreset can return false; update the waitUntil callback to capture the
return value of policies.applyPreset(sandboxName, name) (call it e.g. applied),
and if applied is falsy set lastErr = new Error(`applyPreset returned false for
preset ${name}`) and return false so the loop will retry and ultimately trigger
the existing failure path; keep the existing catch behavior for thrown errors
unchanged.
---
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 3341-3360: The current coredns readiness probe in corednsReady
(using waitUntil and runCaptureOpenshell) can return true prematurely because
the jsonpath check only verifies at-least-one-running and presence/absence of
literal strings; change the probe to either (A) run kubectl get pods -n
kube-system -l k8s-app=kube-dns -o json, parse the returned JSON and ensure
every item has .status.phase === "Running" and every container in
.status.containerStatuses has ready === true before returning success, or (B)
replace the check with a direct kubectl wait --for=condition=Ready pod -n
kube-system -l k8s-app=kube-dns --timeout=<sensible timeout> and use
runCaptureOpenshell to evaluate its success; update corednsReady accordingly.
- Around line 2098-2102: ensureOllamaAuthProxy currently logs and returns on
proxy start timeout but leaves ollamaProxyToken set, allowing setupInference to
proceed against a dead proxy; change ensureOllamaAuthProxy to return a boolean
(e.g., rename or implement as startOllamaAuthProxy/ensureOllamaAuthProxy
returning true on success, false on failure), ensure it clears ollamaProxyToken
when failing, and update callers (notably setupInference) to abort/return early
when the helper returns false so the caller does not continue configuring
ollama-local against a nonfunctional proxy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 21700ac2-f12a-4a5a-bd5f-6bab1ffa07bf
📒 Files selected for processing (1)
src/lib/onboard.ts
There was a problem hiding this comment.
♻️ Duplicate comments (3)
src/lib/onboard.ts (3)
2314-2318:⚠️ Potential issue | 🟠 MajorPropagate proxy restart failure out of
ensureOllamaAuthProxy().Returning from this
voidhelper only stops the helper. Line 5572 still continues with the persisted token, so onboarding can configureollama-localeven though the proxy never came back. Make this helper return abooleanand abort the caller onfalse.Suggested fix
-function ensureOllamaAuthProxy(): void { +function ensureOllamaAuthProxy(): boolean { // Try to load persisted token first — if none, this isn't an Ollama setup. const token = loadPersistedProxyToken(); - if (!token) return; + if (!token) return true; @@ if (isOllamaProxyProcess(pid)) { ollamaProxyToken = token; - return; + return true; } @@ if (!waitForPort(OLLAMA_PROXY_PORT, 2)) { console.error(` Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`); - // Void function, return early - return; + return false; } + + return true; }if (!isWsl() && !ensureOllamaAuthProxy()) { process.exit(1); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2314 - 2318, The helper ensureOllamaAuthProxy currently returns void and simply returns early when waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing the caller to continue with a persisted token; change ensureOllamaAuthProxy to return a boolean (true on success, false on failure), have the waitForPort failure branch return false instead of void, and update callers (e.g., the code path that checks isWsl() and later configures ollama-local) to check the boolean result and abort/exit when false to prevent proceeding without a ready proxy.
3589-3611:⚠️ Potential issue | 🟠 MajorThis CoreDNS predicate can still pass before all pods are ready.
The aggregated substring check succeeds for states like
Running true Pending—one old pod ready, one replacement pod still pending—because there's no literalfalseto trip the guard. That reopens the DNS race window, and the recovery path at Lines 3690-3709 drops the timeout result entirely. Parse pod JSON per item and require every matched pod to beRunningwith allcontainerStatuses[].ready === true, then surface timeout consistently in both call sites.Also applies to: 3690-3709
6047-6062:⚠️ Potential issue | 🟠 MajorHandle
applyPreset()returningfalsein the non-interactive path.
waitUntil()currently treats any non-throwingapplyPreset()call as success. IfapplyPreset()returnsfalse, this loop exits immediately and onboarding reports success even though the preset was never applied.syncPresetSelection()already handles this correctly later in the file; this path should match it.Suggested fix
let lastErr: Error | null = null; const success = waitUntil(() => { try { - policies.applyPreset(sandboxName, name); + if (!policies.applyPreset(sandboxName, name)) { + lastErr = new Error(`applyPreset returned false for preset '${name}'.`); + return false; + } return true; } catch (err) { lastErr = err as Error;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 6047 - 6062, The waitUntil predicate currently treats any non-throwing call to policies.applyPreset(sandboxName, name) as success; change it so you call policies.applyPreset(...) and inspect its boolean return: return true only when the call returns true, and when it returns false set lastErr to an Error describing that the preset was not applied for the given sandboxName/name and return false so the loop continues; keep the existing catch-path that captures thrown errors and rethrows non-"sandbox not found" errors, and preserve the final if (!success && lastErr) throw lastErr behavior to surface failures (this makes the non-interactive path match syncPresetSelection).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2314-2318: The helper ensureOllamaAuthProxy currently returns void
and simply returns early when waitForPort(OLLAMA_PROXY_PORT, 2) fails, allowing
the caller to continue with a persisted token; change ensureOllamaAuthProxy to
return a boolean (true on success, false on failure), have the waitForPort
failure branch return false instead of void, and update callers (e.g., the code
path that checks isWsl() and later configures ollama-local) to check the boolean
result and abort/exit when false to prevent proceeding without a ready proxy.
- Around line 6047-6062: The waitUntil predicate currently treats any
non-throwing call to policies.applyPreset(sandboxName, name) as success; change
it so you call policies.applyPreset(...) and inspect its boolean return: return
true only when the call returns true, and when it returns false set lastErr to
an Error describing that the preset was not applied for the given
sandboxName/name and return false so the loop continues; keep the existing
catch-path that captures thrown errors and rethrows non-"sandbox not found"
errors, and preserve the final if (!success && lastErr) throw lastErr behavior
to surface failures (this makes the non-interactive path match
syncPresetSelection).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: da63b43f-6a28-4d43-8dc9-f2f503a3effc
📒 Files selected for processing (2)
src/lib/onboard.tstest/onboard.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/onboard.test.ts
There was a problem hiding this comment.
♻️ Duplicate comments (2)
src/lib/onboard.ts (2)
2310-2318:⚠️ Potential issue | 🟠 MajorPropagate proxy restart failure instead of silently continuing.
ensureOllamaAuthProxy()still can't tell its caller that restart failed. On Line 2312 it publishes the persisted token before the new proxy is verified, and this path never checks that the spawned PID is actuallyollama-auth-proxy.js. If another process is already listening on:${OLLAMA_PROXY_PORT},waitForPort()returns true andsetupInference()continues with a stale token against the wrong service. Return a boolean here, verify the spawned PID likestartOllamaAuthProxy(), and abort the Ollama setup when it returns false.Proposed fix
-function ensureOllamaAuthProxy(): void { +function ensureOllamaAuthProxy(): boolean { // Try to load persisted token first — if none, this isn't an Ollama setup. const token = loadPersistedProxyToken(); - if (!token) return; + if (!token) return true; const pid = loadPersistedProxyPid(); if (isOllamaProxyProcess(pid)) { ollamaProxyToken = token; - return; + return true; } // Proxy not running — restart it with the persisted token. killStaleProxy(); - ollamaProxyToken = token; - spawnOllamaAuthProxy(token); + const restartedPid = spawnOllamaAuthProxy(token); if (!waitForPort(OLLAMA_PROXY_PORT, 2)) { console.error(` Error: Ollama auth proxy did not become ready on :${OLLAMA_PROXY_PORT} within timeout.`); - // Void function, return early - return; + return false; } + if (!isOllamaProxyProcess(restartedPid)) { + console.error(` Error: Ollama auth proxy failed to restart on :${OLLAMA_PROXY_PORT}`); + return false; + } + ollamaProxyToken = token; + return true; }- ensureOllamaAuthProxy(); + if (!ensureOllamaAuthProxy()) { + process.exit(1); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 2310 - 2318, ensureOllamaAuthProxy() currently publishes ollamaProxyToken and spawns a proxy without verifying the new process or port owner; change it to return a boolean indicating success, move assignment of ollamaProxyToken until after verification, and replace the spawn/verify sequence with a call to startOllamaAuthProxy() (which should return the spawned PID or process identity) so you can check the PID/process matches the expected ollama-auth-proxy; only set ollamaProxyToken and return true if waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches the expected binary, otherwise kill any started process, log the failure, return false so setupInference() can abort when ensureOllamaAuthProxy() returns false.
3589-3611:⚠️ Potential issue | 🟠 MajorTighten the CoreDNS readiness predicate before treating the patch as complete.
These probes can still pass too early. The current check only requires some
"Running"and some"true"with no"false", so a rollout with one ready pod and anotherPendingpod that has nocontainerStatusesyet will be treated as ready. In the recovery path, the timeout result is also discarded entirely. Parse the pod JSON and require every matched pod to beRunningwith every containerready === true, then handle timeout consistently in both call sites.Proposed fix
+function areCorednsPodsReady(): boolean { + const raw = runCaptureOpenshell( + [ + "doctor", + "exec", + "--", + "kubectl", + "get", + "pods", + "-n", + "kube-system", + "-l", + "k8s-app=kube-dns", + "-o", + "json", + ], + { ignoreError: true }, + ); + try { + const parsed = JSON.parse(raw || "{}"); + const items = Array.isArray(parsed.items) ? parsed.items : []; + return ( + items.length > 0 && + items.every((item) => { + const statuses = Array.isArray(item?.status?.containerStatuses) + ? item.status.containerStatuses + : []; + return ( + item?.status?.phase === "Running" && + statuses.length > 0 && + statuses.every((status) => status?.ready === true) + ); + }) + ); + } catch { + return false; + } +}- const corednsReady = waitUntil(() => { - const check = runCaptureOpenshell(...); - return check.includes("Running") && !check.includes("false") && check.includes("true"); - }, 10); + const corednsReady = waitUntil(areCorednsPodsReady, 10); if (!corednsReady) { console.warn(" CoreDNS did not report ready within timeout; continuing may cause DNS flakiness."); }- waitUntil(() => { - const check = runCaptureOpenshell(...); - return check.includes("Running") && !check.includes("false") && check.includes("true"); - }, 10); + const corednsReady = waitUntil(areCorednsPodsReady, 10); + if (!corednsReady) { + console.warn(" CoreDNS did not report ready within timeout during recovery; continuing may cause DNS flakiness."); + }Also applies to: 3690-3709
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/onboard.ts` around lines 3589 - 3611, The current CoreDNS readiness predicate (corednsReady) uses string checks on runCaptureOpenshell output and can falsely report ready; update the predicate in waitUntil to call kubectl with '-o json' (via runCaptureOpenshell), JSON.parse the output, and require that every returned pod has status.phase === "Running" and that every pod.status.containerStatuses exists and every containerStatus.ready === true; ensure waitUntil returns false on timeout and use that false value consistently at both call sites (the corednsReady check and the duplicate site) to log a warning or abort as appropriate rather than discarding the timeout result.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 2310-2318: ensureOllamaAuthProxy() currently publishes
ollamaProxyToken and spawns a proxy without verifying the new process or port
owner; change it to return a boolean indicating success, move assignment of
ollamaProxyToken until after verification, and replace the spawn/verify sequence
with a call to startOllamaAuthProxy() (which should return the spawned PID or
process identity) so you can check the PID/process matches the expected
ollama-auth-proxy; only set ollamaProxyToken and return true if
waitForPort(OLLAMA_PROXY_PORT, ...) succeeds and the spawned PID/process matches
the expected binary, otherwise kill any started process, log the failure, return
false so setupInference() can abort when ensureOllamaAuthProxy() returns false.
- Around line 3589-3611: The current CoreDNS readiness predicate (corednsReady)
uses string checks on runCaptureOpenshell output and can falsely report ready;
update the predicate in waitUntil to call kubectl with '-o json' (via
runCaptureOpenshell), JSON.parse the output, and require that every returned pod
has status.phase === "Running" and that every pod.status.containerStatuses
exists and every containerStatus.ready === true; ensure waitUntil returns false
on timeout and use that false value consistently at both call sites (the
corednsReady check and the duplicate site) to log a warning or abort as
appropriate rather than discarding the timeout result.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 67ec12f5-c28b-43bc-b699-7eb1fda0245c
📒 Files selected for processing (1)
src/lib/onboard.ts
|
✨ Thanks for submitting this pull request that proposes a way to refactor the NemoClaw CLI by replacing arbitrary sleep delays with readiness probes. This change improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems. |
bf098d1 to
c0b43c2
Compare
76c9220 to
c0b43c2
Compare
…recover modules (NVIDIA#2398) ## Summary Extract the scattered dashboard delivery chain logic — URL derivation, CORS origins, port forwarding target, health probing, and recovery — from the monolith files (onboard.ts, nemoclaw.ts) into three focused modules that serve as the single source of truth for dashboard reachability. Fixes NVIDIA#2342 false "Offline" reading when device auth is enabled. ## Related Issue Fixes NVIDIA#2390 Fixes NVIDIA#2342 ## Changes - **New `src/lib/dashboard-contract.ts`**: Pure `buildChain()` function — single source of truth for all dashboard config (access URL, CORS origins, forward target, health endpoint, port, bind address). No `process.env` reads. - **New `src/lib/dashboard-health.ts`**: `verifyDashboardChain()` checks all 3 delivery chain links (gateway, forward, CORS) with injected deps. Accepts HTTP 200 and 401 as "alive" (fixes NVIDIA#2342). - **New `src/lib/dashboard-recover.ts`**: `recoverDashboardChain()` — link-aware, idempotent recovery that only fixes broken links (gateway → forward, CORS diagnose-only). - **Deleted `src/lib/dashboard.ts`** — replaced by dashboard-contract.ts. - **Deleted 7 dashboard helper functions from `src/lib/onboard.ts`** (~90 lines removed): `getDashboardForwardPort`, `getDashboardForwardTarget`, `getDashboardForwardStartCommand`, `buildAuthenticatedDashboardUrl`, `getWslHostAddress`, `getDashboardAccessInfo`, `getDashboardGuidanceLines`. - **Rewritten `ensureDashboardForward`** in onboard.ts to use `buildChain()` instead of re-deriving config. - **Fixed `isSandboxGatewayRunning()`** in nemoclaw.ts to probe `/health` and accept 401 as alive. - **Deleted `recoverSandboxProcesses()` and `ensureSandboxPortForward()`** from nemoclaw.ts — `checkAndRecoverSandboxProcesses()` delegates to `recoverDashboardChain()`. - **Updated `test/onboard.test.ts`** — imports from new modules, assertions rewritten for contract API. - **52 new tests** across 3 test files (32 contract + 13 health + 7 recovery). ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved dashboard health checks with automated, link-aware recovery and better WSL URL/forwarding support. * **Bug Fixes** * More accurate gateway liveness detection and more reliable port/forwarding behavior across loopback, IPv6, and remote hosts. * **Tests** * Added comprehensive test suites covering dashboard configuration, health verification, and recovery flows. * **Refactor** * Consolidated dashboard URL/forwarding logic into a dedicated contract for clearer, consistent behavior. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
…VIDIA#2343) (NVIDIA#2408) ## Summary Follow-up to NVIDIA#2356. The adversarial PR review found that `procps` tools (`ps`, `top`, `free`, `uptime`, `vmstat`) were still missing inside a real sandbox runtime even after NVIDIA#2356 added `procps` to `Dockerfile.base`. Root cause: the GHCR base image may not have been rebuilt yet, and the `Dockerfile` hardening step (`apt-get autoremove --purge`) could strip it. This PR adds a three-layer defense in the production `Dockerfile`: - **`apt-mark manual procps`** before the autoremove step, so apt never considers it auto-removable - **Conditional `apt-get install procps`** if `ps` is still missing after hardening (handles stale GHCR base images that predate NVIDIA#2356) - **Build-time `ps --version`** smoke test that fails the Docker build if procps didn't make it through Plus regression tests: - Static guards in `sandbox-provisioning.test.ts` verifying both Dockerfiles contain procps provisions - Runtime E2E test in `e2e-test.sh` verifying all five procps tools are executable inside the sandbox container ## Test plan - [ ] `hadolint Dockerfile` passes - [ ] `hadolint Dockerfile.base` passes - [ ] `vitest run test/sandbox-provisioning.test.ts` passes - [ ] CI `test-e2e-sandbox` job passes (runs `e2e-test.sh` inside container, including new test 11) - [ ] Docker build succeeds with both fresh base image and stale GHCR base 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added end-to-end checks that verify common system debug tools (ps, top, free, uptime, vmstat) run in the sandbox. * Added regression tests ensuring Docker provisioning includes the debug tooling. * **Chores** * Improved Docker provisioning to ensure procps/debug tools are present at runtime, with a fallback install path for older base images. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Julie Yaunches <jyaunches@nvidia.com>
This PR completes the repo-wide type-safety cleanup by removing the remaining `// @ts-nocheck` markers and tightening broad escape hatches across the CLI, plugin, blueprint, scripts, and tests. It replaces `any`/`unknown`/assertion- heavy paths with narrower local domains while keeping the full correctness gate green. - remove all remaining `// @ts-nocheck` markers and retire explicit `any`, `Record<string, unknown>`, type assertions, and non-null assertions that were still inflating the hotspot report - tighten high-churn CLI/onboarding paths in `src/lib/onboard.ts`, `src/lib/onboard-session.ts`, `src/lib/runner.ts`, `src/lib/credentials.ts`, `src/lib/deploy.ts`, and related helpers with narrower persisted-session, errno, config, and subprocess types - tighten plugin and blueprint parsing paths in `nemoclaw/src/blueprint/runner.ts`, `snapshot.ts`, `state.ts`, `onboard/config.ts`, and `commands/migration-state.ts` with concrete source shapes around YAML and JSON boundaries - update scripts and tests, including `scripts/type-safety-hotspots.ts`, `scripts/bump-version.ts`, and many `test/*.ts` files, so the stricter typing is exercised without weakening lint, TypeScript, or test gates - reduce the tracked type-safety hotspots from `59 @ts-nocheck / 22 explicit any / 83 record unknown / 229 type assertions / 50 non-null assertions / 251 unknowns` to `0 / 0 / 0 / 0 / 0 / 32` - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) - [x] AI-assisted — tool: pi coding agent --- Signed-off-by: Carlos Villela <cvillela@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> * **Improvements** * Stronger runtime validation and safer parsing across configs, snapshots, and state—reduces crashes on corrupt or malformed files and yields clearer error messages. * Tighter input handling for plugins, onboarding, and runtime commands to improve reliability and predictability. * **Bug Fixes** * More robust network/proxy and TLS handling during websocket/https upgrades to avoid misrouted requests and fallback safely. * **Tests** * Expanded and hardened test coverage to prevent regressions and improve stability. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
) (NVIDIA#2446) ## Summary Create a typed command registry (`src/lib/command-registry.ts`) as the single source of truth for all 46 NemoClaw CLI commands. Derive help output, dispatch arrays, and docs-parity checking from it, replacing three independent hand-maintained sources. Re-enable the `cloud-experimental-e2e` nightly job now that the check-docs parity check passes reliably. ## Related Issue Fixes NVIDIA#2388 ## Changes - **New `src/lib/command-registry.ts`**: Defines `CommandDef` interface, `CommandGroup` union type, and `COMMANDS` array with all 46 CLI commands (23 global + 23 sandbox). Exports helper functions: `globalCommands`, `sandboxCommands`, `visibleCommands`, `commandsByGroup`, `canonicalUsageList`, `globalCommandTokens`, `sandboxActionTokens`. - **New `src/lib/command-registry.test.ts`**: 21 unit tests covering all registry invariants — counts, deduplication, ordering, token extraction, group membership. - **Rewritten `help()` in `src/nemoclaw.ts`**: Now iterates `commandsByGroup()` from registry instead of hand-written console.log lines. Reconfiguration section rewritten with bullet points to prevent phantom `nemoclaw`-prefixed parser matches. - **Derived dispatch arrays**: `GLOBAL_COMMANDS` now calls `globalCommandTokens()` and `sandboxActions` now calls `sandboxActionTokens()` — no more hand-maintained sets/arrays. - **Added `--dump-commands` internal flag**: Prints canonical command list for check-docs.sh parity checks, replacing the fragile perl help-parsing pipeline. - **Simplified `check-docs.sh`**: Phase 1 now uses `--dump-commands` instead of perl regex extraction. Phase 2 strips `<arg>` and `[optional]` placeholders from doc headings for clean comparison. - **Added missing doc headings**: `nemoclaw onboard --from`, `nemoclaw setup`, `nemoclaw start`, `nemoclaw stop` in `docs/reference/commands.md`. - **Re-enabled `cloud-experimental-e2e`**: Removed DISABLED comment, `CLOUD_EXPERIMENTAL_E2E_ENABLED` variable gate, and `SKIP_CHECK_DOCS` flag. - **Updated `test/image-cleanup.test.ts`**: Adapted to verify `gc` via registry source instead of regex-matching the old `GLOBAL_COMMANDS` Set literal. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [x] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [ ] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code (pi agent) --- Signed-off-by: Julie Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added `nemoclaw onboard --from <Dockerfile>` usage guide and documented deprecated aliases (`nemoclaw start`, `nemoclaw stop`, `nemoclaw setup`) with warnings. * **New Features** * Added a developer-facing `--dump-commands` option to emit the canonical command list. * **Improvements** * Overhauled help/command organization for clearer grouping and more accurate, stable command listings. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Carlos Villela <cvillela@nvidia.com> Co-authored-by: Carlos Villela <cvillela@nvidia.com>
) Add tokenFormat/tokenFormatHint and appTokenFormat/appTokenFormatHint to ChannelDef interface. Wire format checks into setupMessagingChannels() so invalid tokens are rejected at the prompt instead of silently saved and surfacing as cryptic auth failures after sandbox build. - Slack bot token: ^xoxb-[A-Za-z0-9_-]+$ - Slack app token: ^xapp-[A-Za-z0-9_-]+$ - Non-interactive mode intentionally unchanged (env-var tokens assumed intentional) - Future channels can opt in by adding tokenFormat/appTokenFormat to their definition Closes NVIDIA#1912 Co-authored-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
…VIDIA#2447) ## Summary Follow-up to PR NVIDIA#2422 (`refactor(types): tighten repo-wide type boundaries`). Addresses the review warnings and suggestions from triage. ## Changes ### 1. Shared `isErrnoException` / `isPermissionError` — `src/lib/errno.ts` PR NVIDIA#2422 removed all `@ts-nocheck` markers, which required each module to define its own `isErrnoException` + `ErrnoLike` type locally. This resulted in 6 slightly different copies across `config-io`, `credentials`, `http-probe`, `onboard`, `onboard-session`, and `registry`. This PR extracts a single canonical version into `src/lib/errno.ts` that: - Accepts `unknown` so callers never need an `instanceof Error` pre-cast - Checks for both `code` and `errno` properties (covering all Node.js error shapes) - Includes a convenience `isPermissionError` for the common EACCES/EPERM guard This also simplifies many catch blocks that previously did `error instanceof Error && isErrnoException(error)` — the redundant `instanceof Error` check is now handled inside the shared function. ### 2. Shared JSON recursive types — `src/lib/json-types.ts` PR NVIDIA#2422 introduced identical Scalar/Value/Object type triples in multiple modules: - `LooseScalar` / `LooseValue` / `LooseObject` in `onboard.ts` and `agent-onboard.ts` - `PolicyScalar` / `PolicyValue` / `PolicyObject` in `policies.ts` - `SessionJsonPrimitive` / `SessionJsonValue` / `UnknownRecord` in `onboard-session.ts` This PR extracts canonical `JsonScalar` / `JsonValue` / `JsonObject` types into `src/lib/json-types.ts`. Each module re-aliases them under its domain names for readability. > **Note:** The plugin types (`PluginScalar`/`PluginValue`/`PluginRecord` in `nemoclaw/src/index.ts`) are intentionally kept in-place because the plugin compiles separately from the CLI and cannot share imports. ### 3. `Reflect.get` / `Reflect.set` convention comment Adds a clarifying comment at the first usage site (`usage-notice.ts`) explaining why `Reflect.get` is preferred over `as Record<…>` casts throughout the codebase — it avoids widening the target type and satisfies `no-unsafe-member-access` lint rules. ### 4. `ws-proxy-fix.ts` behavioral guard documentation Documents the `!host` null guard in the Discord WebSocket tunnel path, explaining that the previous code would have attempted a tunnel with an undefined host (which would fail downstream anyway). The guard now explicitly falls through to the original `https.request` unchanged. ## Type of Change - [x] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `tsc -p tsconfig.cli.json --noEmit` passes - [x] `npm test` passes (only pre-existing failures in install-preflight and ssrf-parity) - [x] Tests added for new `errno.ts` module (15 test cases) - [x] No secrets, API keys, or credentials committed - [ ] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) ## AI Disclosure - [x] AI-assisted — tool: pi coding agent --- Signed-off-by: Jess Yaunches <jyaunches@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved WebSocket proxy handling to avoid creating tunnels when the request target host is undefined. * **Tests** * Added tests validating errno/permission detection utilities. * **Documentation** * Added explanatory comment clarifying property-access rationale. * **Chores** * Centralized JSON-type aliases and errno helpers for consistent error handling across modules. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jess Yaunches <jyaunches@nvidia.com>
…VIDIA#2458) ## Summary Tip of `main` currently fails `npm run typecheck:cli` with 8 errors in `test/onboard.test.ts` — patterns introduced by NVIDIA#2130 (Slack token validation tests) don't satisfy the `strict: true` setting introduced by NVIDIA#2422. Every PR opened against main inherits these errors, which makes the `checks` CI job red on unrelated PRs (e.g. NVIDIA#2454). ## Related Follows from NVIDIA#2130 + NVIDIA#2422 interaction. No linked issue — local-only breakage caught by CI on downstream PRs. ## Changes `test/onboard.test.ts` — 8 annotations, zero behavior change: - **4× `.pop()!`** — `Array.pop()` returns `string | undefined`. Each site is immediately preceded by an `assert.equal(result.status, 0)` that guarantees non-empty stdout, so the non-null assertion is safe. - **3× `(c: { key: string }) => c.key === …`** — `saveCalls` entries only read `.key`. - **1× `(c: { name: string }) => c.name === …`** — `MESSAGING_CHANNELS` entry lookup only reads `.name`. The tests themselves still exercise the same cases; these changes are purely to make the TypeScript compiler happy. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx tsc -p tsconfig.cli.json` — exits 0 (was 2 with 8 errors) - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Tests * Improved reliability of JSON payload extraction in onboarding tests with defensive null-checks before parsing. * Enhanced type safety in test callbacks and iterators with explicit TypeScript type annotations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary Prevents the class of regression that required NVIDIA#2458. Adds an unconditional `typecheck:cli` step to the shared CI action so future hook-filter drift can't hide a tsc error. ## Related Follow-up to NVIDIA#2458 (which fixed the symptom — this addresses the root cause). ## Root cause recap - `.pre-commit-config.yaml` had `files: ^(bin|scripts)/` on the `tsc-cli` hook. - PRs that touched only `test/` or `src/` never triggered the check. - NVIDIA#2130 (test-only Slack validation) slipped through that gap. - NVIDIA#2422 later turned on `strict: true`, and the latent errors surfaced — but for every downstream PR, not the PR that introduced them. - Six open PRs ended up blocked on the same error cluster before anyone noticed. ## Changes `.github/actions/basic-checks/action.yaml` — new **Typecheck CLI + tests (strict)** step running `npm run typecheck:cli` unconditionally. Independent of prek hook configuration, so future filter drift can't hide a tsc error from CI. ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] YAML parses - [x] `npm run typecheck:cli` exits 0 on this branch (requires NVIDIA#2458 merged to main first) - [x] Tests added or updated for new or changed behavior (N/A — infra change, existing suite covers) - [x] No secrets, API keys, or credentials committed ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Enhanced continuous integration pipeline with additional TypeScript typechecking to improve code quality assurance and detect type-related issues earlier in the development process. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Charan Jagwani <cjagwani@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NVIDIA#2436) ## Summary - Adds `lstatSync`-based symlink rejection to `createSnapshot()` and `rollbackFromSnapshot()` in `nemoclaw/src/blueprint/snapshot.ts`, closing two attack surfaces reported by an external contributor doing Yocto integration: 1. **`createSnapshot()`** calls `cpSync(OPENCLAW_DIR, dest)` with no `lstatSync` check on the source — if `~/.openclaw` is a symlink (e.g. to `/etc`), `cpSync` follows it and copies arbitrary files into the snapshot 2. **`rollbackFromSnapshot()`** calls `cpSync(source, OPENCLAW_DIR)` without verifying the destination isn't a symlink — an attacker-planted symlink could redirect restored content to an arbitrary directory - Updates `collectFiles()` to detect symlinks via `isSymbolicLink()` and record them in the snapshot manifest (previously symlinks were silently omitted) - Follows the `rejectSymlinksOnPath()` pattern established in PR NVIDIA#2290 for the `~/.nemoclaw` credentials directory - **Also guards the CLI code path** (`src/lib/sandbox-state.ts::backupSandboxState()`) that `nemoclaw <sandbox> snapshot create` actually uses — the blueprint helper is a separate implementation not called from the CLI - `rollbackFromSnapshot()` symlink rejection returns `false` (preserving the boolean-return contract) rather than throwing ### Severity Low-Medium. Both attacks require the attacker to already have user-level write access to `$HOME` to plant the symlink, and the sandbox's Landlock / `chattr +i` / read-only mount protections prevent in-sandbox exploitation. However, this is the same symlink attack class fixed in NVIDIA#2290 (different code paths: `cpSync` + directory walker vs. `mkdirSync`), and defense-in-depth warrants closing the gap. ## Test plan - [x] New test: `createSnapshot` rejects when `~/.openclaw` is a symlink - [x] New test: `createSnapshot` rejects when an ancestor of `~/.nemoclaw` is a symlink - [x] New test: `createSnapshot` records symlinks in manifest when present in tree - [x] New test: `rollbackFromSnapshot` returns false (not throws) when `~/.openclaw` is a symlink - [x] All existing snapshot tests pass (no regressions) - [x] Full plugin test suite: 362 passed - [x] CLI test suite: 1959/1961 passed (2 pre-existing flaky installer timeouts) - [x] TypeScript build clean (`tsc --noEmit`) - [x] All pre-commit and pre-push linters/formatters pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Snapshots now detect and record symbolic links in the snapshot manifest and refuse snapshot creation or restoration when critical config directories or their ancestor paths are symlinks. * Backup destination paths are validated to prevent writing into locations with symlinked ancestor paths. * **Tests** * Added tests covering symlink detection, manifest recording, rejection scenarios, and rollback behavior when config paths are symlinks. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…VIDIA#2440) ## Summary - When `CHAT_UI_URL` specifies a custom port, `NEMOCLAW_DASHBOARD_PORT` was only injected into the sandbox if the host env var was explicitly set, causing the gateway to listen on the wrong port - Always derive the effective port from `chatUiUrl` via `getDashboardForwardPort()` and inject it unconditionally - Bump vitest timeouts on flaky openshell gateway tests Closes NVIDIA#2267 ## Test plan - [x] All 2302 existing tests pass - [ ] Verify sandbox starts with correct `NEMOCLAW_DASHBOARD_PORT` when `CHAT_UI_URL` uses a non-default port (e.g. `:18790`) - [ ] Verify default port (`18789`) still works when `CHAT_UI_URL` is unset 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Dashboard forwarding now consistently derives and injects the dashboard port from CHAT_UI_URL (including custom ports), preventing mismatches that could block dashboard access. * **New Features** * Deterministic dashboard port derivation from CHAT_UI_URL for sandbox startup and health checks. * **Documentation** * Updated onboarding, quickstart, reference, and troubleshooting guides and examples to prefer CHAT_UI_URL with auto-derived ports. * **Tests** * Added/adjusted tests and extended timeouts to validate port injection and improve CLI/onboard reliability. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Ubuntu <ubuntu@nemoclaw-f6a691-inst-3cs11f3ytmmzubymi5nzdncnwia.us-west1-a.c.brevdevprod.internal> Co-authored-by: Aaron Erickson 🦞 <aerickson@nvidia.com>
…DIA#2445) ## Summary Reorganize the Get Started section, document the full runtime topology in the Architecture reference, and clarify how to uninstall NemoClaw. Prerequisites now live on their own page (with the tested-platform matrix included directly from the README), Windows pre-setup is renamed to `windows-preparation.md`, and DGX Spark readers are pointed to the NVIDIA Spark playbook rather than an in-repo tutorial. ## Changes **Get Started reorganization** - New `docs/get-started/prerequisites.md`. Extracts Hardware, Software, and the tested-platform matrix from Quickstart into its own page. The matrix is pulled in via Sphinx `{include}` from the README so there is a single source of truth. - `docs/get-started/windows-setup.md` → `docs/get-started/windows-preparation.md` (renamed to "Prepare Windows for NemoClaw"). Nested under Prerequisites in the toctree. - `docs/get-started/quickstart.md` no longer duplicates the Prerequisites section; it links to the new page instead. - `docs/index.md` Get Started toctree is now `Prerequisites` then `Quickstart`. - `docs/conf.py` redirects `get-started/windows-setup` → `windows-preparation.html` so the pre-reorg URL still resolves. **Architecture reference: Deployment Topology** - New `## Deployment Topology` section in `docs/reference/architecture.md` with a mermaid diagram showing the runtime substrate — host CLI → Docker daemon → OpenShell gateway container → embedded k3s cluster → sandbox pod → OpenClaw agent + plugin — plus the inference router's role on the path to external providers. - Per-layer role table (host CLI, Docker daemon, gateway container, k3s, sandbox pod, inference router). - Points DGX Spark readers to the [NVIDIA Spark playbook](https://build.nvidia.com/spark/nemoclaw) for the Spark-specific variant. **Uninstall documentation** - `docs/reference/commands.md` `nemoclaw uninstall` section corrects a drift bug (the CLI does *not* auto-fall-back to a remote script; it prints the versioned URL for manual review) and adds a comparison table between `nemoclaw uninstall` and the hosted `curl … | bash` form. - README and Quickstart Uninstall sections now lead with `nemoclaw uninstall` (version-pinned, no network fetch) and keep the `curl … | bash` form as a fallback for when the CLI is missing or broken. Both link to the comparison table. **Troubleshooting reference** - New `## DGX Spark` section in `docs/reference/troubleshooting.md` covering CoreDNS CrashLoop, `k3s` image-pull failure after a local build, GPU passthrough (not CI-tested on Spark), `pip install` system-packages errors, and the AI Workbench port 3000 conflict. - Section header points to the Spark playbook for end-to-end walkthroughs. - Windows cross-links retargeted to `../get-started/windows-preparation.md`. **Platform matrix source of truth** - `ci/platform-matrix.json` DGX Spark notes now link the Spark playbook. README matrix regenerated. - README Documentation table gains a Prerequisites row and updates the pointer below the matrix to point at Prerequisites instead of a now-removed hub. - `scripts/generate-platform-docs.py` drops `docs/get-started/quickstart.md` from its patch targets — the matrix moved to `prerequisites.md` and is pulled via `{include}` from README, so only README needs patching. - `.pre-commit-config.yaml` `platform-matrix-sync` hook updated to match (removed the quickstart path from both `git add` and `files`). **Agent skill generator (`scripts/docs-to-skills.py`)** Goal: skill files are self-contained. They never reference repository paths like `../../../docs/…`, and they never couple the skill tree to the repo layout. - Rewrite precedence in `rewrite_doc_paths`: 1. If the link target maps to a generated skill, emit ``text (use the `<skill>` skill)``. 2. Otherwise, if the target is a page inside `docs/`, emit an absolute `[text](<html_baseurl><page>.html#frag)`. 3. Otherwise (target outside `docs/`, or no base URL), strip the hyperlink and keep the link text. Self-containment wins over navigability in the fallback. - New `load_html_baseurl(docs_dir)` reads the `html_baseurl` assignment from Sphinx's `conf.py` via `ast` (no `exec` side effects), so the script is project-agnostic: any docs tree with that assignment works. A warning is printed (and the script continues) if the base URL is missing. - Include placeholders (`> *Content included from … — see the original doc for full text.*`) follow the same precedence: emit a published URL if available, else drop the breadcrumb since the content is already inlined. - Fragment anchors are now preserved across the rewrite (the pre-existing rewrite silently stripped them). - Scope the `index.md` exclusion to the top-level `docs/index.md` only. Subdirectory index pages (for example `docs/get-started/platform-setup/index.md`) are legitimate hub content and can now participate in skill grouping — which is what makes rule 1 above kick in for links to those hubs. - Because emitted content no longer depends on where it is written, the earlier `output_file` parameter plumbing and the multi-output-dir invariant guard are gone. `output_dirs` can be any mirrors (symlinks, bind mounts, independent trees) without caveat. **Repo-root stub** - `spark-install.md` rewritten as a short redirect paragraph pointing to the NVIDIA Spark playbook and the Quickstart. **Agent skills** - `.agents/skills/` regenerated from the new doc tree. `nemoclaw-user-get-started` now covers `prerequisites.md` + `windows-preparation.md`; `nemoclaw-user-reference` picks up the revised architecture, commands, and troubleshooting pages. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) The only non-doc code changes are to the skill-generator script (`scripts/docs-to-skills.py`), the platform-matrix-generator target list (`scripts/generate-platform-docs.py`), and the pre-commit hook entry. Everything else is docs or generated docs. ## Verification - [x] `check-docs.sh --only-links` phase 1 (local links) passes. Phase 2 flags `prerequisites.html` as "unreachable" because it is a new page not yet published; it will resolve after this PR merges and the docs site rebuilds. - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [x] New doc pages include SPDX header and frontmatter (new pages only) Note on unchecked boxes: prek's CLI vitest suite surfaces pre-existing environmental failures on my machine (`nemoclaw/dist/blueprint/private-networks.js` missing and integration tests that try to reach a live gateway). No source file under `src/`, `nemoclaw/src/`, `bin/`, or `test/` is touched by this diff, so those failures are unrelated. `make docs` wasn't run locally. ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a dedicated Prerequisites page with minimum/recommended hardware, software versions, and platform notes. * Added DGX Spark troubleshooting and an end-to-end local Ollama walkthrough reference. * **Documentation** * Reorganized Get Started to surface Prerequisites and updated Windows preparation wording. * Reworked uninstall guidance to prefer the CLI `nemoclaw uninstall` (local, version‑pinned) and document a hosted-script fallback and comparison. * Replaced a lengthy Spark install guide with an external playbook pointer. * **Reference** * Expanded architecture and commands docs to clarify deployment topology, credential flow, and uninstall behavior. * **Chores** * Updated docs generation/redirect tooling to emit published links and target platform docs appropriately. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary The squash merge of NVIDIA#2351 skipped pre-commit hooks, leaving shfmt formatting violations that broke the `main` workflow checks. This applies `shfmt -w` to fix them. ## Test plan - [x] `shfmt -d` produces no diff on both files <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added documentation for `nemoclaw onboard --from <Dockerfile>` option for building sandbox images from custom Dockerfiles. * Added documentation for deprecated CLI aliases `nemoclaw start`/`nemoclaw stop` (use `nemoclaw tunnel start`/`nemoclaw tunnel stop` instead). * Added documentation for deprecated `nemoclaw setup` command (use `nemoclaw onboard` instead). <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…NVIDIA#2435) ## Summary Tests in `install-preflight.test.ts` and `install-openshell-version-check.test.ts` flake persistently under the default 5s vitest timeout. Each test spawns `bash install.sh` — these are integration tests, not unit tests, and belong in CI. ## Changes - Set `{ timeout: 15_000 }` on the `describe` block in both test files (keeps them passing when run explicitly) - Add a new `installer-integration` vitest project enabled only by `CI=1` or `NEMOCLAW_RUN_INSTALLER_TESTS=1` - Exclude the two files from the `cli` project so pre-push hooks skip them Run explicitly: `npx vitest run --project installer-integration` ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npx prek run --all-files` passes - [x] Both test files pass: 65/65 tests - [x] No secrets, API keys, or credentials committed --- Signed-off-by: Aaron Erickson <aerickson@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Added explicit 15s timeouts to installer test suites to reduce flaky runs. * Moved installer-specific tests into a separate CI-gated project so they run only in CI or when explicitly enabled, keeping regular test runs lean. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary Catches up the user-facing docs for changes merged since v0.0.24 and bumps `docs/versions1.json` so the next docs build publishes as 0.0.25. ## Changes - `docs/reference/troubleshooting.md`: corrected the JetPack 7 (L4T 39.x) entry to reflect the new conditional `br_netfilter` auto-load on R39 hosts that lack it (from NVIDIA#2419), and added a new "DNS resolution from inside docker fails (corporate firewall)" section covering the new container-DNS preflight probe and platform-specific remediation paths (from NVIDIA#2349). - `docs/get-started/quickstart.md`: added a "Review the Configuration Before the Sandbox Build" subsection documenting the new `[Y/n]` review gate that fires after the sandbox-name prompt (from NVIDIA#2221). - `docs/versions1.json`: promoted 0.0.25 to `preferred: true`, demoted 0.0.24, and trimmed to ten entries to match what `scripts/bump-version.ts` writes. `docs/project.json` regenerates from `versions1.json` at build time and now reports `0.0.25`. - `.agents/skills/nemoclaw-user-get-started/` and `.agents/skills/nemoclaw-user-reference/`: regenerated via `scripts/docs-to-skills.py`. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [x] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes - [x] `npm test` passes - [x] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Onboarding now shows a configuration review and requires explicit confirmation before building a sandbox; non-interactive mode prints the summary without prompting. * **Documentation** * Expanded Jetson troubleshooting for JetPack 7 / L4T 39.x and conditional br_netfilter guidance. * Added corporate-firewall DNS troubleshooting with platform-specific remediation and a verification nslookup step. * Quickstart updated to document the new confirmation flow. * **Chores** * Documentation version index updated. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…VIDIA#2471) ## Summary - Reverts the `nemoclaw.ts` changes from NVIDIA#2398 that caused `nemoclaw status` to hang indefinitely - Restores the original `isSandboxGatewayRunning`, `recoverSandboxProcesses`, `ensureSandboxPortForward`, and `checkAndRecoverSandboxProcesses` implementations - Leaves the dashboard-contract/health/recover modules in place (onboard.ts depends on them) ## Root cause PR NVIDIA#2398 replaced the simple gateway recovery path with `recoverDashboardChain()` which introduced unbounded calls that hang in CI. The bisect confirmed NVIDIA#2398 as the sole culprit: | Run | Commit | Result | |-----|--------|--------| | Last good | `de97a00d` (Apr 24 16:06) | pass | | Bisect 4 | `9fbfbaca` (NVIDIA#2398 only) | **hang** | | First bad | `79c8e2a9` (Apr 25 00:10) | hang | sandbox-survival, skip-permissions, sandbox-operations, and cloud-e2e all hang at `nemoclaw <name> status` after NVIDIA#2398. ## What this reopens - NVIDIA#2342 (false "Offline" with device auth) — the old `curl -sf /` probe returns failure on 401. This needs a targeted fix, not a refactor. - NVIDIA#2390 — dashboard delivery chain consolidation goal. Can be re-attempted with proper E2E validation. ## Test plan - [ ] `npx tsc -p tsconfig.src.json --noEmit` passes - [ ] Dashboard unit tests pass - [ ] Nightly E2E: sandbox-survival, skip-permissions, sandbox-operations, cloud-e2e pass <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved gateway health check reliability by using explicit status outputs instead of HTTP code interpretation. * Enhanced automatic recovery mechanism for sandbox gateway outages with better restart and port forwarding restoration. * Updated diagnostic logging to better reflect gateway restart and connectivity restoration activities. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…IA#2437) <!-- markdownlint-disable MD041 --> ## Summary The installer used to auto-attach `--resume` to `nemoclaw onboard` whenever a non-complete session file existed on disk. A run that failed at step 3 or 4 (for example, Ollama proxy unreachable) writes `status="failed"` with `resumable=true`, which matched that predicate, so re-running `curl | bash` silently replayed the same failed choice. The user had no way to pick a different provider short of deleting `~/.nemoclaw/onboard-session.json`. Fixed by tightening the installer's session classifier and adding an end-to-end `--fresh` opt-in across both the installer and the `nemoclaw onboard` CLI. ## Related Issue Resolves NVIDIA#2430. ## Changes **Installer (`scripts/install.sh`)** The install script now classifies the on-disk session into one of four states and branches on it instead of treating "non-complete" as one bucket: - `resume` — `status === "in_progress"` only. Auto-attaches `--resume` (the legitimate Ctrl-C / interrupted-mid-step case). - `failed` — `status === "failed"` or a populated `failure` field. In a TTY, prompts `[R/f]`; in non-interactive mode, refuses with a non-zero exit so scripted re-runs cannot loop. The refusal message points at `--fresh` and `nemoclaw onboard --resume` as the two explicit opt-ins. - `skip` — complete / `resumable: false` / missing. No `--resume` attached. - `corrupt` — file exists but cannot be parsed, or status does not match any value `onboard-session.ts` actually writes. Warns and starts fresh rather than auto-resuming an unreadable file. A new `--fresh` flag (and `NEMOCLAW_FRESH=1` env var) short-circuits the classifier entirely and is forwarded to the downstream CLI invocation. **CLI (`nemoclaw onboard --fresh`)** `onboard-command.ts` now accepts `--fresh`, exposed through the parser, the help text, and the `OnboardCommandOptions` type. When set, `onboard.ts` calls `onboardSession.clearSession()` before `createSession + saveSession`, so the previous session file is unlinked outright rather than overwritten in place. `--fresh` and `--resume` are mutually exclusive, validated at both the parser and the runtime entry point. **Docs** New troubleshooting entry under *Onboarding* (`Previous onboarding session failed`) explaining the interactive / non-interactive / `--fresh` / `nemoclaw onboard --resume` paths and the `~/.nemoclaw/onboard-session.json` last-resort delete. **Tests** - `test/install-preflight.test.ts`: two new cases — non-interactive refusal of a failed session, and `--fresh` short-circuiting auto-resume + forwarding to the CLI. - `src/lib/onboard-command.test.ts`: `--fresh` parsing and the `--resume` / `--fresh` mutual-exclusion error. ## Type of Change - [X] Code change (feature, bug fix, or refactor) - [X] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [ ] Doc only (includes code sample changes) ## Verification <!-- Check each item you ran and confirmed. Leave unchecked items you skipped. --> - [X] `npx prek run --all-files` passes - [X] `npm test` passes - [X] Tests added or updated for new or changed behavior - [X] No secrets, API keys, or credentials committed - [X] Docs updated for user-facing behavior changes - [X] `make docs` builds without warnings (doc changes only) - [X] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) ## AI Disclosure <!-- If an AI agent authored or co-authored this PR, check the box and name the tool. Remove this section for fully human-authored PRs. --> - [X] AI-assisted — tool: Claude Code<!-- e.g., Claude Code, Cursor, GitHub Copilot --> --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Tinson Lai <tinsonl@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a `--fresh` option to start onboarding without resuming previous sessions; onboarding now avoids silent automatic retries after failures. * Interactive runs prompt to resume (`R`/Enter) or discard (`f`) a recorded failed session; non-interactive runs refuse to retry and exit non-zero unless explicitly opted in. * **Documentation** * Help text and docs updated to explain session persistence, recovery options, and how to use `--fresh` or `--resume`. * **Tests** * Added integration and unit tests covering fresh/resume behavior and mutual-exclusion of options. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
c0b43c2 to
9507a85
Compare
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Summary
Replaced arbitrary
sleep()delays with deterministic readiness probes (waitForPort,waitForHttp,waitUntil) across the NemoClaw CLI onboarding flow. This improves performance by proceeding as soon as readiness is achieved and prevents race conditions on slower systems.Refs #2001
Changes
waitForPort,waitForHttp,waitUntil) tosrc/lib/wait.ts.sleep(1)withwaitForPort(OLLAMA_PROXY_PORT, 2)during proxy initialization.sleep(2)withwaitForHttp(http://127.0.0.1:${OLLAMA_PORT}/, 10)when starting the Ollama server.sleep(5)withwaitUntilpolling CoreDNS readiness viakubectl get pods.sleep(2)calls withwaitUntilfor applying and removing policy presets.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)AI Disclosure
Signed-off-by: Krish Sapru ksapru@bu.edu
Summary by CodeRabbit
New Features
Bug Fixes
Tests